Since the Luminous release, BlueStore is the new default backend for Ceph OSDs. Previous releases used the FileStore backend, which stores Ceph objects as files on an underlying file system, usually XFS. With BlueStore, the Ceph developers decided not to rely on any OS-level file system and instead created their own Ceph-only storage layer, which includes a minimal internal file system called BlueFS.
So what is the benefit of making such a radical change?
First of all, the Ceph developers state that it is at least twice as fast as the traditional FileStore, with lower latency. So how did they achieve that?
- For large writes FileStore does double writes: one write to the journal and a second one to the actual storage. Most deployments put the journal on SSD drives, but in most cases that does not save much. BlueStore avoids this double write for large writes, and for small sequential writes the gain is even bigger.
- For key/value data, as in some RGW workloads, performance may improve by up to 3x.
- The throughput collapse seen in FileStore on clusters filled with lots of data, caused by directory splitting, is also solved with BlueStore.
- Small sequential reads via librados are a little slower with BlueStore, but this is mostly a benchmark-only use case, so most real-life users should not notice it. BlueStore doesn’t implement its own readahead because everything on top of RADOS has its own, so sequential reads via higher-level interfaces (RBD and CephFS) are generally much better.
- BlueStore is copy-on-write, so performance with snapshots is much better.
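The double-write point from the first bullet can be made concrete with a toy model. This is only a rough sketch and not Ceph code: the function names and the 64 KiB cutoff for "small" writes are assumptions chosen for illustration, and metadata overhead is ignored.

```python
# Toy model of how many bytes reach the devices for one client write.
# Assumptions (not taken from Ceph source): metadata overhead is ignored and
# DEFERRED_THRESHOLD is an illustrative value, not a documented default.

DEFERRED_THRESHOLD = 64 * 1024  # hypothetical cutoff for "small" writes


def filestore_bytes_written(payload: int) -> int:
    """FileStore journals the full payload, then writes it to the file system."""
    return 2 * payload


def bluestore_bytes_written(payload: int, threshold: int = DEFERRED_THRESHOLD) -> int:
    """BlueStore writes large payloads once; small ones pass through its journal."""
    if payload < threshold:
        return 2 * payload  # journaled first, applied to the device later
    return payload  # written straight to the block device


# Compare a 4 KiB write and a 4 MiB write under both models.
for size in (4 * 1024, 4 * 1024 * 1024):
    print(size, filestore_bytes_written(size), bluestore_bytes_written(size))
```

In this model the saving comes entirely from large writes, which no longer pass through a journal before reaching the data device.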
How does BlueStore work?
By giving up traditional file systems, and with them the burden of POSIX compliance, the Ceph developers wrote their own clean implementation of the ObjectStore interface that exactly matches Ceph workloads. BlueStore works directly with the underlying block device(s) and embeds the RocksDB key-value database to maintain internal metadata. BlueFS, an internal adapter, implements a minimal file-system-like interface that lets RocksDB store its "files" on it, and it shares the same raw device with the main BlueStore.
The biggest visible difference between FileStore and BlueStore servers is how disk partitions are created and mounted.
With FileStore you will see something similar to this:
/dev/sdb1 3.7T 853G 2.9T 23% /opt/rados/sdb1
/dev/sdc1 3.7T 937G 2.8T 26% /opt/rados/sdc1
/dev/sdd1 3.7T 806G 2.9T 22% /opt/rados/sdd1
Block devices are mounted to a directory and hold the data, which is quite classic and understandable.
With BlueStore the picture is radically different:
tmpfs 12G 48K 12G 1% /var/lib/ceph/osd/ceph-6
tmpfs 12G 48K 12G 1% /var/lib/ceph/osd/ceph-9
tmpfs 12G 48K 12G 1% /var/lib/ceph/osd/ceph-10
These are small tmpfs mounts rather than the data disks themselves, so their disk usage is really tiny. Inside each directory you will see symlinks named block and block.db which point to the real block devices that actually hold the data. If you use a separate fast device (for example a high-speed SSD), block.db points to it.
l /var/lib/ceph/osd/ceph-6
total 52K
drwxrwxrwt 2 ceph ceph 320 Jan 28 08:37 .
drwxr-xr-x 5 ceph ceph 4.0K Jan 28 08:37 ..
-rw-r--r-- 1 ceph ceph 420 Jan 28 08:37 activate.monmap
lrwxrwxrwx 1 ceph ceph 25 Jan 28 08:37 block -> /dev/ceph-block-0/block-0
lrwxrwxrwx 1 ceph ceph 19 Jan 28 08:37 block.db -> /dev/ceph-db-0/db-0
-rw-r--r-- 1 ceph ceph 2 Jan 28 08:37 bluefs
-rw-r--r-- 1 ceph ceph 37 Jan 28 08:37 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Jan 28 08:37 fsid
-rw------- 1 ceph ceph 55 Jan 28 08:37 keyring
-rw-r--r-- 1 ceph ceph 8 Jan 28 08:37 kv_backend
-rw-r--r-- 1 ceph ceph 21 Jan 28 08:37 magic
-rw-r--r-- 1 ceph ceph 4 Jan 28 08:37 mkfs_done
-rw-r--r-- 1 ceph ceph 41 Jan 28 08:37 osd_key
-rw-r--r-- 1 ceph ceph 6 Jan 28 08:37 ready
-rw-r--r-- 1 ceph ceph 10 Jan 28 08:37 type
-rw-r--r-- 1 ceph ceph 2 Jan 28 08:37 whoami
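The same information can be collected from a script. The following is a minimal sketch, assuming the directory layout shown in the listing above; the helper name describe_osd_dir is made up for this example, and block.wal is included on the assumption that it only appears when a separate WAL device is configured.

```python
import os


def describe_osd_dir(osd_dir: str) -> dict:
    """Return the backend type and the resolved targets of the block symlinks."""
    info = {}
    # The "type" file holds the backend name as plain text.
    type_path = os.path.join(osd_dir, "type")
    if os.path.isfile(type_path):
        with open(type_path) as fh:
            info["type"] = fh.read().strip()
    # block and block.db appear in the listing; block.wal is an assumption.
    for name in ("block", "block.db", "block.wal"):
        link = os.path.join(osd_dir, name)
        if os.path.islink(link):
            info[name] = os.path.realpath(link)
    return info


# Prints an empty dict on a host that has no such OSD directory.
print(describe_osd_dir("/var/lib/ceph/osd/ceph-6"))
```

Only the symlinks that exist are reported, so an OSD without a separate db device simply has no block.db entry in the result.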
BlueStore can run on a combination of slow and fast devices, similar to FileStore, but it makes much more effective use of the fast devices.
In FileStore, the journal device (often placed on a faster SSD) is only used for writes. In BlueStore, the internal journaling needed for consistency is much lighter-weight, usually behaving like a metadata journal and only journaling small writes when it is faster (or necessary) to do so. The rest of the fast device can be used to store (and retrieve) internal metadata.
BlueStore manages up to 3 devices:
- The required main device for object storage and usually metadata.
- An optional db device for storing as much RocksDB metadata as will fit. Whatever doesn’t fit will spill back onto the main device.
- An optional WAL device that stores just the internal journal (the RocksDB write-ahead log).
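The three device roles map onto the options of the ceph-volume tool. Below is a small sketch that only builds the command line and does not run it, assuming OSDs are provisioned with ceph-volume lvm create; the helper name and the device names are placeholders, so check the flags against the documentation for your Ceph release before using them.

```python
from typing import List, Optional


def ceph_volume_create_cmd(
    data: str, db: Optional[str] = None, wal: Optional[str] = None
) -> List[str]:
    """Build (but do not execute) a ceph-volume command for one BlueStore OSD."""
    # The main device is required.
    cmd = ["ceph-volume", "lvm", "create", "--bluestore", "--data", data]
    if db:
        cmd += ["--block.db", db]  # optional device for RocksDB metadata
    if wal:
        cmd += ["--block.wal", wal]  # optional device for the write-ahead log
    return cmd


# Placeholder device names, for illustration only.
print(" ".join(ceph_volume_create_cmd("/dev/sdb", db="ceph-db-0/db-0")))
```

Leaving out db and wal gives the single-device layout, where metadata and the journal live on the main device.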
Here is a great manual about configuring OSDs with multiple devices, combining solid-state and magnetic devices.
At OddEye we have created a small cluster with the following configuration:
- 3 OSD nodes: Xeon 1230v5 and 16GB RAM (rados000, rados001, rados002)
- 3x 4TB SATA disks per OSD node
- 1x 480GB SSD for journal(s)
Two of these nodes (rados000, rados001) were using FileStore, while the third one (rados002) uses BlueStore.
We filled the cluster with plenty of data and started to play with reads and writes using the rados binary.
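The exact rados invocations are not recorded here. One common way to generate this kind of load is the rados bench subcommand, and the sketch below builds such command lines under that assumption; the helper name, the pool name testpool, and the durations and thread count are illustrative values, not the ones used in this test.

```python
from typing import List


def rados_bench_cmd(pool: str, seconds: int, mode: str, threads: int = 16) -> List[str]:
    """Build (but do not execute) a rados bench command line."""
    if mode not in ("write", "seq", "rand"):
        raise ValueError("mode must be one of: write, seq, rand")
    cmd = ["rados", "bench", "-p", pool, str(seconds), mode, "-t", str(threads)]
    if mode == "write":
        # Keep the written objects so later seq/rand read runs have data.
        cmd.append("--no-cleanup")
    return cmd


# Hypothetical pool name and duration, for illustration only.
for mode in ("write", "rand"):
    print(" ".join(rados_bench_cmd("testpool", 60, mode)))
```

Running the write pass first and the read pass afterwards matters, because the read modes work on objects left behind by an earlier write run.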
The first amazing result was the I/O usage of the disks:
You can see that for the same job (random parallel reads and writes of lots of files of different sizes) BlueStore (rados002) uses about 30% of its disks' I/O resources, while the FileStore servers are near 100%.
Because of this, the total load average of the servers was also slightly different:
While disk reads were almost the same for all devices:
Disk writes were quite different:
BlueStore has a huge advantage over FileStore in terms of performance and disk I/O utilization. It is highly configurable and gives system administrators more robustness. We do not have enough information about its stability, but in our internal tests we saw no issues with it. So it looks like a solid, production-ready solution for next-generation Ceph clusters.