☁ ~ Sébastien Han
☁ ~ French Cloud Engineer working for eNovance
☁ ~ Daily job focused on Ceph and OpenStack
☁ ~ Blogger
Personal blog: http://www.sebastien-han.fr/blog/
Company blog: http://techs.enovance.com/
How does Ceph perform?
Essential IO knowledge
A single write...
A single object write with a replica of 2 leads to 4 IOs:
➜ The client sends its request to the primary OSD
➜ The first IO is written to the Ceph journal with O_DIRECT and AIO
➜ The filestore flushes the IO to the backend filesystem
The same process is repeated on the secondary OSD.
Not applicable if the OSD filesystem is Btrfs.
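The IO count above can be checked with a quick calculation. This is a minimal sketch, assuming a journaled filestore OSD where each replica performs two writes (journal, then filestore flush); the function name is hypothetical.

```python
def ios_per_object_write(replica_count: int, writes_per_osd: int = 2) -> int:
    """Number of disk IOs triggered by one client object write.

    writes_per_osd is 2 when a journal is used:
    one write to the journal, one flush to the backend filesystem.
    """
    return replica_count * writes_per_osd


print(ios_per_object_write(2))  # replica count of 2, as in the slide
print(ios_per_object_write(3))  # what a third replica would cost
```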
About the journal
What does it do?
➜ Ensures consistency
➜ Provides atomic transactions
➜ Writes sequential IOs
➜ Works as a FIFO and a common journal (replay)
Ceph OSD Daemon stops writes and synchronises the journal with the filesystem, allowing Ceph OSD Daemons to trim operations from the journal and reuse the space.
Journal and OSD data on the same disk
Journal penalty on the disk
Since we write twice, storing the journal on the same disk as the OSD data splits the disk's write bandwidth between the two:
Device:             wMB/s
sdb1 - journal      50.11
sdb2 - osd_data     40.25
IO life cycle overview
What are placement groups (PGs)?
➜ Shards of a pool
➜ Contain objects
➜ Bound to N OSDs, where N is the replica count
➜ Amount of data within the pool
➜ Memory used by the scrubbing process
➜ Load generated on the monitors
One more thing about the Journal...
If you lose the journal
you lose the OSD
How to build it?
How to start?
Things that you must consider:
➜ Use case
- IO profile: Bandwidth? IOPS? Mixed?
How many IOPS or Bandwidth per client do I want to deliver?
- Do I use Ceph standalone, or is it combined with a software solution?
➜ Amount of data (usable not RAW)
- Replica count
How much data do I need to re-balance if a node fails?
- Do I have a data growth plan?
➜ Budget :-)
Things that you must not do
➜ Don't put a RAID underneath your OSD
- Ceph already manages the replication
- A degraded RAID breaks performance
- RAID reduces the usable space on the cluster
➜ Don't build high density nodes with a tiny cluster
- Failure consideration and data to re-balance
- Potential full cluster
➜ Don't run Ceph on your hypervisors (unless you're broke)
For a perfect start
Work around the journal design
The main goal here is to work around the penalty introduced by the journal. Remember we write twice!
➜ Get transparent performance
➜ Reduce journal impact on the overall performance
Ways to implement it
➜ File on the OSD data disk
- Filesystem overhead
➜ Block on the OSD data disk
Create a tiny partition using the first sectors
➜ Separate spinning disk
- Good but the disk seeks a lot (concurrent journal writes)
➜ Separate SSD disk
- Best way, no seek, fast sequential writes, fast access times
It's fairly easy to saturate a 1Gb link with a sequential workload.
➜ Link speed:
- Link >1Gb is a must-have since 1Gb ~= 1 hdd
- however if you only do IOPS 1Gb ~= 1/2 SSDs
- 10Gb ethernet
- InfiniBand 40Gb (QDR) or 56Gb (FDR)
➜ Network optimisations:
- Jumbo frames (better CPU/bandwidth ratio)
- Non-blocking network switch backplane
- Backplane capacity = link speed * nb_servers
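To make the "1Gb ~= 1 hdd" rule of thumb concrete, here is a minimal sketch that converts link speeds to theoretical MB/s and compares them with a spinning disk. The 110 MB/s disk figure and the function name are assumptions for illustration, and protocol overhead is ignored.

```python
def link_mb_per_s(link_gbit: float) -> float:
    """Convert a link speed in Gb/s to theoretical MB/s (no protocol overhead)."""
    return link_gbit * 1000 / 8


# Assumption: sequential write speed of one enterprise spinning disk.
HDD_SEQ_WRITE_MB_S = 110

for gbit in (1, 10, 40):
    mb = link_mb_per_s(gbit)
    disks = mb / HDD_SEQ_WRITE_MB_S
    print(f"{gbit}Gb link: {mb:.0f} MB/s, roughly {disks:.1f} disks' worth of writes")
```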
Public and cluster networks
Using a single NIC means client writes and OSD replication share the same link
➜ 1 network for the client to OSD communications
➜ 1 network dedicated for the replication between OSDs
The more RAM you have the better caching you get!
General RAM recommendations:
➜ ~200MB per OSD daemon for common operation
➜ ~500MB - 1GB per OSD daemon during recovery
➜ The rest goes to the page cache
- Cache hits = full network bandwidth
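A rough way to apply these recommendations is to subtract the OSD daemons' share from the node's RAM and see what is left for the page cache. This is a sketch only: the 32 GB / 9 OSD node is an assumed example, and the function name is hypothetical.

```python
def page_cache_mb(total_ram_gb: float, osd_count: int, per_osd_mb: float) -> float:
    """RAM (in MB) left for the page cache once the OSD daemons take their share."""
    return total_ram_gb * 1024 - osd_count * per_osd_mb


# Assumed node for illustration: 32 GB of RAM running 9 OSD daemons.
normal = page_cache_mb(32, 9, 200)     # ~200MB per OSD in common operation
recovery = page_cache_mb(32, 9, 1024)  # up to ~1GB per OSD during recovery

print(f"page cache in normal operation: {normal / 1024:.1f} GB")
print(f"page cache during recovery:     {recovery / 1024:.1f} GB")
```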
Misaligned partitions can cause a huge performance degradation and extra write workload, so SSDs tend to wear out more quickly.
➜ GPT partitions
➜ Partition alignment (parted -a optimal)
➜ Same goes for LVM based journals!
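Alignment can be sanity-checked from a partition's start sector. The sketch below assumes 512-byte logical sectors and a 1 MiB alignment boundary; both values and the function name are assumptions for illustration, so check what your own device reports.

```python
SECTOR_SIZE = 512        # bytes; assumed logical sector size
ALIGNMENT = 1024 * 1024  # bytes; assumed alignment boundary (1 MiB)


def is_aligned(start_sector: int,
               sector_size: int = SECTOR_SIZE,
               alignment: int = ALIGNMENT) -> bool:
    """Return True if the partition's start offset falls on the alignment boundary."""
    return (start_sector * sector_size) % alignment == 0


print(is_aligned(63))    # start sector 63, a legacy MBR default
print(is_aligned(2048))  # start sector 2048, i.e. an offset of exactly 1 MiB
```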
Which filesystem should I use?
Schools of thought:
➜ Traditional filesystems:
- Requirement: extended attributes support
- Supported: ext4, XFS
- Journal mode: writeahead
➜ COW filesystems:
- Supported: Btrfs but not production ready
- Journal mode: parallel
Too many objects in a PG lead to poor data balancing and disturb the scrubbing process
➜ PG sum = (OSDs * 100 / max_rep_count)
➜ PGs per pool = ((OSDs * 100 / max_rep_count) / NB_POOLS)
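The two formulas above translate directly into code. This is a minimal sketch with hypothetical function names; the 120 OSDs, replica count of 2 and 2 pools are example inputs, and integer division is an assumption about how to round.

```python
def pg_sum(osds: int, max_rep_count: int) -> int:
    """Total number of PGs for the cluster: OSDs * 100 / max_rep_count."""
    return osds * 100 // max_rep_count


def pgs_per_pool(osds: int, max_rep_count: int, nb_pools: int) -> int:
    """Split the cluster-wide PG total evenly across the pools."""
    return pg_sum(osds, max_rep_count) // nb_pools


# Example inputs: 120 OSDs, replica count of 2, 2 pools.
print(pg_sum(120, 2))
print(pgs_per_pool(120, 2, 2))
```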
Made a mistake? Or want to scale up?
$ ceph osd pool set rbd pg_num 128
If I had to build a Ceph cluster I would...
Common Cloud use case
➜ Use case:
- OpenStack Private Cloud - Glance and Cinder
- Poor librbd caching implementation
- IO profile: mixed workload
➜ 60 TB usable
- Replica count of 2
- Plan to add 6 TB of usable capacity every 6 months
➜ Budget has been established :-)
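Usable capacity has to be multiplied by the replica count to get the raw capacity to buy. A minimal sketch of that sizing follows; the 1 TB disk size and the function names are assumptions for illustration.

```python
import math


def raw_capacity_tb(usable_tb: float, replica_count: int) -> float:
    """Raw capacity needed to deliver a given usable capacity."""
    return usable_tb * replica_count


def disks_needed(usable_tb: float, replica_count: int, disk_tb: float) -> int:
    """Number of data disks needed, rounded up to whole disks."""
    return math.ceil(raw_capacity_tb(usable_tb, replica_count) / disk_tb)


# 60 TB usable with a replica count of 2, on assumed 1 TB disks.
print(raw_capacity_tb(60, 2), "TB raw")
print(disks_needed(60, 2, 1), "disks")

# Growth plan: 6 more usable TB every 6 months.
print(disks_needed(6, 2, 1), "extra disks per 6 months")
```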
A journal that is too small leads to more frequent flushes. Can the backend keep up? Can the SSD itself?
journal size = (2 * (expected bandwidth throughput * filestore max sync interval))
For a well optimised 10Gb network:
12.5GB = (2 * (1250MB/s * 5s))
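The journal sizing formula is easy to script. The sketch below uses a hypothetical function name; the 1250 MB/s and 5 s inputs come from the 10Gb example, while the 110 MB/s case is an assumed single-disk throughput shown for comparison.

```python
def journal_size_mb(throughput_mb_s: float, sync_interval_s: float) -> float:
    """journal size = 2 * (expected throughput * filestore max sync interval)."""
    return 2 * throughput_mb_s * sync_interval_s


# 10Gb network example: 1250 MB/s, sync interval of 5 seconds.
size = journal_size_mb(1250, 5)
print(f"{size:.0f} MB = {size / 1000:.1f} GB")

# Assumption: sizing against a single 110 MB/s spinning disk instead.
print(f"{journal_size_mb(110, 5):.0f} MB")
```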
Journals per SSD?
Using an SSD to store the journals implies that the SSD must be able to sustain all the journal writes
Journal number = (SSD seq write speed) / (spinning disk seq write speed)
Example with an enterprise level SSD (~340 MB/s seq writes with O_DIRECT and O_DSYNC) and an enterprise level spinning disk (110 MB/s seq writes).
340/110 = 3.1
So 3 sounds reasonable and a good balance between performance and OSD loss.
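The same ratio can be computed for other hardware. This is a minimal sketch with hypothetical function names; rounding down to a whole number of journals is an assumption about how to read the result.

```python
def journals_per_ssd(ssd_seq_write_mb_s: float, hdd_seq_write_mb_s: float) -> int:
    """Whole number of journals one SSD can sustain at full disk write speed."""
    return int(ssd_seq_write_mb_s // hdd_seq_write_mb_s)


def bandwidth_per_journal(ssd_seq_write_mb_s: float, journals: int) -> float:
    """SSD write bandwidth available to each journal, in MB/s."""
    return ssd_seq_write_mb_s / journals


journals = journals_per_ssd(340, 110)
print(journals, "journals per SSD")
print(f"{bandwidth_per_journal(340, journals):.0f} MB/s per journal")
```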
Storage disks per machine?
Every disk's throughput consumes part of your network bandwidth.
Remember since I decided to put 3 journals per SSD, my SSD bandwidth is up to 113MB/sec for each journal.
My 3:1 OSD to SSD ratio:
➜ 10G network with SSD journals
= 1250MB/s / 12 gives you 104MB/s
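The per-disk share of the network can be recomputed for any disk count. A minimal sketch, with a hypothetical function name; the divisor of 12 is taken from the slide as-is, and the 110 MB/s disk speed used for comparison is an assumption.

```python
def network_share_mb_s(link_mb_s: float, disk_count: int) -> float:
    """Network bandwidth available per disk when all disks are busy."""
    return link_mb_s / disk_count


HDD_SEQ_WRITE_MB_S = 110  # assumption: spinning disk sequential write speed

share = network_share_mb_s(1250, 12)
print(f"{share:.0f} MB/s per disk")
# If the share is below the disk's own speed, the network is the limit.
print("network-bound" if share < HDD_SEQ_WRITE_MB_S else "disk-bound")
```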
The storage server
In the end, a good Ceph server (for me) looks like this:
➜ 2x Intel Xeon E5-2630L CPUs (6 cores/12 threads each)
➜ 32 GB RAM
➜ 2x 146G SAS 15K RPM (RAID 1 for the system)
➜ 3x SSDs Intel® Solid-State Drive 520 Series
- Model 120GB
- 50K IOPS and 500MB/s
➜ 9x Seagate® Constellation SAS 1TB @ 7.2K RPM 2.5"
➜ 2x 10Gb NICs
And I'll configure Ceph like this:
➜ XFS as OSD filesystem
➜ 120 OSDs (JBOD disks)
➜ 3000 PGs for Cinder
➜ 3000 PGs for Glance
So now you have an answer
➜ Rsocket? Somewhere next year?
➜ ZFS will be a good candidate in the near future.
➜ Btrfs is the way to go as soon as it gets production ready.
A future presentation will introduce:
➜ Benchmark methodology
➜ Dive into benchmarking tools
➜ Interpret the results!
➜ Real case study