Ceph Performance










Whoami


☁  ~ Sébastien Han
☁  ~ French Cloud Engineer working for eNovance
☁  ~ Daily job focused on Ceph and OpenStack
☁  ~ Blogger





Twitter: @sebastien_han


How does Ceph perform?


42*










Ceph basics


Essential IO knowledge








A single write...



A single object write with a replica count of 2 results in 4 IOs:

     ➜ The client sends its request to the primary OSD
     ➜ The first IO is written to the Ceph journal with O_DIRECT and AIO
     ➜ The filestore then flushes the IO to the backing filesystem

The same process is repeated on the secondary OSD.



Not applicable if the OSD filesystem is Btrfs.
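
As a sanity check of the arithmetic above, here it is as plain Python (an illustrative sketch, not Ceph code):

# IO amplification of a single replicated object write on filestore-backed OSDs
replica_count = 2
writes_per_osd = 2                 # 1x journal write + 1x filestore flush
total_ios = replica_count * writes_per_osd
print(total_ios)                   # 4 IOs for one client write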


About the journal



What does it do?

     ➜ Ensures consistency
     ➜ Provides atomic transactions
     ➜ Writes sequential IOs
     ➜ Works as a FIFO and a common journal (replay)

The Ceph OSD daemon periodically stops writes and synchronises the journal with the filesystem, which allows it to trim operations from the journal and reuse the space.



Journal and OSD data on the same disk


Journal penalty on the disk

Since every IO is written twice, storing the journal on the same disk as the OSD data roughly halves the usable throughput, as the iostat output below shows:

Device:             wMB/s
sdb1 - journal      50.11
sdb2 - osd_data     40.25
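
A rough way to model that penalty, as a sketch (the disk figure below is an assumption, not a measurement):

# When the journal shares the disk with the OSD data, every byte is
# written twice, so roughly half the sequential write speed is usable.
disk_seq_write_mbps = 100          # hypothetical spinning disk
effective_client_mbps = disk_seq_write_mbps / 2
print(effective_client_mbps)       # ~50 MB/s left for client writes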





IO life cycle overview


             

Placement group



What are they?

     ➜ Shards of a pool
     ➜ Contain objects
     ➜ Bound to N OSDs, where N is the replica count

Considerations:

     ➜ Amount of data within the pool
     ➜ Memory used by the scrubbing process
     ➜ Load generated on the monitors




One more thing about the Journal...


If you lose the journal 

you lose the OSD











CLUSTER


How to build it?







How to start?


Things that you must consider:

     ➜ Use case
      • IO profile: bandwidth? IOPS? Mixed?
      • How much bandwidth or how many IOPS do I want to deliver per client?
      • Is Ceph used standalone or combined with another software solution?

     ➜ Amount of data (usable, not raw)
      • Replica count
      • How much data do I need to re-balance if a node fails?
      • Do I have a data growth plan?

     ➜ Budget  :-)



Things that you must not do



     ➜ Don't put RAID underneath your OSDs
      • Ceph already manages the replication
      • A degraded RAID hurts performance
      • It reduces the usable space of the cluster

     ➜ Don't build high-density nodes with a tiny cluster
      • Think about failures and the amount of data to re-balance
      • Risk of a full cluster

     ➜ Don't run Ceph on your hypervisors (unless you're broke)






Best practices


For a perfect start








Work around the journal design


The main goal here is to work around the penalty introduced by the journal. Remember: we write twice!

Objectives:

     ➜ Make the journal transparent performance-wise
     ➜ Reduce the journal's impact on overall performance






Ways to implement it



How to:

     ➜ File on the OSD data disk
      • Filesystem overhead

     ➜ Block on the OSD data disk
      • Create a tiny partition using the first sectors

     ➜ Separate spinning disk
      • Good but the disk seeks a lot (concurrent journal writes)

     ➜ Separate SSD disk
      • Best option: no seeks, fast sequential writes, fast access times

Network design


It's fairly easy to saturate a 1Gb link with a sequential workload.

     ➜ Link speed:
      • A link >1Gb is a must-have, since 1Gb ~= 1 HDD
        • however, if you only do IOPS, 1Gb ~= 1/2 SSDs
      • 10Gb Ethernet
      • InfiniBand 40Gb (QDR) or 56Gb (FDR)

     ➜ Network optimisations:
      • Jumbo frames (better CPU/bandwidth ratio)
      • Non-blocking network switch back-plane (sketched below)
        • = link speed * nb_servers
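
A minimal sketch of the back-plane rule of thumb above, assuming a hypothetical link speed and server count:

# Non-blocking switch back-plane = link speed * number of servers
link_speed_gbps = 10               # per-server link speed (assumption)
nb_servers = 14                    # number of storage servers (assumption)
required_backplane_gbps = link_speed_gbps * nb_servers
print(required_backplane_gbps)     # 140 Gb/s of non-blocking switching capacity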

Public and cluster networks


Using a single NIC means client writes and OSD replication traffic share the same link.

General recommendations:

     ➜ 1 network for client-to-OSD communication
     ➜ 1 network dedicated to replication between OSDs







RAM


The more RAM you have, the better caching you get!

General RAM recommendations (a rough per-node budget is sketched below):

     ➜ ~200MB per OSD daemon during normal operation
     ➜ ~500MB - 1GB per OSD daemon during recovery
     ➜ The rest goes to the page cache
      • A cache hit is served at full network bandwidth
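
A back-of-the-envelope RAM budget per node based on the figures above (the OSD count per node is an assumption):

# RAM needed by the OSD daemons; whatever remains goes to the page cache.
osds_per_node = 9                          # assumption for the example
normal_mb = osds_per_node * 200            # ~200 MB per OSD, normal operation
recovery_mb = osds_per_node * 1024         # up to ~1 GB per OSD during recovery
print(normal_mb, recovery_mb)              # ~1.8 GB vs ~9 GB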





Partitions

Misaligned partitions can lead to severe performance degradation and extra write amplification, which also wears SSDs out more quickly.

     ➜ Use GPT partitions
     ➜ Align partitions (parted -a optimal); a quick check is sketched below
     ➜ The same goes for LVM-based journals!
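
A quick way to check alignment, as a sketch assuming Linux sysfs, 512-byte sectors, a 1 MiB alignment target and a hypothetical partition name:

# Read the partition's starting sector and check the byte offset.
from pathlib import Path

part = "sdb1"                                        # hypothetical journal partition
start_sectors = int(Path(f"/sys/class/block/{part}/start").read_text())
offset_bytes = start_sectors * 512
print("aligned" if offset_bytes % (1024 * 1024) == 0 else "misaligned")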

Filesystems


Which filesystem should I use?

Schools of thought:
     ➜ Traditional filesystems:
      • Requirement: extended attributes (xattrs) support
      • Supported: ext4, XFS
      • Journal mode: writeahead

     ➜ COW filesystems:
      • Supported: Btrfs, but not production-ready
      • Journal mode: writeparallel



PG number


Too many objects per PG leads to poor data balancing and disturbs the scrubbing process.

General recommendations (worked through in the sketch below):

     ➜ PG sum = (OSDs * 100 / max_rep_count)
     ➜ PGs per pool = ((OSDs * 100 / max_rep_count) / NB_POOLS)
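
The two formulas worked through in Python; the inputs are the example cluster built at the end of this deck (120 OSDs, replica count 2, two pools):

osds = 120
max_rep_count = 2
nb_pools = 2                                   # Cinder and Glance

pg_sum = osds * 100 / max_rep_count            # 6000 PGs for the whole cluster
pgs_per_pool = pg_sum / nb_pools               # 3000 PGs per pool
print(pg_sum, pgs_per_pool)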


Made a mistake? Or want to scale up?

$ ceph osd pool set rbd pg_num 128 




Mathematical approach

If I had to build a Ceph cluster I would...






Common Cloud use case



     ➜ Use case: 
      • OpenStack Private Cloud - Glance and Cinder
      • Poor librbd caching implementation
      • IO profile: mixed workload

     ➜ 60 TB usable
      • Replica count of 2
      • Plan to add 6 more usable terabytes every 6 months

     ➜ Budget has been established :-)
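
The usable-versus-raw arithmetic behind this use case, sketched with the figures from the slide:

usable_tb = 60
replica_count = 2
raw_tb = usable_tb * replica_count                   # 120 TB of raw disk needed
growth_usable_tb = 6                                 # extra usable TB every 6 months
growth_raw_tb = growth_usable_tb * replica_count     # +12 TB raw every 6 months
print(raw_tb, growth_raw_tb)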



Journal size


A journal that is too small leads to more frequent flushes: can the backing filesystem keep up? Can the SSD itself?

Established formula:

journal size = (2 * (expected bandwidth throughput * filestore max sync interval)) 

For a well-optimised 10Gb network (~1250 MB/s):

12.5 GB = (2 * (1250 MB/s * 5 s))
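
The same calculation as a small sketch (1250 MB/s and 5 s are the figures used above; 5 s is also the filestore default sync interval):

expected_throughput_mbps = 1250          # ~10Gb/s network, well optimised
filestore_max_sync_interval_s = 5
journal_size_mb = 2 * (expected_throughput_mbps * filestore_max_sync_interval_s)
print(journal_size_mb)                   # 12500 MB, i.e. ~12.5 GB per journal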


Journals per SSD?


Using an SSD to store journals implies that the SSD must be able to sustain all the journal writes.

Established formula:

Journal number = (SSD seq write speed) / (spinning disk seq write speed)

Example with an enterprise-level SSD (~340 MB/s seq writes with O_DIRECT and D_SYNC) and an enterprise-level spinning disk (110 MB/s seq writes):

340 / 110 ≈ 3.1


So 3 journals per SSD sounds reasonable: a good balance between performance and the number of OSDs lost if the SSD fails.
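
The same formula as a sketch, with the example disks above:

ssd_seq_write_mbps = 340          # enterprise SSD, O_DIRECT + D_SYNC sequential writes
hdd_seq_write_mbps = 110          # enterprise spinning disk, sequential writes
journals_per_ssd = ssd_seq_write_mbps / hdd_seq_write_mbps
print(journals_per_ssd)           # ~3.1, so 3 journals per SSD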

Storage disks per machine?


Every disk's throughput consumes a share of your network bandwidth.


Remember: since I decided to put 3 journals per SSD, each journal gets up to ~113 MB/s of SSD bandwidth (340 / 3).


My 3:1 OSD-to-SSD ratio:

     ➜ 10Gb network with SSD journals
      • 1250 MB/s / 12 ≈ 104 MB/s per disk
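
Putting the numbers together in one sketch (all figures come from the slides above):

network_mbps = 1250                      # one well-optimised 10Gb link
disks_per_server = 12                    # figure used on the slide
per_disk_mbps = network_mbps / disks_per_server
ssd_mbps_per_journal = 340 / 3           # 3 journals per SSD -> ~113 MB/s each
print(round(per_disk_mbps), round(ssd_mbps_per_journal))   # ~104 and ~113 MB/s
# Both sit close to the spinning disk's ~110 MB/s sequential write speed,
# so no single component is badly over- or under-sized.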




The storage server


In the end, a good Ceph server looks like this (for me):

     ➜ Two 6-core/12-thread Intel Xeon E5-2630L CPUs
     ➜ 32 GB RAM
     ➜ 2x 146GB SAS 15K RPM (RAID 1 for the system)
     ➜ 3x Intel® SSD 520 Series
      • 120GB model
      • 50K IOPS and 500MB/s

     ➜ 9x Seagate® Constellation SAS 1TB @ 7.2K RPM, 2.5"
     ➜ 2x 10Gb NICs



Ceph config



And I'll configure Ceph like this:

     ➜ XFS as OSD filesystem
     ➜ 120 OSDs (JBOD disks)
     ➜ 3000 PGs for Cinder
     ➜ 3000 PGs for Glance









So now, you've got an answer












Perspective



     ➜ Rsocket? Somewhere next year?
     ➜ ZFS will be a good candidate in the near future.
     ➜ Btrfs is the way to go as soon as it gets production-ready.










Coming soon



A future presentation will introduce:

     ➜ Benchmark methodology
     ➜ Dive into benchmarking tools
     ➜ Interpret the results!
     ➜ Real case study









Many thanks!


Questions?







