Ceph Performance


☁  ~ Sébastien Han
☁  ~ French Cloud Engineer working for eNovance
☁  ~ Daily job focused on Ceph and OpenStack
☁  ~ Blogger

Twitter: @sebastien_han
Personal blog: http://www.sebastien-han.fr/blog/
Company blog: http://techs.enovance.com/

How does Ceph perform?


*The Hitchhiker's Guide to the Galaxy

Ceph basics

Essential IO knowledge

A single write...

A single object write with a replica of 2 leads to 4 IOs:

     ➜ Client sends its request to the primary OSD
     ➜ First IO is written to the Ceph journal with O_DIRECT and AIO
     ➜ The filestore flushes the IO to the backend filesystem

The same process is repeated on the secondary OSD.

Not applicable if the OSD filesystem is Btrfs.
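
As a quick sanity check of the "4 IOs" figure, here is a minimal sketch of the arithmetic. The function name and its parameters are made up for illustration; the only facts taken from the slide are that each OSD writes once to the journal and once to the backend filesystem, and that this happens on every replica.

```python
def backend_ios(replica_count: int, writes_per_osd: int = 2) -> int:
    """Disk IOs triggered by a single client object write.

    writes_per_osd defaults to 2: one journal write, then one
    filestore flush to the backend filesystem (non-Btrfs case).
    """
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return replica_count * writes_per_osd


print(backend_ios(2))  # 4
print(backend_ios(3))  # 6
```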

About the journal

What does it do?

     ➜ Ensures consistency
     ➜ Provides atomic transactions
     ➜ Writes sequential IOs
     ➜ Works as a FIFO and a common journal (replay)

Periodically, the Ceph OSD daemon stops writes and synchronises the journal with the filesystem, which allows it to trim operations from the journal and reuse the space.

Journal and OSD data on the same disk

Journal penalty on the disk

Since we write twice, storing the journal on the same disk as the OSD data results in the following write throughput:

Device:             wMB/s
sdb1 - journal      50.11
sdb2 - osd_data     40.25

IO life cycle overview


Placement group

What are they?

     ➜ Shards of a pool
     ➜ Contain objects
     ➜ Bound to N OSDs, where N is the replica count


     ➜ Amount of data within the pool
     ➜ Memory used by the scrubbing process
     ➜ Load generated on the monitors

One more thing about the Journal...

If you lose the journal 

you lose the OSD


How to build it?

How to start?

Things that you must consider:

     ➜ Use case 
      • IO profile: Bandwidth? IOPS? Mixed?
        • How many IOPS or Bandwidth per client do I want to deliver?
      • Do I use Ceph standalone, or is it combined with another software solution?

     ➜ Amount of data (usable not RAW)
      • Replica count
      • How much data do I need to re-balance if a node fails?
      • Do I have a data growth plan?

     ➜ Budget  :-)

Things that you must not do

     ➜ Don't put a RAID underneath your OSD
      • Ceph already manages the replication
      • A degraded RAID hurts performance
      • RAID reduces the usable space on the cluster

     ➜ Don't build high density nodes with a tiny cluster
      • Failure consideration and data to re-balance
      • Potential full cluster

     ➜ Don't run Ceph on your hypervisors (unless you're broke)

Best practices

For a perfect start

Work around the journal design

The main goal here is to work around the penalty introduced by the journal. Remember, we write twice!


     ➜ Get transparent performance
     ➜ Reduce journal impact on the overall performance

Ways to implement it

How to:

     ➜ File on the OSD data disk
      • Filesystem overhead

     ➜ Block on the OSD data disk
      • Create a tiny partition using the first sectors

     ➜ Separate spinning disk
      • Good but the disk seeks a lot (concurrent journal writes)

     ➜  Separate SSD disk
      • Best way, no seek, fast sequential writes, fast access times

Network design

It's fairly easy to saturate a 1Gb link with a sequential workload.

     ➜ Link speed:
      • Link >1Gb is a must-have since 1Gb ~= 1 HDD
        • however if you only do IOPS 1Gb ~= 1/2 SSDs
      • 10Gb Ethernet
      • InfiniBand 40Gb (QDR) or 56Gb (FDR)

     ➜ Network optimisations:
      • Jumbo frames (better CPU/bandwidth ratio)
      • Non-blocking network switch back-plane
        • = link speed * nb_servers

Public and cluster networks

Using a single NIC means client writes and OSD replication traffic share the same link

General recommendations:

     ➜ 1 network for the client to OSD communications
     ➜ 1 network dedicated for the replication between OSDs


The more RAM you have the better caching you get!

General RAM recommendations:

     ➜ ~200MB per OSD daemon for common operation
     ➜ ~500MB - 1GB per OSD daemon during recovery
     ➜ The rest goes for the page cache
      • Hit caching = full network bandwidth


Misaligned partitions can cause severe performance degradation and extra write workload, so SSDs tend to wear out more quickly.

     ➜ GPT partitions
     ➜ Partition alignment (parted -a optimal)
     ➜ Same goes for LVM based journals!
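
One way to reason about alignment is to check that a partition's start offset is a multiple of some boundary. The sketch below assumes 512-byte logical sectors and a 1 MiB boundary, which is a common convention but not something these slides specify; the function name is hypothetical, and the start sector would have to be read from your partitioning tool.

```python
def is_aligned(start_sector: int, sector_size: int = 512,
               boundary: int = 1024 * 1024) -> bool:
    """True if the partition's byte offset is a multiple of the boundary."""
    return (start_sector * sector_size) % boundary == 0


print(is_aligned(2048))  # True: starts exactly at 1 MiB
print(is_aligned(63))    # False: legacy MS-DOS style start sector
```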


Which filesystem should I use?

Schools of thought:
     ➜ Traditional filesystems:
      • Requirement: extended attributes support
      • Supported: ext4, XFS
      • Journal mode: writeahead

     ➜ COW filesystems:
      • Supported: Btrfs, but not production ready
      • Journal mode: writeparallel

PG number

Too many objects in a PG leads to poor data balancing and disturbs the scrubbing process

General recommendations:

     ➜ PG sum = (OSDs * 100 / max_rep_count)
     ➜ PGs per pool = ((OSDs * 100 / max_rep_count) / NB_POOLS)
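
The two formulas above are easy to script. The sketch below uses hypothetical function names and integer division; the 120 OSDs, replica count of 2 and 2 pools are the values used for the example cluster later in these slides.

```python
def total_pgs(osds: int, max_rep_count: int, pgs_per_osd: int = 100) -> int:
    """PG sum = OSDs * 100 / max_rep_count."""
    return osds * pgs_per_osd // max_rep_count


def pgs_per_pool(osds: int, max_rep_count: int, nb_pools: int) -> int:
    """PGs per pool = PG sum / NB_POOLS."""
    return total_pgs(osds, max_rep_count) // nb_pools


print(total_pgs(120, 2))        # 6000
print(pgs_per_pool(120, 2, 2))  # 3000
```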

Made a mistake? Or want to scale up?

$ ceph osd pool set rbd pg_num 128 

Mathematical approach

If I had to build a Ceph cluster I would...

Common Cloud use case

     ➜ Use case: 
      • OpenStack Private Cloud - Glance and Cinder
      • Poor librbd caching implementation
      • IO profile: mixed workload

     ➜ 60 TB usable
      • Replica count of 2
      • Plan is to add 6 more usable terabytes every 6 months

     ➜ Budget has been established :-)

Journal size

A journal that is too small leads to more frequent flushes. Can the backend keep up? Can the SSD itself?

Established formula:

journal size = (2 * (expected bandwidth throughput * filestore max sync interval)) 

For a well optimised 10Gb network (1250 MB/s) and a filestore max sync interval of 5 seconds:

12.5 GB = (2 * (1250 * 5))
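
The formula can be sketched as a one-line function. The function name is hypothetical; throughput is in MB/s and the sync interval in seconds, so the result is in MB. The second call applies the same formula to a single 110 MB/s spinning disk, which is an assumption about what "expected throughput" could also mean, not something stated on this slide.

```python
def journal_size_mb(throughput_mb_per_s: float, sync_interval_s: float) -> float:
    """journal size = 2 * (expected throughput * filestore max sync interval)."""
    return 2 * throughput_mb_per_s * sync_interval_s


# 10Gb network, 5 s sync interval
print(journal_size_mb(1250, 5))  # 12500 MB, i.e. 12.5 GB
# Same formula bounded by one 110 MB/s spinning disk
print(journal_size_mb(110, 5))   # 1100 MB
```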

Journals per SSD?

Using an SSD to store the journals implies that the SSD must be capable of sustaining all the journal writes

Established formula:
Journal number = (SSD seq write speed) / (spinning disk seq write speed) 

Example with an enterprise level SSD (~340 MB/s seq writes with O_DIRECT and O_DSYNC) and an enterprise level spinning disk (110 MB/s seq writes).
340 / 110 = 3.1

So 3 sounds reasonable and a good balance between performance and OSD loss.
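
Here is a minimal sketch of that ratio, rounding down so the SSD is never oversubscribed. The function names are hypothetical; the 340 MB/s and 110 MB/s figures are the ones from the example above.

```python
import math


def journals_per_ssd(ssd_seq_write_mb_s: float, hdd_seq_write_mb_s: float) -> int:
    """Journal number = SSD seq write speed / spinning disk seq write speed."""
    return math.floor(ssd_seq_write_mb_s / hdd_seq_write_mb_s)


def bandwidth_per_journal(ssd_seq_write_mb_s: float, journals: int) -> float:
    """SSD write bandwidth available to each journal, in MB/s."""
    return ssd_seq_write_mb_s / journals


n = journals_per_ssd(340, 110)
print(n)                                    # 3
print(round(bandwidth_per_journal(340, n)))  # 113
```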

Storage disks per machine?

Every disk's throughput eats into your network bandwidth.

Remember, since I decided to put 3 journals per SSD, each journal gets up to 113 MB/s of SSD bandwidth.

My 3:1 OSD per SSD ratio:

     ➜ 10Gb network with SSD journals
      • 1250 MB/s / 12 gives you ~104 MB/s
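
The division above can be sketched as follows. The slide does not say what the 12 counts; it happens to match the 9 data disks plus 3 SSDs of the storage server described later, which is treated here as an assumption. Function and parameter names are hypothetical.

```python
def share_per_device_mb_s(network_mb_s: float, nb_devices: int) -> float:
    """Network bandwidth available to each device if all are busy at once."""
    return network_mb_s / nb_devices


NETWORK_MB_S = 1250   # 10Gb link
NB_DEVICES = 12       # assumption: 9 data disks + 3 journal SSDs

share = share_per_device_mb_s(NETWORK_MB_S, NB_DEVICES)
print(round(share))  # 104

# Compare with the ~110 MB/s a spinning disk can write sequentially:
# the network share is slightly lower, so the link is the limiting factor.
print(share < 110)   # True
```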

The storage server

In the end, a good Ceph server looks like (for me):

     ➜ Two 6-Cores/12Threads Intel Xeon CPU E5-2630L 
     ➜  32 GB RAM
     ➜  2x 146G SAS 15K RPM (RAID 1 for the system)
     ➜  3x SSDs Intel® Solid-State Drive 520 Series
      • Model 120GB
      • 50K IOPS and 500MB/s

     ➜  9x Seagate® Constellation SAS 1TB @ 7.2K RPM 2.5"
     ➜  2x 10Gb NICs

Ceph config

And I'll configure Ceph like this:

     ➜ XFS as OSD filesystem
     ➜ 120 OSDs (JBOD disks)
     ➜ 3000 PGs for Cinder
     ➜ 3000 PGs for Glance
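
The 120 OSDs follow from the use case: 60 TB usable with a replica count of 2, on 1 TB data disks. Below is a rough sizing sketch with hypothetical helper names. It assumes one OSD per data disk and 9 data disks per server as in the server spec above, and it ignores filesystem overhead and free-space headroom.

```python
import math


def raw_tb(usable_tb: float, replica_count: int) -> float:
    """Raw capacity needed to deliver the usable capacity."""
    return usable_tb * replica_count


def osd_count(raw_capacity_tb: float, disk_tb: float) -> int:
    """Number of data disks (one OSD each) needed for the raw capacity."""
    return math.ceil(raw_capacity_tb / disk_tb)


def server_count(osds: int, disks_per_server: int) -> int:
    """Number of storage servers needed to host all OSDs."""
    return math.ceil(osds / disks_per_server)


osds = osd_count(raw_tb(60, 2), 1)
print(osds)                   # 120
print(server_count(osds, 9))  # 14

# Growth plan: 6 more usable TB every 6 months
print(osd_count(raw_tb(6, 2), 1))  # 12 extra disks per 6 months
```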

So now, you got an answer


     ➜ Rsocket? Sometime next year?
     ➜ ZFS will be a good candidate in the near future.
     ➜ Btrfs is the way to go as soon as it gets production ready.

Coming soon

A future presentation will introduce:

     ➜ Benchmark methodology
     ➜ Dive into benchmarking tools
     ➜ Interpret the results!
     ➜ Real case study

Many thanks!


Slides: https://slid.es/sebastienhan/ceph-performance-and-benchmarking/
Contact: sebastien@enovance.com