Benchmarking Ceph on a Two Node Proxmox Cluster

It is inadvisable to run Ceph on two nodes! That said, I've been using a two node Ceph cluster as my primary data store for several weeks now.

In this post we look at the relative read and write performance of replicated and non-replicated Ceph pools, both with rados bench and from VM guests using various backends. We'll start with the results; the details of how we generated them are included after the break.

We’ll begin by exploring the impact of pool redundancy on performance using rados bench, but first we must disable scrubbing for the duration of the tests. There were several Ceph RBDs passed through to a handful of VMs during these tests, but they should have experienced negligible use.

$ ceph osd set noscrub && ceph osd set nodeep-scrub
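When the benchmarks are done, scrubbing can be re-enabled by clearing the same flags:

$ ceph osd unset noscrub && ceph osd unset nodeep-scrub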

My OSD tree is as follows:

$ ceph osd tree
ID WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 26.31421 root default
-2 12.69534     host proxdave
 0  0.44969         osd.0            up  1.00000          1.00000
 1  0.44969         osd.1            up  1.00000          1.00000
 2  0.44969         osd.2            up  1.00000          1.00000
 5  0.44969         osd.5            up  1.00000          1.00000
 6  1.81360         osd.6            up  1.00000          1.00000
 9  4.54149         osd.9            up  1.00000          1.00000
10  4.54149         osd.10           up  1.00000          1.00000
-3 13.61887     host superdave
 3  2.72279         osd.3            up  1.00000          1.00000
 8  1.81360         osd.8            up  1.00000          1.00000
11  2.72279         osd.11           up  1.00000          1.00000
12  3.63199         osd.12           up  1.00000          1.00000
 4  2.72769         osd.4            up  1.00000          1.00000

As you can see, I have a random assortment of disks ranging from 500GB to 5TB. Each OSD was roughly 50% full at the time of these tests. For the purposes of Ceph, my Proxmox nodes are linked by a single 10GbE DAC cable (with SolarFlare NICs, if that matters).
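For reference, pools like the two benchmarked below can be created roughly like this. The pool names match the ones used in this post, the PG count of 128 is illustrative, and size 1 gives the single-copy ("un-replicated") pool, while size 2 with min_size 1 keeps a two node replicated pool writable if one node goes down:

$ ceph osd pool create rados_bench_2 128 128 replicated
$ ceph osd pool set rados_bench_2 size 2
$ ceph osd pool set rados_bench_2 min_size 1
$ ceph osd pool create rados_bench_1 128 128 replicated
$ ceph osd pool set rados_bench_1 size 1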

First let’s test the replicated pool.

$ rados bench -p rados_bench_2 30 write --no-cleanup
Total time run: 30.608952
Total writes made: 871
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 113.823
Stddev Bandwidth: 19.4471
Max bandwidth (MB/sec): 152
Min bandwidth (MB/sec): 76
Average IOPS: 28
Stddev IOPS: 4
Max IOPS: 38
Min IOPS: 19
Average Latency(s): 0.561979
Stddev Latency(s): 0.156564
Max latency(s): 1.28033
Min latency(s): 0.188836

Not earth-shattering, but not terrible. I used iftop to monitor traffic on the 10GbE link, which peaked at 1.2Gb/s.
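For the curious, the monitoring was simply iftop pointed at the Ceph link (the interface name here is a placeholder):

$ iftop -n -i enp3s0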

Next we'll test the non-replicated pool. I would expect this to be a lot faster.

$ rados bench -p rados_bench_1 30 write --no-cleanup
Total time run: 30.300715
Total writes made: 1544
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 203.824
Stddev Bandwidth: 31.079
Max bandwidth (MB/sec): 260
Min bandwidth (MB/sec): 148
Average IOPS: 50
Stddev IOPS: 7
Max IOPS: 65
Min IOPS: 37
Average Latency(s): 0.313471
Stddev Latency(s): 0.247869
Max latency(s): 1.35251
Min latency(s): 0.0353862

Pretty good! Not quite double the replicated pool's performance. It's interesting that while the minimum latency is roughly five times lower than the replicated pool's, the average latency is much closer. I expected traffic on the 10GbE link to be quite a bit lower, but it still peaked at an identical 1.2Gb/s.

What about read performance? We looked at both sequential and random reads, starting with the replicated pool. Before each test we flushed the cache on each node using the following command; without clearing the caches, the results were dramatically different.

$ sync; echo 3 > /proc/sys/vm/drop_caches
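To flush both nodes in one go (assuming root SSH access between the two Proxmox hosts), a small loop over the node names does the trick:

$ for node in proxdave superdave; do ssh root@$node 'sync; echo 3 > /proc/sys/vm/drop_caches'; done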

Sequential Read on Replicated Dataset:

$ rados bench -p rados_bench_2 30 seq
Total time run: 30.745361
Total reads made: 1255
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 163.277
Average IOPS: 40
Stddev IOPS: 11
Max IOPS: 57
Min IOPS: 14
Average Latency(s): 0.390551
Max latency(s): 4.33258
Min latency(s): 0.0109083

Random Read on Replicated Dataset:

$ rados bench -p rados_bench_2 30 rand
Total time run: 30.732372
Total reads made: 1422
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 185.082
Average IOPS: 46
Stddev IOPS: 7
Max IOPS: 71
Min IOPS: 35
Average Latency(s):
Max latency(s): 2.1783
Min latency(s): 0.00332452

Sequential Read on Single Instance Dataset:

$ rados bench -p rados_bench_1 30 seq
Total time run: 30.559431
Total reads made: 1636
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 214.14
Average IOPS: 53
Stddev IOPS: 12
Max IOPS: 74
Min IOPS: 28
Average Latency(s): 0.297755
Max latency(s): 3.4695
Min latency(s): 0.0110362

Random Read on Single Instance Dataset:

$ rados bench -p rados_bench_1 30 rand
Total time run: 31.124347
Total reads made: 1523
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 195.731
Average IOPS: 48
Stddev IOPS: 12
Max IOPS: 69
Min IOPS: 3
Average Latency(s): 0.325351
Max latency(s): 3.60241
Min latency(s): 0.0033494


Ceph must do some read caching (or leverage the system page cache), as re-running the sequential read gave much better results:

$ rados bench -p rados_bench_2 30 seq
Total time run: 30.472793
Total reads made: 6847
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 898.769
Average IOPS: 224
Stddev IOPS: 158
Max IOPS: 500
Min IOPS: 69
Average Latency(s): 0.0705746
Max latency(s): 1.85529
Min latency(s): 0.0031176
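Since the write benchmarks were run with --no-cleanup (so their objects could be reused for the read tests), the leftover benchmark objects should be removed once testing is finished:

$ rados -p rados_bench_1 cleanup
$ rados -p rados_bench_2 cleanup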

Finally, I sampled the performance of the RBD devices from a Debian guest. The RBD devices were formatted in the guest using XFS (Yo Dawg). To benchmark write performance we used the following dd command to write 10GB sequentially:

$ dd bs=1M count=10000 if=/dev/zero of=test.img conv=fdatasync
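For completeness, the guest-side preparation before each run went something like this; the device path is an assumption (a VirtIO disk typically shows up as /dev/vdb, a SCSI one as /dev/sdb):

$ mkfs.xfs /dev/vdb
$ mount /dev/vdb /mnt/bench
$ cd /mnt/bench

The write results for the various backend and cache combinations were: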

VirtIO, no cache, no threads, non-replicated: 162 MB/s
VirtIO, no cache, no threads, replicated: 74.7 MB/s
SCSI(VIO), no cache, no threads, non-replicated: 127 MB/s
VirtIO, no cache, IO threads, non-replicated: 170 MB/s
VirtIO, write-through, IO threads, non-replicated: 129 MB/s
VirtIO, write-back (unsafe), no threads, non-replicated: 150 MB/s
VirtIO, write-back (unsafe), IO threads, non-replicated: 155 MB/s
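These combinations correspond to the cache and iothread disk options in Proxmox. As a rough example (VM ID, storage name, and volume name are placeholders), switching a disk to the "write-back (unsafe)" mode with an IO thread looks something like:

$ qm set 100 --virtio1 ceph_vm:vm-100-disk-1,cache=unsafe,iothread=1

cache=none and cache=writethrough cover the other modes tested above.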
