Are you seeing lots of `slow requests are blocked` errors during high throughput on your Ceph storage?
We were experiencing serious issues on two Supermicro nodes with IOMMU enabled (keywords: dmar dma pte vpfn), but even on our ASRock C2750 system things weren't behaving as they should.
We were tearing our hair out trying to figure out what was going on, especially as we had been using my Solarflare dual SFP+ 10GbE NICs for non-Ceph purposes for years.
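If you want to check whether a node shows the same symptoms, something like the following may help. It is only a rough sketch: `count_matches` is a made-up helper name, and the exact wording of kernel and Ceph messages varies between versions, so treat the patterns as assumptions to adjust.

```shell
#!/bin/sh
# count_matches PATTERN
# Hypothetical helper (not part of Ceph or the kernel tooling):
# counts the stdin lines that match PATTERN, case-insensitively,
# using extended regex syntax. Prints 0 when nothing matches.
count_matches() {
    grep -i -c -E "$1" || true
}

# Example usage on a live node (dmesg may need root):
#   dmesg | count_matches 'dmar|dma pte|vpfn'
#   ceph health detail | count_matches 'slow request|requests are blocked'
```

A non-zero count from the first command on an IOMMU-enabled node, together with slow request warnings from the second, would match what we were seeing.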
The answer in this case was to manually install the sfc driver from Solarflare's website (kudos to Solarflare for providing active driver releases covering 5+ year-old hardware, btw).
Check existing driver:
$ modinfo sfc
--- version: 4.1 ---
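If you only care about the version line, a small filter saves scrolling through the full modinfo output. This is a minimal sketch; `module_version` is a name invented here for illustration.

```shell
#!/bin/sh
# module_version
# Hypothetical helper: reads `modinfo` output on stdin and prints only
# the value of the "version:" field. The exact match on the first field
# means "srcversion:" lines are ignored.
module_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Example usage:
#   modinfo sfc | module_version
# modinfo can also print a single field directly:
#   modinfo -F version sfc
```

Keep in mind that modinfo reports the module file on disk, which is not necessarily the one the running kernel has loaded.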
Download the driver:
Install alien, kernel headers and dkms:
apt-get install alien pve-headers dkms
Extract the RPM and convert to .deb:
alien -c sfc-dkms-184.108.40.2064-0.sf.1.noarch.rpm
Build and install:
dpkg -i sfc-dkms_220.127.116.114-1_all.deb
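Since the build happens as part of the package install, it is worth confirming that DKMS actually reports the module as installed before moving on. The sketch below assumes `dkms status` prints one line per module, starting with the module name followed by `,` or `/` (the format differs between dkms versions); `dkms_installed` is a hypothetical helper name.

```shell
#!/bin/sh
# dkms_installed NAME
# Hypothetical helper: reads `dkms status` output on stdin and succeeds
# (exit status 0) if a line for module NAME is marked as installed.
# Assumes lines look like "sfc, <ver>, <kernel>, <arch>: installed"
# or "sfc/<ver>, <kernel>, <arch>: installed".
dkms_installed() {
    grep -E "^$1[,/]" | grep -q ': installed'
}

# Example usage:
#   dkms status | dkms_installed sfc && echo "sfc built and installed"
#
# The running kernel keeps using the old module until it is reloaded
# or the node is rebooted. Reloading drops the links on these NICs,
# so do it from a console or over another interface:
#   modprobe -r sfc && modprobe sfc
```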
Check driver was updated correctly:
$ modinfo sfc
--- version: 18.104.22.1684 ---
After this, we experienced no further slow request warnings or timed-out file transfers, even under intense sustained I/O.