Context
I have a few servers running Ubuntu 24.04 and LXD for virtualization.
When I first set things up, the recommendations I found were to use either btrfs or ZFS for the data pool. Well, I already had some btrfs experience, and it was the default offered for the data pool when I first ran lxd init, so ¯\_(ツ)_/¯
I ran minimal Ubuntu LXC containers with a few MySQL databases. The IOPS seemed fine, but after some time the host started showing kernel interrupts resetting the NVMe controller. Well 💩 Maybe the drive is just going bad?
I checked the nvme smart-log:
root@srv3:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 42 °C (315 K)
available_spare : 100%
available_spare_threshold : 1%
percentage_used : 9%
endurance group critical warning summary: 0
Data Units Read : 6511080360 (3.33 PB)
Data Units Written : 46253039 (23.68 TB)
host_read_commands : 36934739756
host_write_commands : 1080500554
controller_busy_time : 162856
power_cycles : 18
power_on_hours : 5358
unsafe_shutdowns : 14
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 42 °C (315 K)
Temperature Sensor 2 : 28 °C (301 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Yikes, 9% used, that’s not great… but then again there were a ton of reads and some writes… but that can’t be it?
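For a sense of scale, the counters above can be converted by hand. The NVMe spec reports data units in thousands of 512-byte blocks, so one unit is 512,000 bytes. A small sketch of that arithmetic (the function name units_to_tb is made up for illustration):

```shell
#!/bin/sh
# One NVMe "data unit" = 1000 blocks x 512 bytes = 512,000 bytes.
# units_to_tb is a hypothetical helper name, not part of nvme-cli.
units_to_tb() {
    awk -v u="$1" 'BEGIN { printf "%.2f\n", u * 512000 / 1e12 }'
}

units_to_tb 6511080360   # Data Units Read    -> 3333.67 (TB, i.e. 3.33 PB)
units_to_tb 46253039     # Data Units Written -> 23.68 (TB)
```

That matches what nvme-cli printed, and it puts reads at roughly 140 times the writes on this drive.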
Every once in a while, when using VSCode with some of the RHEL-based Java extensions, the server CPU would spike, things would go all sideways, and the container would become unresponsive until I killed it.
I did some of the normal things, like parsing logs to try to correlate errors and bla bla. I swapped out the NVMe in the box with the most kernel interrupts… and eventually the problems returned on the new drive!
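The log parsing was nothing fancy. A rough sketch of the kind of filter that helps here is below. The function name nvme_reset_lines is hypothetical, and the pattern is only a guess at typical kernel wording, not the exact messages from these servers:

```shell
#!/bin/sh
# Keep only kernel log lines that mention an NVMe timeout or controller reset.
# Assumption: the regex is a loose guess; adjust it to what your kernel prints.
nvme_reset_lines() {
    grep -Ei 'nvme.*(timeout|reset)'
}

# Usage on a live host (needs read access to the journal):
#   journalctl -k --no-pager -o short-iso | nvme_reset_lines
```

With ISO timestamps in the output, it gets easier to line the resets up against CPU spikes or container hangs seen elsewhere.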
Everything is falling apart.
What should I try?
The idea only came after watching an interview with the creator of LXD/Incus, where it seemed like ZFS might be the right way to go… and after reading some other place that mentioned ZFS’s ARC was magic, so it was worth a shot.
I put the “bad” NVMe in a USB housing and did a big shuffle swap. I settled on zpool create -o ashift=12 -O compression=lz4 -O atime=off <pool_name> /dev/disk/by-id/<id>
ashift=12 (4k blocks): tells ZFS to match the native/actual block size for NVMe.
compression=lz4: for the price of a little CPU, fewer bits have to go down the PCIe lanes and they get written to fewer NVMe blocks.
atime=off: further reduces the writes.
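Google the details, since ashift has to be set at pool creation and can’t be changed afterwards (compression and atime can be changed later with zfs set, though a compression change only applies to newly written data). There is also more tuning you can do for DBs. I haven’t done that yet since baby steps, and I might not.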
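Since ashift is locked in at creation, it is worth confirming what the pool actually ended up with. A minimal sketch, assuming ZFS is installed. The function name show_pool_settings is made up for illustration, and you pass in your own pool name:

```shell
#!/bin/sh
# Print the settings discussed above for a given pool.
# show_pool_settings is a hypothetical helper; zpool get and zfs get are the real tools.
show_pool_settings() {
    pool="$1"
    if ! command -v zpool >/dev/null 2>&1; then
        echo "zpool not found; is ZFS installed?" >&2
        return 1
    fi
    # ashift is a pool (vdev) property and is fixed once the vdev exists.
    zpool get ashift "$pool" || return 1
    # compression and atime are dataset properties; compressratio shows the savings so far.
    zfs get compression,atime,compressratio "$pool"
}
```

Calling show_pool_settings with the pool name should list ashift 12, compression lz4 and atime off if the create command above was used as written.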
Results?
Night and day. Before, I would have huge periodic CPU spikes and IOPS starvation. Now it’s all good. The ARC is doing its fair share of work keeping things warm for reads, and that probably helps a lot too…
root@server:~# arcstat 1
time read ddread ddh% dmread dmh% pread ph% size c avail
09:28:34 0 0 0 0 0 0 0 62G 62G 1.8G
09:28:35 154 0 0 154 100 0 0 62G 62G 1.8G
09:28:36 777 612 93 164 100 1 100 62G 62G 1.8G
09:28:37 0 0 0 0 0 0 0 62G 62G 1.8G
09:28:38 0 0 0 0 0 0 0 62G 62G 1.8G
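arcstat shows per-second samples. For a cumulative number, the hits and misses counters that OpenZFS on Linux exposes in /proc/spl/kstat/zfs/arcstats can be turned into an overall hit percentage. A small sketch (arc_hit_ratio is a hypothetical helper name, and the optional file argument is only there so it can be pointed at a saved copy):

```shell
#!/bin/sh
# Overall ARC hit percentage since the ZFS module was loaded.
# Each kstat line has the form: name type value
arc_hit_ratio() {
    awk '$1 == "hits"   { h = $3 }
         $1 == "misses" { m = $3 }
         END {
             if (h + m > 0) printf "%.1f\n", 100 * h / (h + m)
             else print "n/a"
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Only run against the live counters when they exist on this machine.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    arc_hit_ratio
fi
```

A high number here only says reads are mostly served from RAM. It does not by itself prove the ARC is what fixed the stalls.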
root@srv2:~# zpool iostat -vl 1
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim rebuild
pool alloc free read write read write read write read write read write read write wait wait wait
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
default 1.18T 2.45T 41 16 4.61M 155K 200us 98us 193us 36us 1us 1us 138us 64us 516us - -
nvme-WD_BLACK_SN850X_4000GB_xxxxxxxxxxx 1.18T 2.45T 41 16 4.61M 155K 200us 98us 193us 36us 1us 1us 138us 64us 516us - -
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim rebuild
pool alloc free read write read write read write read write read write read write wait wait wait
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
default 1.18T 2.45T 0 0 91.9K 0 393us - 393us - 3us - - - - - -
nvme-WD_BLACK_SN850X_4000GB_xxxxxxxxxxx 1.18T 2.45T 0 0 91.9K 0 393us - 393us - 3us - - - - - -
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim rebuild
pool alloc free read write read write read write read write read write read write wait wait wait
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
default 1.18T 2.45T 0 107 104K 979K 393us 86us 393us 34us 3us 1us - 54us - - -
nvme-WD_BLACK_SN850X_4000GB_xxxxxxxxxxx 1.18T 2.45T 0 107 104K 979K 393us 86us 393us 34us 3us 1us - 54us - - -
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim rebuild
pool alloc free read write read write read write read write read write read write wait wait wait
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
default 1.18T 2.45T 3 0 363K 0 393us - 393us - 3us - 1us - - - -
nvme-WD_BLACK_SN850X_4000GB_xxxxxxxxxxx 1.18T 2.45T 3 0 363K 0 393us - 393us - 3us - 1us - - - -
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim rebuild
pool alloc free read write read write read write read write read write read write wait wait wait
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
default 1.18T 2.45T 0 0 104K 0 393us - 393us - 3us - - - - - -
nvme-WD_BLACK_SN850X_4000GB_xxxxxxxxxxx 1.18T 2.45T 0 0 104K 0 393us - 393us - 3us - - - - - -
------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
If this helps even one single other person then it’s worth it.