upgrade qemu on baikal to fix virtio-blk thrashing our disks #29
Right now, when a guest machine issues a trim or discard operation to the disk, the virtio driver used for the VM's qcow2 image interprets that as "write a shit load of zeros to the disk", which counts as a write and makes the disk wear out a lot faster.
Instead we want the block storage driver to pass it through as a trim/discard operation, which is a special operation SSDs support to solve exactly this issue: it simply marks the blocks as unused instead of writing over them with zeros, so the disk doesn't wear out as quickly.
This will require a maintenance window. Here are some notes:
Here is the current trajectory of the SSDs wearing out
https://prometheus.cyberia.club/graph?g0.expr=smartmon_total_lbas_written_raw_value%7B%7D&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=2d
so we should see those lines bend downward if the upgrade actually fixes something.
smartmon_media_wearout_indicator_raw_value also sounds good and looks the same.

Ok this deployment was carried out and succeeded... ish.
We got Debian, QEMU, and libvirt upgraded, but unfortunately it did not seem to fix our problem with the trims and discards. NEW capsuls have discard support OOTB, but existing ones don't, so it doesn't really fix our problem.
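One way to tell a capsul with discard support from one without is to ask the guest kernel. This is a minimal sketch, not something taken from the deployment: it assumes a Linux guest, reads the standard sysfs file `queue/discard_max_bytes` for a block device, and treats a value of 0 as "no discard". The device name `vda` and the `sysfs_root` parameter are illustrative assumptions.

```python
from pathlib import Path


def discard_supported(device: str, sysfs_root: str = "/sys") -> bool:
    """Return True if the kernel reports a non-zero discard limit for `device`.

    `device` is a block device name such as "vda" (an assumed name; check
    what the guest actually uses). A missing or unreadable file is reported
    as False, so False means "not supported or unknown".
    """
    limit_file = Path(sysfs_root) / "block" / device / "queue" / "discard_max_bytes"
    try:
        # The kernel exposes 0 here when the device cannot accept discards.
        return int(limit_file.read_text().strip()) > 0
    except (FileNotFoundError, ValueError):
        return False


if __name__ == "__main__":
    print("vda supports discard:", discard_supported("vda"))
```

Run inside a guest, this only says whether the guest sees a discard-capable disk. It says nothing about what the host does with the request afterwards.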
j3s (he/him)
i updated my personal (old) box to 3.15 and diffed the configs, this is what i got
oh shit i got it
to enable discard support, we must:
we can easily do the change in the VM definitions. people might be annoyed if we force stop everyone's VMs again
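Before touching any VM definitions it may help to survey what they currently contain. The sketch below assumes the definitions are libvirt domain XML and reports each domain's machine type along with the `discard` attribute on every disk's `<driver>` element (libvirt accepts `unmap` or `ignore` there). The example XML, the domain name, and the function name are made up for illustration, and whether that attribute is part of the fix here is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain definition, shaped like typical libvirt domain XML.
EXAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <name>example-capsul</name>
  <os>
    <type arch='x86_64' machine='pc-i440fx-3.1'>hvm</type>
  </os>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/example.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def summarize_domain(domain_xml: str) -> dict:
    """Extract the machine type and per-disk discard setting from domain XML."""
    root = ET.fromstring(domain_xml.strip())
    os_type = root.find("./os/type")
    machine = os_type.get("machine") if os_type is not None else None
    disks = []
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        disks.append({
            "target": target.get("dev") if target is not None else None,
            # None means the attribute is absent, so the hypervisor default applies.
            "discard": driver.get("discard") if driver is not None else None,
        })
    return {"name": root.findtext("name"), "machine": machine, "disks": disks}


if __name__ == "__main__":
    print(summarize_domain(EXAMPLE_DOMAIN_XML))
```

In practice the XML would come from something like `virsh dumpxml <name>`. The sketch only reads definitions, so it can run without stopping anyone's VM.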
Ok we eventually figured out how to do this for the majority of the existing VMs.
Most of the VMs were pc-i440fx-3.1 and could be upgraded to pc-i440fx-5.2, which caused discards to start working on existing ones.
However, some of the VMs were the pc-q35 machine type ones. I was able to get 1 of those (elliot) to upgrade to pc-i440fx-5.2, however discards did not start working.

However all of this enabling discards did not seem to fix the problem. Having discard support is nice but it appears the real problem is some of our users REALLY like to write to the disk a lot.
Here is the disk's own view of write activity (rate of change of the wearout indicator, computed over 30m windows):
https://prometheus.cyberia.club/graph?g0.expr=deriv(smartmon_media_wearout_indicator_raw_value%7B%7D%5B30m%5D)&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1d&g0.end_input=2021-12-17%2007%3A52%3A45&g0.moment_input=2021-12-17%2007%3A52%3A45
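For anyone reading the graph above: PromQL's `deriv()` fits a simple linear regression to the samples in the window and returns the slope per second. Here is a rough sketch of that calculation in plain Python. The function name and the sample values are invented for illustration, and it is not Prometheus's actual implementation.

```python
def linear_regression_slope(samples):
    """Least-squares slope of (unix_seconds, value) samples, in units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Covariance of time and value, and variance of time (both unnormalized).
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var


if __name__ == "__main__":
    # Made-up samples: a counter growing by 60 units every 60 seconds.
    window = [(0, 100.0), (60, 160.0), (120, 220.0)]
    print(linear_regression_slope(window))
```

A flatter line on the graph corresponds to a smaller slope, so a real drop in write load should show up as this number shrinking from one window to the next.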
Here is # of writes by capsul:
https://grafana.cyberia.club/d/jMw9xSRMz/capsul-stats?viewPanel=3&orgId=1&from=1638748463900&to=1639762070109