December/January Maintenance Window 2: How To Clean Everything #30

Closed
opened 2021-12-16 17:46:31 +00:00 by forest · 1 comment
Owner

Propagandhi -- How To Clean Everything album cover

This issue exists to track our plan for an upcoming maintenance window where we will fix the underlying cause of our disks wearing out or die trying (and restore from backup).

This is the primary reason why this maintenance is being performed:

## [upgrade qemu on baikal to fix virtio-blk thrashing our disks](https://git.cyberia.club/cyberia/capsul-flask/issues/29)

There are a couple other changes we would like to make which also require stopping all customer VMs and/or restarting baikal:

### [Capsul outage mitigation: need a way to shutdown the server](https://git.cyberia.club/cyberia/capsul-flask/issues/13)

Finally, there are some other changes which are "nice to haves". They are not required for this maintenance, but if we have time I would like to get them fully in place before the maintenance.

#### [Capsul outage mitigation: capsul hub's database should not run on a capsul](https://git.cyberia.club/cyberia/capsul-flask/issues/9)

# backup and rollback strategy

> **forest (he/him)**
> so what happens if it goes bad?
> 90% chance we can at least boot and ssh in after the upgrade right ?
> what happens if we can boot and ssh in, but the vms wont start or something ?
> just for the sake of argument, is it possible to actually restore from backup from that state ?
> just thinking about it off the top of my head, I don't know how we would actually do that
> is it even possible to do that on the boot partition while booted into said boot partition ?

> **Nyaaori ⚛️**
>
> > **forest (he/him)**
> > is it even possible to do that on the boot partition while booted into said boot partition ?
>
> yes but i would not recommend it

So, we will need the ability to boot from another drive in order to restore the backup in case things do not work. Baikal is not booted in UEFI mode, so we cannot select a one-time boot device from the running OS with efibootmgr, and we will have to have a KVM hooked up in order to do this:

root@baikal:~# efibootmgr
EFI variables are not supported on this system.

root@baikal:~# ls -lah /sys/firmware/efi
ls: cannot access '/sys/firmware/efi': No such file or directory
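
The same check can be wrapped in a small script, for example to run it before and after the upgrade. This is only a sketch: the `boot_mode` function name and its optional sysfs-root argument are made up here (the argument exists so the function can be exercised without real firmware). It relies on the kernel exposing `/sys/firmware/efi` only when booted through UEFI.

```shell
#!/bin/sh
# Report whether the running system was booted via UEFI or legacy BIOS.
# The kernel only creates /sys/firmware/efi on a UEFI boot.
# The optional argument (an alternate sysfs root) is a hypothetical
# convenience for testing; by default the real /sys is inspected.
boot_mode() {
    sysfs_root="${1:-/sys}"
    if [ -d "$sysfs_root/firmware/efi" ]; then
        echo "uefi"
    else
        echo "bios"
    fi
}

boot_mode
```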

So we'll need to coordinate with CyberWurx. IMO we should ask them to hook up KVM for us pre-emptively.

Either we can ask them to insert a Linux recovery USB for us, or we can potentially boot from an ISO file stored on the "normal" boot partition.

> **Nyaaori ⚛️**
> oh also
> if you can shrink the live partition you can just add another bootable partition
> wait you can literally just boot an iso from grub
> https://help.ubuntu.com/community/Grub2/ISOBoot
> if you have a sufficiently new grub version you should be able to add an entry for an iso
> then do a bootonce into it
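
A rough sketch of what that GRUB entry could look like, following the loopback approach from the linked Ubuntu page. Everything passed to the function below is a placeholder, not something we have checked on baikal: the `iso_menuentry` name, the `(hd0,1)` device, the ISO path, and the kernel/initrd paths and boot arguments, which differ from one live ISO to another.

```shell
#!/bin/sh
# Print a GRUB 2 menuentry that loop-mounts an ISO file and boots its kernel.
# All values in the example call at the bottom are hypothetical placeholders.
iso_menuentry() {
    title="$1"      # menu title, used later to select the entry
    device="$2"     # GRUB name of the partition holding the ISO, e.g. (hd0,1)
    isofile="$3"    # path to the ISO relative to that partition's root
    kernel="$4"     # kernel path inside the ISO (varies per distro)
    initrd="$5"     # initrd path inside the ISO (varies per distro)
    args="$6"       # kernel arguments telling the live system where the ISO is

    printf 'menuentry "%s" {\n' "$title"
    printf '    loopback loop %s%s\n' "$device" "$isofile"
    printf '    linux (loop)%s %s\n' "$kernel" "$args"
    printf '    initrd (loop)%s\n' "$initrd"
    printf '}\n'
}

iso_menuentry "Recovery ISO" "(hd0,1)" "/recovery.iso" \
    "/casper/vmlinuz" "/casper/initrd" \
    "boot=casper iso-scan/filename=/recovery.iso noprompt noeject"
```

The output would typically be appended to `/etc/grub.d/40_custom`, followed by `update-grub`. For the "bootonce" part, `grub-reboot "Recovery ISO"` selects the entry for the next boot only, provided `GRUB_DEFAULT=saved` is set in `/etc/default/grub`.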

We have taken a full backup of the boot drive and CyberWurx is standing by to assist with KVM and recovery OS if needed.

forest (Author, Owner) commented:

Ok this deployment was carried out and succeeded... ish.

We got Debian, QEMU, and libvirt upgraded, but unfortunately it did not seem to fix our problem with the trims and discards. NEW capsuls have discard support OOTB, but existing ones don't, so it doesn't really fix our problem.
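
One way to compare a new capsul against an existing one might be to look at the disk `<driver>` element in each libvirt domain XML. The sketch below assumes the difference shows up as a `discard='unmap'` attribute there, which this issue does not confirm; `has_discard_unmap` is a made-up helper name, and it does a plain text match rather than real XML parsing.

```shell
#!/bin/sh
# Read a libvirt domain XML on stdin and report whether any <driver>
# element sets discard='unmap'. This is a text match, not an XML parser,
# and the idea that this attribute explains the difference is an assumption.
has_discard_unmap() {
    if grep -Eq "<driver[^>]*discard=['\"]unmap['\"]"; then
        echo "discard=unmap"
    else
        echo "no discard"
    fi
}

# Sample input for illustration only (not taken from a real capsul).
has_discard_unmap <<'EOF'
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' discard='unmap'/>
</disk>
EOF
```

Against a real VM the input would come from `virsh dumpxml <domain>` piped into the function.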

We also tried to deploy the systemd drop-in for fixing the shutdown process of libvirt-guests, but it was still calling the old script, not our new one. `systemctl status` was showing the drop-in, but it wasn't actually overriding the `ExecStop` script with the one we had specified.
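
We did not confirm the cause, but one common reason for this symptom is that `ExecStop=` can hold several commands: a drop-in that only sets a new `ExecStop=` adds to the packaged unit's commands instead of replacing them. Assigning an empty value first resets the list. A minimal sketch, where the `execstop_override` name and the script path are hypothetical:

```shell
#!/bin/sh
# Print a drop-in that replaces (rather than appends to) a unit's ExecStop.
# The empty "ExecStop=" line clears the commands inherited from the packaged unit.
execstop_override() {
    new_script="$1"   # hypothetical path to the replacement shutdown script
    printf '[Service]\n'
    printf 'ExecStop=\n'
    printf 'ExecStop=%s\n' "$new_script"
}

execstop_override /usr/local/bin/capsul-shutdown-guests.sh
```

The output would go in a file such as `/etc/systemd/system/libvirt-guests.service.d/override.conf`, followed by `systemctl daemon-reload`. `systemctl show -p ExecStop libvirt-guests` then shows which commands are actually in effect, which is more telling than the drop-in merely being listed in `systemctl status`.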

Reference: cyberia/capsul-flask#30