
Cyberia operations handbook

This project provides guidance for those wonderful operations members who are technically "on the hook" for all Cyberia-related production services, especially:

  • Capsul
  • Matrix
  • Forge
  • Jitsi (Cafe)
  • Mumble

Our current list of operations members is:

  • fack
  • forest
  • j3s
  • plantdaddy
  • skh
  • vvesley

Ansible

For information on how to run Ansible, see the README.md in the ansible folder.

On-Call

Operations provides around-the-clock shared support. We are all technically on-call 24/7/365.

Things to keep an eye on

All operations members must:

  • have a Matrix account and join #ops:cyberia.club and #services:cyberia.club
  • have a Forge account

Alerts

Check alerts on Prometheus

This handbook contains the alert definitions, and Prometheus pulls them down automatically whenever they are updated.

You are welcome to adjust, add, and remove alerts.

Graphs

Grafana graphs are currently available here

Informational alerts

labels:
  severity: info

Informational alerts can safely be ignored, but they indicate curious happenings in our infrastructure. Operations members are not expected to look at or react to these.
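
As a rough illustration, an informational rule might look like the sketch below. The group name, alert name, threshold, and the assumption that node_exporter filesystem metrics are being scraped are all hypothetical; none of them are taken from this handbook's actual rules.

```yaml
groups:
  - name: example-info                # hypothetical group name
    rules:
      - alert: DiskSpaceGettingLow    # hypothetical alert name
        # Assumes node_exporter is scraped; true when under 20% of a filesystem is free.
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.20
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Less than 20% disk space free on {{ $labels.instance }}"
```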

Critical alerts

labels:
  severity: critical

Critical alerts are those that indicate a failing of one or more of our services.

When a critical alert fires, we should respond with a sense of urgency.

If a critical alert fires and there was no associated outage, it is our shared responsibility to eliminate that alert. All critical alerts must be actionable. If a critical alert can be resolved by a cronjob, it should be resolved by a cronjob and removed as an alert.
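
A minimal sketch of a critical rule written with actionability in mind is shown below. It relies on `up`, the per-target scrape health series that Prometheus generates itself; the group name, alert name, and `job` value are hypothetical and not taken from this handbook's actual rules.

```yaml
groups:
  - name: example-critical            # hypothetical group name
    rules:
      - alert: ServiceDown            # hypothetical alert name
        # up == 0 means Prometheus could not scrape the target.
        expr: up{job="example-service"} == 0
        # Keep the alert pending until the condition has held for 5 minutes.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is unreachable"
          description: "State what to check first, so the alert is actionable."
```

The `for` clause keeps the alert in a pending state until the expression has been true for the whole duration, which is one way to avoid firing on brief blips that involve no real outage.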

Incidents

Don't panic. If you are feeling overwhelmed, please contact another operations member about the issue at hand.

Communication

We will communicate in the #ops:cyberia.club Matrix channel. If that channel is down, we will communicate via cafe.cyberia.club/ops instead.

Communicate your activity. If you are bouncing a machine, tell #ops that you are bouncing it. If you are reloading a service, tell #ops so that people don't step on each other's toes. Announcing what you do also helps establish a timeline of the incident.

References

Capsul

Prometheus

Matrix

Email

Forge

Git