So, VITAS and I investigated the cause of last week's outage. Most likely the server ran out of memory. There were some syslog messages about the Apache web server being unable to write to the access log because the device was out of space, but the /var/log partition was not full.
VITAS looked at the stats again an hour ago: gunicorn was hoarding all the memory, and even the swap space was completely used up, which rules out caching.
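In case it helps with keeping an eye on this, here is a rough sketch of how the per-process memory of the gunicorn processes could be read straight from `/proc` on Linux. The function names are made up for illustration, and matching on "gunicorn" in the command line is an assumption about how the processes show up on our server.

```python
import os
import re


def parse_rss_kib(status_text):
    """Return the VmRSS value (KiB) from the text of /proc/<pid>/status, or 0."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def gunicorn_rss_kib(proc_root="/proc"):
    """Map pid -> resident memory (KiB) for processes whose command line mentions gunicorn."""
    usage = {}
    if not os.path.isdir(proc_root):
        return usage  # not a Linux-style /proc, nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if "gunicorn" not in cmdline:
                continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                usage[int(entry)] = parse_rss_kib(f.read())
        except OSError:
            continue  # process exited or is not readable
    return usage


if __name__ == "__main__":
    for pid, rss in sorted(gunicorn_rss_kib().items()):
        print(f"{pid}\t{rss / 1024:.1f} MiB")
```

Running this a few times over a day should show whether the resident size of individual workers keeps growing, which is what a leak would look like.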
My guess is it's a gunicorn memory leak.
I've found the following in the gunicorn docs, and I'm trying it out on alpha right now.
https://docs.gunicorn.org/en/latest/settings.html#max-requests
`--max-requests` restarts each worker after it has handled a set number of requests, which should mitigate the problem if the workers have a memory leak.
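For reference, a minimal sketch of how this could look in a gunicorn config file. The numbers are placeholders I picked for illustration, not what is running on alpha. `max_requests_jitter` is a related setting from the same docs page.

```python
# gunicorn.conf.py (hypothetical values, tune them for the actual traffic)

# Restart a worker after it has handled this many requests.
# 0 (the default) disables automatic restarts.
max_requests = 1000

# Add a random amount between 0 and this value to each worker's limit,
# so that the workers don't all restart at the same moment.
max_requests_jitter = 100
```

The same can be passed on the command line as `--max-requests 1000 --max-requests-jitter 100`.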
I think it's worth a shot. If it doesn't work, we can at least rule out the workers as the problem.