Those of you in other parts of the world, or who spend your late nights online, may have noticed that the websites went down last night (at approximately 22:38 UTC/GMT). The monitoring I've put in place alerted me, so I was able to get onto it pretty quickly. Unfortunately, it seems we had some filesystem corruption, and the script that would normally help fix that ended up causing a few issues: some files needed to restart the web services on the server were moved to the 'lost+found' directory, because the script couldn't work out what they were meant to be.
This meant I had to restore from our most recent backup. The daily backup would have taken us back to about 9am yesterday, but luckily our monthly backup had run at about 13:40 GMT, so that's the one we restored from.
You'll need to work out for yourself what time that would be in your time zone (for us Brits, that's 14:40, since we're currently an hour ahead of GMT). Anything done on the sites between then and the time the servers went down (approx 22:38 GMT) will have been lost, but luckily this is only a period of about 9 hours. I recommend that everyone check their emails to see whether they received any posts or tag updates in that period. If you did, you'll need to go back into your sites and copy and paste the new content from the email back into the site.
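If you'd rather not do the time zone arithmetic by hand, here is a minimal Python sketch that works out the lost window and converts the backup time. The calendar date is a placeholder I've made up (any date works for the arithmetic), and the fixed one-hour offset simply mirrors the "an hour ahead" note for the UK; swap in your own offset as needed.

```python
from datetime import datetime, timedelta, timezone

# Placeholder date: an assumption for illustration only, not the real outage date.
YEAR, MONTH, DAY = 2020, 6, 1

# Times quoted in the post, all in GMT/UTC.
backup_utc = datetime(YEAR, MONTH, DAY, 13, 40, tzinfo=timezone.utc)
outage_utc = datetime(YEAR, MONTH, DAY, 22, 38, tzinfo=timezone.utc)
# The restore finished just after midnight, i.e. on the following day.
restored_utc = datetime(YEAR, MONTH, DAY, 0, 40, tzinfo=timezone.utc) + timedelta(days=1)

# Fixed offset for UK summer time (one hour ahead of GMT).
BST = timezone(timedelta(hours=1), "BST")

lost_window = outage_utc - backup_utc   # changes made in this span were lost
downtime = restored_utc - outage_utc    # how long the sites were unreachable

print("Backup taken (UK time):", backup_utc.astimezone(BST).strftime("%H:%M %Z"))
print("Lost window:", lost_window)
print("Downtime:", downtime)
```

To see the backup time in whatever zone your own computer is set to, `backup_utc.astimezone()` with no argument should convert to your system's local time.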
The sites were restored at about 00:40 GMT this morning, so we only had about 2 hours of actual downtime. In the future, I might look into ways to add some resilience, at least for the main trio of fleet sites. At work we use Amazon Web Services, which lets you spin up two identical servers with a load balancer between them, so if one has an issue, the other picks up the slack.