Exiting 2010 with a crash... and a bang... and a boom.

So yeah, that just happened.

This morning our new SAN suffered a hang, and while it was fairly academic to get it back up and running (ZFS is pretty awesome), it took me quite some time to find the radical MTU mismatch between the server and the FreeBSD VPS nodes. Interestingly, the Linux machines coped with the mismatched MTUs without any issue. The BSD machines would just block on NFS I/O and hang.
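For anyone chasing a similar problem, here is a minimal sketch of the kind of check that would have shortened the hunt: collect the MTU each machine reports and flag the segment if they don't all agree. The host names and MTU values below are made up for illustration (they are not our real numbers), and gathering the values from each box is left out.

```python
def find_mtu_mismatches(mtus):
    """Group hosts by MTU; return the groups only if more than one MTU is in use."""
    groups = {}
    for host, mtu in mtus.items():
        groups.setdefault(mtu, []).append(host)
    if len(groups) <= 1:
        return {}  # everyone agrees, nothing to report
    return {mtu: sorted(hosts) for mtu, hosts in groups.items()}


# Hypothetical hosts and values, for illustration only.
observed = {"san": 9000, "vps-node-1": 1500, "vps-node-2": 1500}

mismatches = find_mtu_mismatches(observed)
for mtu, hosts in sorted(mismatches.items()):
    print(mtu, hosts)
```

An empty result means every host on the segment reports the same MTU; anything else lists which hosts sit on which value.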

Figured that all out just before 3PM.

Just before 4PM, the entire Toronto infrastructure disappeared from the Internet.

Our first thought was power, because given the multiple routers, multiple switches and multiple uplinks, nothing should be able to wipe out everything - other than a power failure.

Or both of our core routers failing.

Chainsaw crashed once back in August, and never before, and never since, until today, that is. That too was a double-failure, because Jigsaw had been crashing every few weeks - later we determined it to be a faulty hard-drive in Jigsaw.

Since then Jigsaw was temporarily replaced by Seesaw (...yes.), but immediately it started crashing too - eventually I caught it happening in slow motion and realized it was the new flowtable system in FreeBSD. A bit of tuning and tweaking, and then a bunch of wait-and-see if it crashes. And it didn't. Hooooray. 'Course, while we were waiting to prove a negative, we didn't spend too much energy on auditing the configuration because we weren't sure if we were going to have to do it all over again shortly. And at that point, Jigsaw was still somewhat usable.

But... somewhere in there, I made a tyop configuring one of the uplinks on Seesaw, and the other uplink provider changed their configuration (unbeknownst to us) - so when Chainsaw crashed today, Seesaw was completely unable to take over... and I was left with no backdoor to reconfigure things. (We can't have as much redundancy on Jigsaw/Seesaw as we do on Chainsaw, unfortunately. That's why Chainsaw is considered the Primary.)

Called in one of our techs, who made it on-site within 15 minutes, and realized that Chainsaw's hardware is fried. And then we spent the next 45 minutes reconfiguring BGP from scratch, struggling with the typo and unknown changes... eventually we got it going.

Right now we're down one of our DNS servers, so please be patient, or try other servers if you want. Hope to have that solved in the next 24 hours.

Three elements wiped us out this afternoon: 1) hardware, 2) a months-old typographical error, 3) an undocumented configuration change.
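Items 2 and 3 are both configuration drift, which is the sort of thing a routine diff against a known-good copy can surface before it matters. A rough sketch of that idea, using Python's standard difflib; the config lines are invented placeholders (not real router syntax, and not our actual configuration), and the addresses come from the documentation range.

```python
import difflib


def config_drift(expected, actual):
    """Return unified-diff lines between a known-good config and the live one."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


# Hypothetical placeholder config, for illustration only.
expected = "uplink_a_peer = 192.0.2.1\nuplink_b_peer = 198.51.100.1\n"
actual = "uplink_a_peer = 192.0.2.11\nuplink_b_peer = 198.51.100.1\n"

drift = config_drift(expected, actual)
for line in drift:
    print(line)
```

An empty list means the live config matches the reference. A diff like this only catches our own typos, though; a change on the provider's side would still need a separate end-to-end failover test to show up.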

Virtualmin hosting down

We've experienced a bizarre hardware failure with the server that runs the majority of our webhosting customers. It will be a few hours till things are back in order.

Montreal Maintenance Tonight

Both the mailserver and core router in Montréal will be undergoing hardware repairs tonight. These services will be unavailable between 11PM and 4AM, though hopefully for much less time than that.

E-Mail running a bit slow

Our mailserver had a hard-drive fail a few days ago. Since the drives are in a RAID1 configuration, the read speed has been halved, and that's causing backups and regular maintenance to run somewhat slower. This should be resolved early next week.

