Between 9:50 PM PST and 10:28 PM PST on January 27th, 2015, most Faithlife sites and services were unable to talk to the public Internet. We’re sorry for the interruption this caused, and we’re taking steps to reduce the likelihood of this happening again.
Our edge routers needed to be patched for the “Ghost” glibc vulnerability (CVE-2015-0235). The patching process on our primary edge router froze while updating Quagga, the daemon responsible for BGP and OSPF. The frozen patch process was subsequently killed, which unexpectedly killed the active Quagga daemon as well. When the Quagga daemon stops, that node can no longer advertise our ASN and public subnet. Normally, this should result in only a very small interruption, because our secondary edge router should start advertising our public subnet over its already established BGP session with a different ISP. Unfortunately, we are in the process of making large changes to our secondary edge router, and a few of the more important routes were misconfigured. This rendered the secondary edge router mostly unusable. Because the patch was being applied remotely over a VPN connection that relies on OSPF to reach the primary router, that router also became inaccessible once Quagga stopped. With no remote access to the primary edge router, we drove to the data center immediately, physically connected to the machine, completed the patching, and restarted the Quagga daemon.
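The failure mode above, where the primary was taken down while the secondary was missing important routes, is the kind of thing a pre-patch check can catch. What follows is only a minimal sketch of that idea, not the tooling we use: the function names `missing_routes` and `safe_to_patch_primary` are hypothetical, the prefixes are documentation-only example ranges, and it assumes the secondary router's route list has already been exported as plain CIDR strings by some other means.

```python
import ipaddress


def missing_routes(required, configured):
    """Return the required prefixes not covered by any configured prefix.

    Both arguments are lists of CIDR strings. A required prefix counts as
    covered when it is equal to, or a subnet of, a configured prefix.
    """
    configured_nets = [ipaddress.ip_network(p) for p in configured]
    missing = []
    for prefix in required:
        net = ipaddress.ip_network(prefix)
        covered = any(
            # subnet_of() raises TypeError across IP versions, so guard first.
            net.version == c.version and net.subnet_of(c)
            for c in configured_nets
        )
        if not covered:
            missing.append(prefix)
    return missing


def safe_to_patch_primary(required, secondary_routes):
    """Hypothetical gate: only patch the primary if the secondary has every
    required route, so that a failover would actually carry traffic."""
    return not missing_routes(required, secondary_routes)


if __name__ == "__main__":
    # Example data only (RFC 5737 documentation ranges), not real routes.
    required = ["192.0.2.0/24", "198.51.100.0/24"]
    secondary = ["192.0.2.0/24"]
    print("missing on secondary:", missing_routes(required, secondary))
    print("safe to patch primary:", safe_to_patch_primary(required, secondary))
```

A check like this only compares route lists; it says nothing about whether the secondary's BGP session is actually up, so it would complement, not replace, verifying the session itself before touching the primary.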
What We’re Doing
We’re currently going through a re-configuration of our edge routers and firewalls which will enable us to advertise our ASN and public subnet from multiple geographically diverse locations with different Internet Service Providers. This is actually a project that we hoped to have completed before going into 2015, but contracts and difficulties with the physical layer proved tougher than expected. Once this is complete, an issue like this should only cause a very small interruption of service for a subset of our users. Additionally, we’ll be adding console switches with multiple out-of-band connectivity options so that we don’t have to spend the time it takes to drive to the data center or open a remote hands ticket.