Service Interruption 1/27/15

Between 9:50 PM and 10:28 PM PST on January 27th, 2015, most Faithlife sites and services were unable to talk to the public Internet. We’re sorry for the interruption this caused, and we’re taking steps to reduce the likelihood of this happening again.

Cause

Our edge routers needed to be patched for the “GHOST” glibc vulnerability, CVE-2015-0235. The patch process on our primary edge router froze while updating Quagga, the daemon responsible for BGP and OSPF. The frozen patch process was subsequently killed, which unexpectedly killed the running Quagga daemon as well. When the Quagga daemon stops, that node can no longer advertise our ASN and public subnet. Normally this should cause only a very brief interruption, because our secondary edge router should start advertising our public subnet over its already established BGP session with a different ISP. Unfortunately, we were in the middle of making large changes to that secondary router, and a few of its more important routes were misconfigured, which rendered it mostly unusable. Worse, the patch was being applied remotely over a VPN connection that relies on OSPF (also provided by Quagga) to reach the router, so killing the daemon also cut off our remote access. With the primary edge router unreachable, we drove to the data center immediately, connected to the machine physically, completed the patching, and restarted the Quagga daemon.
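For readers who haven’t worked with Quagga, here’s a minimal sketch of what the bgpd side of an edge router’s configuration looks like. The ASNs, prefix, and neighbor address below are placeholder documentation values, not our real ones; the point is that the advertisement only exists while the bgpd process is alive and its session to the ISP is up.

    ! Sketch of an edge router's bgpd config (placeholder ASNs and prefixes).
    ! If bgpd is killed, the session below drops and the prefix is withdrawn.
    router bgp 64512
     bgp router-id 192.0.2.1
     ! originate our public subnet
     network 203.0.113.0/24
     ! BGP session to this site's ISP
     neighbor 198.51.100.1 remote-as 64500
     neighbor 198.51.100.1 description isp-a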

What We’re Doing

We’re currently reconfiguring our edge routers and firewalls so that we can advertise our ASN and public subnet from multiple geographically diverse locations with different Internet Service Providers. This is a project we had hoped to complete before going into 2015, but contracts and difficulties with the physical layer proved tougher than expected. Once it’s done, an issue like this should cause only a brief interruption of service for a subset of our users. Additionally, we’ll be adding console switches with multiple out-of-band connectivity options, so we won’t have to burn the time it takes to drive to the data center or open a remote hands ticket.
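As a rough illustration of the multi-site idea (this is one common pattern, not necessarily our exact policy), a secondary site can originate the same public subnet but prepend its advertisements so the Internet prefers the primary site while it’s healthy. Again, the ASNs and addresses are placeholders.

    ! Hypothetical secondary-site edge router: originate the same subnet,
    ! but prepend so this path is only preferred if the primary disappears.
    router bgp 64512
     network 203.0.113.0/24
     neighbor 198.51.100.9 remote-as 64501
     neighbor 198.51.100.9 route-map PREPEND-OUT out
    !
    route-map PREPEND-OUT permit 10
     set as-path prepend 64512 64512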

Hardware – Part I (Network)

The hardware powering Faithlife has seen a massive transformation in the last eighteen months. We’re really excited about all the cool new changes and the measurable impact they’ve had on our employees, our customers, and the products and features we’re able to offer. Given that, we thought sharing our hardware configuration would be a fun way to live our values and showcase what we think is pretty cool.

Philosophy

At Faithlife we value smart, versatile learners and automation over expensive vendor solutions. Smart, versatile learners don’t lose value when technology changes or the company changes direction; vendor solutions often do. If we can use commodity hardware and free open source software to replace expensive vendor solutions, we do.

Commodity hardware is generally re-configurable and reusable, and lets us treat our hardware like Lego Bricks. Free open source software allows us to see behind the curtain and work more easily with other existing tools. We’re empowered to fix our own issues by using the talent we already employ, rather than sitting on our hands waiting for a vendor support engineer to help us out (though we do like to keep that option available when possible). Additionally, by combining commodity hardware with automation tools like Puppet, we’re able to stay nimble.

By being nimble and leveraging in-house talent, Lego Brick-ish hardware, and free open source software, we’re able to save a considerable amount of cash. Saving cash on operational expenses enables us to make business decisions that would otherwise have been cost prohibitive. At Faithlife we have large-company problems with a small-company budget.

Network hardware

Not long ago we were exhausting a variety of Cisco and F5 1Gb network gear. Bottlenecks were popping up left and right, packet loss was high, retransmits were through the roof, and changes to network hardware happened at a glacial pace. We were beyond the limits of 1Gb, our topology was problematic, and shortcuts were continually being taken to keep up with the demand of our sites and services. At the same time, we had just begun moving to Puppet and automating our server deployments, which meant we could easily outpace network changes. Additionally, the gear did not fit our hardware philosophy.

Fast forward to today: our current data center topology is a modified spine and leaf, or “folded Clos,” design. We use OSPF to route traffic between cabinets, and a pair of leaf switches is placed in each cabinet. Each leaf switch pair forms a layer 2 boundary and lets us MLAG our servers for switch redundancy within that boundary. In addition, a pair of spine switches is placed in an end-of-row networking cabinet. Multiple edge routers and firewalls connect to an area border router via OSPF, and the edge routers in turn connect to our ISPs via BGP.
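To sketch what the cabinet-to-cabinet routing looks like, each leaf runs OSPF on its spine uplinks and advertises its cabinet’s subnets into area 0. The addressing and port names below are illustrative, not our production config (swp49/swp50 stand in for the uplinks to the two spines).

    ! Illustrative ospfd config for a leaf switch running Cumulus Linux.
    router ospf
     ospf router-id 10.255.0.11
     ! keep OSPF quiet on server-facing ports; speak it only on the uplinks
     passive-interface default
     no passive-interface swp49
     no passive-interface swp50
     ! advertise this cabinet's networks into the backbone area
     network 10.0.0.0/8 area 0.0.0.0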

Spine

Dell S6000-ON and Penguin Arctica 3200XL — both run Cumulus Linux

  • 32 Ports of 40Gb QSFP+

Leaf / Area Border Router

Dell S4810-ON and Penguin Arctica 4804X — both run Cumulus Linux

  • 48 Ports of 10Gb SFP+ plus 4 ports of 40Gb QSFP+

Management

Penguin Arctica 4804i — running Cumulus Linux

  • 48 Ports of 1Gb plus 4 ports of 10Gb SFP+

Edge Router / Firewall

Dell R610 1U Servers:

  • Dual Intel X520-DA2 NIC with Intel SFP+ optics
  • Dual Intel X5650 CPU
  • 96GB of RAM (Helps with Internet routing tables, IPS, firewall states, etc.)

Routers run Ubuntu Linux with Quagga for OSPF and BGP.
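For reference, on the Ubuntu packaging of Quagga the set of routing daemons started at boot is controlled by a small file. A sketch of what an edge router like ours would need enabled (the path is the stock Debian/Ubuntu location; the contents are illustrative):

    # /etc/quagga/daemons (Debian/Ubuntu quagga packaging)
    # zebra manages kernel routes and is required by the protocol daemons
    zebra=yes
    bgpd=yes
    ospfd=yes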

Firewalls run pfSense (FreeBSD-based) with Quagga for OSPF and Suricata for IPS.

Cables

Amphenol 10Gb SFP+ DAC

Amphenol 40Gb QSFP+ DAC

Amphenol 40Gb QSFP+ Active Optical

FiberStore multi-mode fiber

Transceivers

FiberStore 10Gb SR Optics

Intel 10Gb SR and LR Optics (for compatibility with X520-DA2 cards)

Seattle Data Center Network Cabinet

(please excuse the screwdriver and loose fiber; this was a work in progress at the time)

[Photo: Seattle data center network cabinet]