Service Interruption 1/27/15

Between 9:50 PM PST and 10:28 PM PST on January 27th, 2015, most Faithlife sites and services were unable to talk to the public Internet. We’re sorry for the interruption this caused, and we’re taking steps to reduce the likelihood of this happening again.

Cause

Our edge routers needed to be patched for the “GHOST” glibc vulnerability (CVE-2015-0235). The patching process on our primary edge router froze while updating Quagga, the daemon responsible for BGP and OSPF. The frozen patch process was subsequently killed, which unexpectedly killed the active Quagga daemon. When the Quagga daemon stops, that node is no longer able to advertise our ASN and public subnet. Normally, this should result in a very small interruption, because our secondary edge router should start advertising our public subnet over its already established BGP session with a different ISP. Unfortunately, we are in the middle of making large changes to that secondary router, and a few of its more important routes were misconfigured, which rendered it mostly unusable. Because the patch was being applied remotely over a VPN connection that relies on OSPF to reach the router, the primary edge router was now inaccessible. So we drove to the data center immediately, physically connected to the machine, completed the patching, and restarted the Quagga daemon.
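
For illustration, the recovery once we were physically connected looked roughly like the sketch below; the neighbor address shown is a placeholder, not one of our real peers.

    # Hypothetical recovery steps on the primary edge router (Ubuntu + Quagga);
    # the neighbor address 198.51.100.1 is a placeholder.
    sudo service quagga restart        # bring bgpd and ospfd back up
    vtysh -c 'show ip bgp summary'     # confirm the BGP session to the ISP is Established
    vtysh -c 'show ip bgp neighbors 198.51.100.1 advertised-routes'   # our public subnet should be advertised again
    vtysh -c 'show ip ospf neighbor'   # OSPF adjacencies (and with them the VPN path) should re-form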

What We’re Doing

We’re currently going through a re-configuration of our edge routers and firewalls that will enable us to advertise our ASN and public subnet from multiple geographically diverse locations with different Internet Service Providers. This is a project we had hoped to complete before going into 2015, but contracts and difficulties with the physical layer proved tougher than expected. Once it’s complete, an issue like this should only cause a very small interruption of service for a subset of our users. Additionally, we’ll be adding console switches with multiple out-of-band connectivity options, so we won’t have to burn the time it takes to run to the data center or create a remote-hands ticket.

Hardware – Part I (Network)

The hardware powering Faithlife has seen a massive transformation in the last eighteen months. We’re really excited about all the cool new changes and the measurable impact they’ve had on our employees, customers, and the products and features we’re able to offer. Given that, we thought sharing our hardware configuration would be a fun way to live our values and showcase what we think is pretty cool.

Philosophy

At Faithlife we value smart, versatile learners and automation over expensive vendor solutions. Smart, versatile learners don’t lose value when technology changes or the company changes direction; vendor solutions often do. If we can use commodity hardware and free open source software to replace expensive vendor solutions, we do.

Commodity hardware is generally re-configurable and reusable, and lets us treat our hardware like Lego bricks. Free open source software allows us to see behind the curtain and work more easily with other existing tools. We’re empowered to fix our own issues by utilizing the talent we already employ, rather than sitting on our hands waiting for a vendor support engineer to help us out (though we do like to keep that option available when possible). Additionally, by combining commodity hardware with automation tools like Puppet, we’re able to be nimble.

By being nimble and leveraging in-house talent, Lego-brick-ish hardware, and free open source software, we’re able to save a considerable amount of cash. Saving cash on operational expenses enables us to make business decisions that would have otherwise been cost prohibitive. At Faithlife we have large-company problems with a small-company budget.

Network hardware

Not long ago we were exhausting a variety of Cisco and F5 1Gb network gear. Bottlenecks were popping up left and right, packet loss was high, retransmits were through the roof, and changes to network hardware happened at a glacial pace. We were beyond the limits of 1Gb, our topology was problematic, and shortcuts were continually being taken in order to keep up with the demand of our sites and services. At the same time, we had just begun the process of moving to Puppet and automating our server deployments, which meant we could easily outpace network changes. Additionally, the gear did not fit our hardware philosophy.

Fast forward to today: our current data center topology is a modified spine and leaf, or “folded Clos,” design. We use OSPF to route traffic between cabinets, and a pair of leaf switches is placed in each cabinet. Each leaf switch pair represents a layer 2 boundary and allows us to MLAG our servers for switch redundancy within that boundary. In addition, a pair of spine switches is placed in an end-of-row networking cabinet. We have multiple edge routers and firewalls connected to an area border router via OSPF, and the edge routers are connected to ISPs via BGP.
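
As a rough sketch of what day-to-day verification looks like on this topology, these are the kinds of checks we can run from the shell of a Cumulus Linux leaf switch; the specifics are illustrative rather than output from our gear.

    # Illustrative health checks on a Cumulus Linux leaf switch
    vtysh -c 'show ip ospf neighbor'   # adjacencies toward the spine should be in Full state
    vtysh -c 'show ip route ospf'      # routes to other cabinets learned across the spine
    clagctl                            # MLAG (CLAG) status for the leaf pair serving this cabinet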

Spine

Dell S6000-ON and Penguin Arctica 3200XL — both run Cumulus Linux

  • 32 ports of 40Gb QSFP+

Leaf / Area Border Router

Dell S4810-ON and Penguin Arctica 4804X — both run Cumulus Linux

  • 48 ports of 10Gb SFP+, plus 4 ports of 40Gb QSFP+

Management

Penguin Arctica 4804i — running Cumulus Linux

  • 48 ports of 1Gb, plus 4 ports of 10Gb SFP+

Edge Router / Firewall

Dell R610 1U Servers:

  • Dual Intel X520-DA2 NIC with Intel SFP+ optics
  • Dual Intel X5650 CPU
  • 96GB of RAM (Helps with Internet routing tables, IPS, firewall states, etc.)

Routers run Ubuntu Linux with Quagga for OSPF and BGP.

Firewalls run pfSense (FreeBSD-based) with Quagga for OSPF and Suricata for IPS.

Cables

Amphenol 10Gb SFP+ DAC

Amphenol 40Gb QSFP+ DAC

Amphenol 40Gb QSFP+ Active Optical

FiberStore multi-mode fiber

Transceivers

FiberStore 10Gb SR Optics

Intel 10Gb SR and LR Optics (for compatibility with X520-DA2 cards)

Seattle Data Center Network Cabinet

(please excuse the screwdriver and loose fiber, this was a work in progress at the time)

[Photo: our Seattle data center network cabinet]

SATApocalypse

Storage unavailability Friday November 21st – 26th, 2014

I’d like to apologize for the trouble you undoubtedly had accessing Faithlife products and services between November 21st and 26th. The reliability and availability of Faithlife products and services is critical to your success and ours. Understanding what happened is a necessary step towards reducing the probability of this type of event happening again.

Summary of events (All times approximate)

4:00 PM Pacific on November 21st, a storage pool in our Bellingham data center had three of fifty-five drives marked down and out due to a failure to respond to the rest of the cluster within five minutes. Since our storage pools are configured to be triple-redundant, the cluster began a rebalance of its data to ensure the triple-redundant guarantee. Normally, a three-drive failure and rebalance would be a minor inconvenience. Unfortunately, so many virtual machines had been provisioned on this pool during the Logos 6 launch that IOPS demands were already at or above the pool’s capability. The result was slow, but available, disk. The three problematic disks were identified, but our logs and monitoring software did not point to an actual disk failure. The problem disks were manually marked down and out to prevent them from coming back into the cluster. Since there was plenty of redundant disk and things were functional, the plan was to replace the problem disks the next morning.
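
For context, the pool is a Ceph cluster (as the rest of this post makes clear), and manually marking a suspect drive down and out looks roughly like this; osd.12 is a placeholder id, not one of the drives involved.

    # Illustrative commands for removing a suspect drive (OSD) from a Ceph cluster;
    # osd.12 is a placeholder id.
    ceph osd down osd.12   # mark the daemon down so the rest of the cluster stops waiting on it
    ceph osd out osd.12    # remove it from data placement so the rebalance routes around it
    ceph -s                # watch overall health and rebalance progress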

10:45 PM Pacific on November 21st, the rebalance stalled and disk operations were extremely degraded. Stalled object storage daemons were restarted one at a time. The rebalance continued and storage was somewhat usable again.

2:30 AM Pacific on November 22nd, four more drives were marked down and out. Enough disks had been lost that a large portion of the storage pool was experiencing paused disk operations as a protection against data loss. This event took a large portion of our web infrastructure down and left only a few systems able to function in a degraded state. Our monitoring systems did not produce data that suggested these disks were unhealthy. However, operating system logs pointed to problems with the XFS partitions. Further investigation showed that the disk controllers had marked these four drives as critical and that one of the controllers had its battery-backed cache die. The four failed drives were manually marked down and out, and we headed to the data center to build up a new storage pool node. This node was to take the place of the failed drives, allow the cluster to start healing, and unpause disk operations. We also planned to immediately replace the controller with the failed battery-backed cache.
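
The “paused disk operations” are Ceph protecting itself: when a placement group has fewer surviving copies than the pool’s minimum replica count, I/O to it blocks until enough copies return. A quick way to see those thresholds is sketched below; the pool name is a placeholder.

    # Illustrative: inspect the replica thresholds on a pool; "volumes" is a placeholder pool name.
    ceph osd pool get volumes size       # desired replica count (3 for a triple-redundant pool)
    ceph osd pool get volumes min_size   # minimum replicas a placement group needs to keep serving I/O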

6:30 AM Pacific on November 22nd, while replacing the failed battery-backed cache, the power cord for one of the active and healthy nodes was accidentally pulled. The node was immediately brought back into service, but the sudden power loss resulted in two journal partitions becoming corrupt and the loss of the object storage daemons backing them. This brought the lost drive count to nine of fifty-five and furthered the degraded state of the pool. The battery-backed cache was properly replaced, and the new storage node was added by approximately 9:30 AM Pacific. The rebalance was able to continue, but at a very slow rate; we estimated it would take thirty-six hours before the pool was in a usable state. All available production resources had been consumed by the Logos 6 launch, so the decision was made to pull all resources from our on-premises lab and build out a parallel cloud deployment. This would allow us to quickly replace affected virtual machines while the storage pool recovered. In the meantime, virtual machines hosted by the affected storage pool were shut down to prevent them from servicing live requests when they were periodically available.

9:00 PM Pacific on November 22nd, gear was obtained, racked, and provisioned at our Bellingham data center. Proclaim and Commerce-related sites and services were chosen as the first recipients of the new deployment.

11:00 PM Pacific on November 22nd, all of Proclaim and its related sites and services were functional. Commerce-related virtual machines were provisioned and awaiting final configuration and code deployment. Other sites and services were provisioned and deployed as new hardware became available in the following days.

Between November 24th and November 25th, functionality had been restored to all but our Exchange deployment. We did not want to restore Exchange from backup onto the alternative deployment because it meant losing some email. Our efforts turned entirely to the successful recovery of the storage pool.

The storage pool rebalance had essentially finished, but writes were still paused. The pool had five incomplete and stuck placement groups, and hundreds of slow requests. Hope of a normal recovery was gone, and we began working through the documentation for troubleshooting slow requests and incomplete placement groups.

The documentation pointed us at four possible causes: a bad disk, a file system or kernel bug, an overloaded cluster, or an object storage daemon bug. It also proposed four possible resolutions: shut down virtual machines to reduce load, upgrade the kernel, upgrade Ceph, or restart object storage daemons with slow requests. Disks had been replaced, virtual machines were already shut off, and Ceph was upgraded. Upgrading the kernel was not an appealing option because restarts would be required. Restarts meant either letting a rebalance happen while the drives went away, or placing the cluster in a no-recover state. Further rebalancing would put more stress on the disks and put us at risk of losing more drives, and putting the cluster in a no-recover state, even momentarily, seemed inappropriate. Since it appeared that the five incomplete placement groups were causing the paused writes, the decision was made to mark the placement groups lost and deal with any potential data loss. Unfortunately, the cluster refused to respect marking these placement groups as lost. At this point we worked on the assumption that we’d hit a bug in Ceph and engaged the Ceph IRC channel, which proved unhelpful.
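
For readers unfamiliar with Ceph, the troubleshooting above revolves around placement-group tooling of roughly this shape; the placement group id 4.1f is a placeholder, not one of our five.

    # Illustrative placement-group troubleshooting; the id 4.1f is a placeholder.
    ceph health detail                      # lists stuck/incomplete placement groups and slow requests
    ceph pg dump_stuck inactive             # placement groups that are stuck and not serving I/O
    ceph pg 4.1f query                      # peering state and which OSDs the group is waiting on
    ceph pg 4.1f mark_unfound_lost revert   # one way of declaring data lost so I/O can resume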

We felt as if our options consisted of digging into the Ceph source code or engaging Inktank support, and we felt it necessary to make engaging Inktank support the first step. We were lucky enough to get six hours of free support from Inktank while they set up our newly purchased support contract. Their engineer walked through many of the same steps we had, and we were able to provide them with output and logs that accelerated their troubleshooting. The Inktank engineer concluded that we had hit a bug in Ceph, and potentially an XFS bug in the particular Linux kernel used on this storage pool. The five placement groups in question were not assigned to any storage pools, which is a state that should never happen. After talking with the Ceph developers, the Inktank engineer provided us with steps to work around the bug.

Unfortunately, the resolution included losing the data stored on the five placement groups. The data loss materialized as lost sectors in the virtual machines, which meant running fsck/chkdsk on hundreds of them. The other fallout was that the Exchange databases needed a lot of repair.

How we’re changing

Try as they may to be redundant, OpenStack and Ceph architecturally force non-obvious single points of failure. Ceph is a nice transition away from traditional storage, but at the end of the day it is just a different implementation of the same thing: SANs and software-defined storage are both single points of failure when used for virtual machine storage. OpenStack enabled us to scale massively with commodity hardware, but proved operationally unsustainable.

Starting with our emergency cloud deployment, we’ve moved away from OpenStack and centralized storage. Instead, we’ve gone with Joyent’s SmartDataCenter 7. SmartDataCenter 7 has made some key architectural decisions that better fit with our infrastructure philosophies. Simply put, each physical host in SmartDataCenter 7 is capable of surviving on its own as long as power and network are available.

Even great products like SmartDataCenter 7 can’t run if our data center suffers a power, cooling, or connectivity failure, which is why we’ve been working hard the last few months to get our brand new Seattle-area data center online. Not only will we have redundant hardware in different geographic locations, we’ll also have far more Internet connectivity in Seattle. This will result in reduced latency for our customers and the ability to withstand routing failures at the Internet Service Provider level.

Over the last year, the Development and Operations departments at Faithlife have gone through a large cultural shift, which includes increased collaboration, shared responsibility, the removal of artificial boundaries that create “not my problem” scenarios, and making tooling, automation, and alerting first-class products. Still, we have a lot of room for cultural growth. Admittedly, we knew our storage was running at or above its capability before disaster struck. However, because of a cultural tension between old and new, there was a real fear that changing anything at this critical time was riskier than just leaving things alone. This is a fallacy, and Baron Schwartz points it out better than I can in his blog post Why Deployment Freezes Don’t Prevent Outages.

Our Operations team went through a full year of being stretched mentally and physically. Not only is that unhealthy for Faithlife’s employees, it reduces the quality of our work and decision making. So we’re adding more people to the teams that support Faithlife’s infrastructure and making a proper work/life balance one of the most important goals for 2015.

I can’t say enough about the amazing team we have here at Faithlife. Operations and Development came together and worked an insane number of hours to mitigate and solve this massive problem in a very short amount of time.

Thank you

Many of our customers left encouraging feedback in our forums during this outage, and I want to thank you all for that. That encouragement was an uplift during a very trying time. Furthermore, thank you all for your business and understanding.