Storage unavailability Friday November 21st – 26th, 2014
I’d like to apologize for the trouble you undoubtedly had accessing Faithlife products and services between November 21st and 26th. The reliability and availability of Faithlife products and services are critical to your success and ours. Understanding what happened is a necessary step towards reducing the probability of this type of event happening again.
Summary of events (All times approximate)
4:00 PM Pacific on November 21st, a storage pool in our Bellingham data center had three of fifty-five drives marked down and out due to a failure to respond within five minutes to the rest of the cluster. Since our storage pools are configured to be triple-redundant, the cluster began a rebalance of its data to ensure the triple-redundant guarantee. Normally, a three-drive failure and rebalance would be a minor inconvenience. Unfortunately, so many virtual machines had been provisioned on this pool during the Logos 6 launch that IOPS demands on the pool were already at or above the pool’s capability. The result was slow, but available, disk. The three problematic disks were identified, but our logs and monitoring software did not point to an actual disk failure. The problem disks were manually marked down and out to prevent them from coming back into the cluster. Since there was plenty of redundant disk and things were functional, the plan was to replace the problem disks the next morning.
10:45 PM Pacific on November 21st, the rebalance stalled and disk operations were extremely degraded. Stalled object storage daemons were restarted one at a time. The rebalance continued and storage was somewhat usable again.
2:30 AM Pacific on November 22nd, four more drives were marked down and out. Enough disks had been lost that a large portion of the storage pool was experiencing paused disk operations as a protection against data loss. This event took a large portion of our web infrastructure down and left only a few systems able to function in a degraded state. Our monitoring systems did not produce data that suggested these disks were unhealthy. However, operating system logs pointed to problems with the XFS partitions. Further investigation showed that the disk controllers had marked these four drives as critical and that the battery-backed cache on one of the controllers had died. The four failed drives were manually marked down and out, and we headed to the data center to build up a new storage pool node. This node was to take the place of the failed drives, allow the cluster to start healing, and unpause disk operations. We also planned to immediately replace the controller with the failed battery-backed cache.
6:30 AM Pacific on November 22nd, while we were replacing the failed battery-backed cache, the power cord for one of the active and healthy nodes was accidentally pulled. The node was immediately brought back into service, but the sudden power loss resulted in two journal partitions becoming corrupt and the loss of the object storage daemons backing them. This brought the lost drive count to nine of fifty-five and furthered the degraded state of the pool. The battery-backed cache was properly replaced, and the new storage node was added by approximately 9:30 AM Pacific. The rebalance was able to continue, but at a very slow rate. We estimated it would take thirty-six hours before the pool was in a usable state. All available production resources had been consumed by the Logos 6 launch, and the decision was made to pull all resources from our on-premises lab and build out a parallel cloud deployment. This would allow us to quickly replace affected virtual machines while the storage pool recovered. In the meantime, virtual machines hosted by the affected storage pool were shut down to prevent them from servicing live requests when they were periodically available.
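As a rough tally of the drive losses described so far, the following Python sketch adds up the three events from the timeline. The counts come from the narrative above; treating each corrupted journal partition as one lost drive is an assumption made for illustration, and the names (`TOTAL_DRIVES`, `losses`, `cumulative_losses`) are hypothetical, not taken from any Faithlife or Ceph tooling.

```python
# Drive losses as reported in the timeline (all times approximate).
TOTAL_DRIVES = 55

losses = [
    ("Nov 21, 4:00 PM", 3),  # unresponsive drives marked down and out
    ("Nov 22, 2:30 AM", 4),  # drives flagged critical by the disk controllers
    ("Nov 22, 6:30 AM", 2),  # assumed: one drive per corrupted journal partition
]


def cumulative_losses(events, total):
    """Return (time, drives lost so far, fraction of pool lost) per event."""
    running = 0
    rows = []
    for when, count in events:
        running += count
        rows.append((when, running, running / total))
    return rows


for when, lost, fraction in cumulative_losses(losses, TOTAL_DRIVES):
    print(f"{when}: {lost} of {TOTAL_DRIVES} drives out ({fraction:.0%})")
```

The tally only illustrates how quickly the losses accumulated. It says nothing about which placement groups were affected, which is what ultimately determined whether writes could resume.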
9:00 PM Pacific on November 22nd, gear was obtained, racked and provisioned at our Bellingham data center. Proclaim- and Commerce-related sites and services were chosen as first recipients of the new deployment.
11:00 PM Pacific on November 22nd, all of Proclaim and its related sites and services were functional. Commerce-related virtual machines were provisioned and awaiting final configuration and code deployment. Other sites and services were provisioned and deployed as new hardware became available in the following days.
Between November 24th and November 25th, functionality was restored to all but our Exchange deployment. We did not want to restore Exchange from backup onto an alternative deployment because it meant losing some email. Our efforts turned entirely to successful recovery of the storage pool.
The storage pool rebalance had essentially finished, but writes were still paused. The pool had five incomplete and stuck placement groups, and hundreds of slow requests. Hope of a normal recovery was gone and we began working through documentation for troubleshooting slow requests and incomplete placement groups.
The documentation pointed us at four possible causes: a bad disk, a file system/kernel bug, an overloaded cluster, or an object storage daemon bug. It also proposed four possible resolutions: shut down virtual machines to reduce load, upgrade the kernel, upgrade Ceph, or restart object storage daemons with slow requests. Disks had been replaced, virtual machines were already shut off, and Ceph had been upgraded. Upgrading the kernel was not an appealing option because restarts would be required. Restarts meant either letting a rebalance happen while the drives went away, or placing the cluster in a no-recover state. Further rebalancing would put more stress on disks and put us at risk of losing more drives. Putting the cluster in a no-recover state, even momentarily, seemed inappropriate. Since it appeared that the five incomplete placement groups were causing the paused writes, the decision was made to mark the placement groups lost and deal with any potential data loss. Unfortunately, the cluster refused to respect marking these placement groups as lost. At this point we worked on the assumption that we’d hit a bug in Ceph and engaged the Ceph IRC channel, which proved unhelpful.
We felt that our options were digging into the Ceph source code or engaging InkTank support. We felt it necessary to make engaging InkTank support the first step. We were lucky enough to get six hours of free support from InkTank while they set up our newly purchased support contract. Their engineer walked through many of the same steps we had, and we were able to provide them with output and logs that accelerated their troubleshooting. The InkTank engineer concluded that we had hit a bug in Ceph and potentially an XFS bug in the particular Linux kernel used on this storage pool. The five placement groups in question were not assigned to any storage pools, which is a state that should never happen. After talking with Ceph developers, the InkTank engineer provided us with steps to work around the bug.
Unfortunately, the resolution included losing the data stored on the five placement groups. The data loss materialized as lost sectors to virtual machines, which meant running fsck/chkdsk on hundreds of virtual machines. The other fallout was that the Exchange databases needed a lot of repair.
How we’re changing
Try as they may to be redundant, OpenStack and Ceph architecturally force non-obvious single points of failure. Ceph is a nice transition away from traditional storage, but at the end of the day it is just a different implementation of the same thing. SAN and Software Defined Storage are all single points of failure when used for virtual machine storage. OpenStack enabled us to scale massively with commodity hardware, but proved unsustainable operationally speaking.
Starting with our emergency cloud deployment, we’ve moved away from OpenStack and centralized storage. Instead, we’ve gone with Joyent’s SmartDataCenter 7. SmartDataCenter 7 has made some key architectural decisions that better fit with our infrastructure philosophies. Simply put, each physical host in SmartDataCenter 7 is capable of surviving on its own as long as power and network are available.
Even great products like SmartDataCenter 7 can’t run if our data center suffers a power, cooling, or connectivity failure, which is why we’ve been working hard the last few months to get our brand new Seattle-area data center online. Not only will we have redundant hardware in different geographic locations, we’ll also have far more Internet connectivity in Seattle. This will result in reduced latency for our customers and the ability to withstand routing failures at the Internet Service Provider level.
Over the last year, the Development and Operations departments at Faithlife have had a large cultural shift which includes increased collaboration, shared responsibility, the removal of artificial boundaries that create “not my problem” scenarios, and making tooling, automation and alerting first-class products. Still, we have a lot of room for cultural growth. Admittedly, we knew our storage was running at or above its capability before disaster struck. However, because of a cultural tension between old and new, there was a real fear that changing anything at this critical time was riskier than just leaving things alone. This is a fallacy, and Baron Schwartz points this out better than I can in his blog post Why Deployment Freezes Don’t Prevent Outages.
Our Operations team went through a full year of being stretched mentally and physically. Not only is that not healthy for Faithlife’s employees, it reduces our quality of work and decision making. So we’re adding more people to the teams that support Faithlife’s infrastructure and making a proper work/life balance one of the most important goals for 2015.
I can’t say enough about the amazing team we have here at Faithlife. Operations and Development came together and worked an insane number of hours to mitigate and solve this massive problem in a very short amount of time.
Many of our customers left encouraging feedback in our forums during this outage, and I want to thank you all for that. The encouraging feedback was an uplift during a very trying time. Furthermore, thank you all for your business and understanding.