Faithlife’s sdc-portal

Today we’re pleased to announce that our developer- and customer-facing portal for Joyent’s SmartDataCenter 7 has been open sourced.

After transitioning from OpenStack to SDC 7, the only thing we were left wanting was a portal for developers and non-operations staff (SDC already has a portal for admin and operations engineers). Luckily, SDC has a fantastic set of APIs that we’ve leveraged to create sdc-portal. The portal started out of necessity, because our developers were used to having at least the ability to start, stop, and reboot their VMs, but it is growing into much more. Today we want to give back to the open source community that has helped us immensely and invite others to help us make sdc-portal even better.

The portal is in its infancy, and we’ll be iterating rapidly on the documentation and feature set in the next few weeks. However, we’ve been encouraged by Joyent and a few other organizations to make the code available now, due to the high demand and the large number of people eager to contribute.

Today the portal supports the following features:

  • OAuth sign-in
  • Integration with SmartDataCenter 7 and Joyent’s public cloud
  • Start, stop and reboot VMs
  • Get the current status and information about VMs
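To make the start/stop/reboot feature concrete, here is a minimal sketch of how such actions map onto SmartDataCenter’s CloudAPI. The endpoint shape follows the CloudAPI docs; the base URL, account name, and machine UUID below are hypothetical, and the HTTP Signature authentication CloudAPI requires is omitted for brevity.

```python
# Sketch of driving VM power actions through CloudAPI, the same API
# sdc-portal builds on. The base URL and identifiers are placeholders.
from urllib.parse import urlencode

CLOUDAPI = "https://cloudapi.example.com"  # hypothetical CloudAPI endpoint


def machine_action_url(login: str, machine_id: str, action: str) -> str:
    """Build the CloudAPI URL for a VM power action (start/stop/reboot)."""
    if action not in {"start", "stop", "reboot"}:
        raise ValueError(f"unsupported action: {action}")
    return f"{CLOUDAPI}/{login}/machines/{machine_id}?" + urlencode({"action": action})


# POSTing to this URL (with the proper auth headers) asks SDC to start the VM:
print(machine_action_url("jdoe", "b6979942-7d5d-4fe6-a2ec-b812e950625a", "start"))
```

Getting a VM’s current status is similarly a `GET` against the machine resource, which is what the portal polls to show state.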

In the next few days and weeks we’ll be adding the following features:

  • Generic authentication provider support
  • VM provisioning
  • SSH key management
  • Things the community comes up with…

We welcome your feedback in our sdc-portal group, in the #smartos IRC channel, or in the form of GitHub issues. We also welcome your pull requests!

sdc-portal lab

Hardware – Part II (Compute)

Compute hardware

Faithlife compute has gone through quite a few iterations in recent years. The transformation has been a critical piece of our success and ability to scale at costs that make sense to the business. Each iteration moves our deployment closer to being aligned with our overall philosophy.

Humble beginnings

Our first attempt at being more nimble, and at reducing costs over our aging and expensive IBM physical server deployments, was VMware vCenter on an IBM BladeCenter backed by an IBM DS3500 SAN. Yes, you read that correctly, and yes, we may not have thought that decision through entirely.


Plenty of flexibility was gained by virtualization, but the cost of the BladeCenter, blades, SAN, and VMware licensing meant that even the smallest incremental addition to the infrastructure represented a dollar amount that needed lots of discussion before approval. These factors led to projects being put on hold, developers not having the resources they needed, and Operations constantly battling an infrastructure running at or above capacity.

Commodity hardware, take one

Realizing that we were hamstrung by expensive hardware and licensing, we took to the basement and started a skunkworks project.

After a couple of weeks, and one hundred dollars, we emerged victorious from the basement. We had assembled twelve Dell OptiPlex 960 workstations as OpenStack compute nodes, three APC “PDUs”, six Cisco desktop switches, and some really awesome 1Gb Ethernet, all on a Costco rack. Believe it or not, we actually replaced a few of our aging development servers with this setup for quite a few months. Though I think we took commodity hardware a bit too seriously, and our datacenter wouldn’t allow us to deploy it in our cage.


Commodity hardware, take two

Having prototyped OpenStack, and shown that it had the potential to both run on commodity hardware and replace our current virtualization stack, we moved forward with a small production deployment to help deal with some of our capacity issues.

Our initial production OpenStack deployment consisted of three controllers, three compute nodes, and eight Ceph nodes. This was also the beginning of our servers becoming multi-purpose Lego Bricks. We used Dell R610 1U servers in one of three different configurations for everything. Additionally, we started keeping some spare memory, disk, CPU, and R610 chassis on hand. Since we had spare parts and a single kind of server, we could easily fix or replace any piece of our hardware infrastructure.


The relatively low cost of the Dell R610 1U servers combined with free and open source virtualization meant we could finally remove the dam that was holding back additional gear. It took less than four months to go from the initial nine servers to one and a half racks full of gear.

During the build-out we realized that our initial SAS-based Ceph nodes did not have sufficient performance for database volumes and were too expensive for general-purpose OS volumes. The solution was to add two new types of servers: Dell R620s filled with SSDs, and Dell R510s filled with SATA drives.


When OpenStack and Ceph went into a death spiral and we transitioned to Joyent’s SmartDataCenter, we were able to reuse this same hardware for the emergency deployment with minor configuration changes and on-hand parts (just one more reason Lego Bricks for hardware are so important).

Commodity hardware, take three (Joyent SmartDataCenter / current day)

Shortly before we transitioned to Joyent SmartDataCenter, we acquired space in a brand new datacenter. This gave us a nice green field to apply the last few years’ worth of hard-earned lessons and to build specifically for SmartDataCenter. Lucky for us, the great people at Joyent open sourced their bill of materials, which gave us a higher degree of confidence that our new build would be successful (after all, Joyent had already proven these builds in private and public clouds).

We really liked the Tenderloin-A/256 build based on price, disk performance, and density. Unfortunately, the Tenderloin-A/256 build is based on SuperMicro parts, and we’re more comfortable with Dell servers; we have a great relationship with Redapt, a Dell partner through whom we purchase most of our hardware. In that light, we worked with Redapt and Joyent to create a Dell build that is very close to Joyent’s Tenderloin-A/256.


Faithlife’s SmartDataCenter compute node bill of materials

  • 1 x Dell R720 Chassis
  • 2 x Intel Xeon E5-2650 v2
  • 1 x iDRAC7 Enterprise
  • 1 x Intel X520 DP 10Gb DA/SFP+ + I350 DP 1Gb Ethernet daughter card
  • 1 x Intel / Dell SR SFP+ Optical Transceiver
  • 16 x Dell 16GB RDIMM 1866MT/s (256GB total)
  • 2 x 750W Power Supply
  • 1 x 200GB Intel DC S3700 SSD
  • 1 x Kingston 16GB USB stick
  • 1 x SuperMicro AOC-S2308L-L8E SAS controller
  • 15 x C10K900 HGST 2.5” 10K 600GB SAS

We’ve been running SmartDataCenter on this build with hundreds of VMs for a while now. The performance is outstanding; in fact, some of our VMs that previously needed dedicated SSD are just as happy on this SAS-based configuration, thanks to SmartOS zones and ZFS.


Service Interruption 1/27/15

Between 9:50 PM PST and 10:28 PM PST on January 27th, 2015, most Faithlife sites and services were unable to talk to the public Internet. We’re sorry for the interruption this caused, and we’re taking steps to reduce the likelihood of this happening again.


Our edge routers needed to be patched due to the “Ghost” glibc vulnerability (CVE-2015-0235). The patching process on our primary edge router froze while updating Quagga, the daemon responsible for BGP and OSPF. The frozen patch process was subsequently killed, which unexpectedly killed the active Quagga daemon. When the Quagga daemon stops, that node is no longer able to advertise our ASN and public subnet. Normally, this should result in a very small interruption, because our secondary edge router should start advertising our public subnet over its already-established BGP session with a different ISP. Unfortunately, we are in the middle of making large changes to the secondary, and a few of the more important routes were misconfigured, which rendered the secondary edge router mostly unusable. Because the patch was being applied remotely over a VPN connection that relies on OSPF to talk to the router, the primary was now inaccessible. So we drove to the data center immediately, physically connected to the machine, completed the patching, and restarted the Quagga daemon.

What We’re Doing

We’re currently going through a re-configuration of our edge routers and firewalls which will enable us to advertise our ASN and public subnet from multiple geographically diverse locations with different Internet Service Providers. This is actually a project that we hoped to have completed before going into 2015, but contracts and difficulties with the physical layer proved tougher than expected. Once this is complete, an issue like this should only cause a very small interruption of service for a subset of our users. Additionally, we’ll be adding console switches with multiple out-of-band connectivity options, so that we shouldn’t have to burn the time it takes to run to the datacenter or create a remote-hands ticket.

Hardware – Part I (Network)

The hardware powering Faithlife has seen a massive transformation in the last eighteen months. We’re really excited about all the cool new changes and the measurable impact they’ve had on our employees, customers, and the products and features we’re able to offer. Given that, we thought sharing our hardware configuration would be a fun way to live our values and showcase what we think is pretty cool.


At Faithlife we value smart, versatile learners and automation over expensive vendor solutions. Smart, versatile learners don’t lose value when technology changes or the company changes direction; vendor solutions often do. If we can use commodity hardware and free open source software to replace expensive vendor solutions, we do.

Commodity hardware is generally re-configurable and reusable, and lets us treat our hardware like Lego Bricks. Free open source software allows us to see behind the curtain and more easily work with other existing tools. We’re empowered to fix our own issues by utilizing the talent we already employ, not just sit on our hands waiting for a vendor support engineer to help us out (though we do like to keep that option available when possible). Additionally, by combining commodity hardware with automation tools like Puppet, we’re able to be nimble.

By being nimble and leveraging in-house talent, Lego Brick-ish hardware, and free open source software, we’re able to save a considerable amount of cash. Saving cash on operational expenses enables us to make business decisions that would otherwise have been cost prohibitive. At Faithlife we have large-company problems with a small-company budget.

Network hardware

Not long ago we were exhausting a variety of Cisco and F5 1Gb network gear. Bottlenecks were popping up left and right, packet loss was high, retransmits were through the roof, and changes to network hardware happened at a glacial pace. We were beyond the limits of 1Gb, our topology was problematic, and shortcuts were continually being taken to keep up with the demand of our sites and services. At the same time, we had just begun the process of moving to Puppet and automating our server deployments, which meant we could easily outpace network changes. Additionally, the gear did not fit our hardware philosophy.

Fast forward to today: our current data center topology is a modified spine-and-leaf, or “folded Clos,” design. We use OSPF to route traffic between cabinets, and a pair of leaf switches is placed in each cabinet. Each leaf switch pair represents a layer 2 boundary and allows us to MLAG our servers to maintain switch redundancy within that boundary. In addition, a pair of spine switches is placed in an end-of-row networking cabinet. We have multiple edge routers and firewalls connected to an area border router via OSPF, and the edge routers are connected to ISPs via BGP.
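For flavor, the leaf-to-spine OSPF piece of the topology above could be expressed with a Quagga `ospfd.conf` fragment along these lines; the router ID, interface name, and subnet are illustrative placeholders, not our actual configuration.

```
! Hypothetical ospfd.conf fragment for one leaf switch
! (Quagga on Cumulus Linux; addresses and ports are placeholders)
router ospf
 ospf router-id 10.0.0.11
 ! advertise this cabinet's server subnet
 network 10.1.10.0/24 area 0.0.0.0
!
interface swp51
 ! uplink to a spine switch; point-to-point avoids DR/BDR election
 ip ospf network point-to-point
```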


Spine

Dell S6000-ON and Penguin Arctica 3200XL — both run Cumulus Linux

  • 32 Ports of 40Gb QSFP+

Leaf / Area Border Router

Dell S4810-ON and Penguin Arctica 4804X — both run Cumulus Linux

  • 48 Ports of 10Gb SFP+ plus 4 ports of 40Gb QSFP+


Penguin Arctica 4804i — running Cumulus Linux

  • 48 Ports of 1Gb plus 4 ports of 10Gb SFP+

Edge Router / Firewall

Dell R610 1U Servers:

  • Dual Intel X520-DA2 NIC with Intel SFP+ optics
  • Dual Intel X5650 CPU
  • 96GB of RAM (Helps with Internet routing tables, IPS, firewall states, etc.)

Routers run Ubuntu Linux with Quagga for OSPF and BGP.
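A `bgpd.conf` fragment for one edge router might look roughly like the following; the ASN, router ID, prefix, and neighbor address are placeholders, not our real values.

```
! Hypothetical Quagga bgpd.conf fragment for one edge router
router bgp 64512
 bgp router-id 192.0.2.1
 ! advertise our public subnet to the upstream ISP
 network 198.51.100.0/24
 neighbor 203.0.113.1 remote-as 64513
 neighbor 203.0.113.1 description upstream-isp-a
```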

Firewalls run pfSense (FreeBSD-based) with Quagga for OSPF and Suricata for IPS.


Amphenol 10Gb SFP+ DAC

Amphenol 40Gb QSFP+ DAC

Amphenol 40Gb QSFP+ Active Optical

FiberStore multi-mode fiber


FiberStore 10Gb SR Optics

Intel 10Gb SR and LR Optics (for compatibility with X520-DA2 cards)

Seattle Data Center Network Cabinet

(please excuse the screwdriver and loose fiber, this was a work in progress at the time)



Storage unavailability Friday November 21st – 26th, 2014

I’d like to apologize for the trouble you undoubtedly had accessing Faithlife products and services between November 21st and 26th. The reliability and availability of Faithlife products and services is critical to your success and ours. Understanding what happened is a necessary step towards reducing the probability of this type of event happening again.

Summary of events (All times approximate)

4:00 PM Pacific on November 21st, a storage pool in our Bellingham data center had three of its fifty-five drives marked down and out after failing to respond to the rest of the cluster within five minutes. Since our storage pools are configured to be triple-redundant, the cluster began a rebalance of its data to restore the triple-redundancy guarantee. Normally, a three-drive failure and rebalance would be a minor inconvenience. Unfortunately, so many virtual machines had been provisioned on this pool during the Logos 6 launch that IOPS demands were already at or above the pool’s capability. The result was slow, but available, disk. The three problematic disks were identified, but our logs and monitoring software did not point to an actual disk failure. The problem disks were manually marked down and out to prevent them from coming back into the cluster. Since there was plenty of redundant disk and things were functional, the plan was to replace the problem disks the next morning.
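The triple-redundancy and down/out behavior described above correspond to a handful of Ceph settings; a sketch of the relevant `ceph.conf` lines might look like the following (the values are illustrative of the behavior described, not necessarily our exact configuration).

```
# Hypothetical ceph.conf fragment for a triple-redundant pool
[global]
# keep three replicas of every object
osd pool default size = 3
# keep serving I/O as long as at least two replicas remain
osd pool default min size = 2
# mark an unresponsive OSD "out" after 300 seconds (five minutes),
# which triggers a rebalance onto the remaining OSDs
mon osd down out interval = 300
```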

10:45 PM Pacific on November 21st, the rebalance stalled and disk operations were extremely degraded. Stalled object storage daemons were restarted one at a time. The rebalance continued, and storage was somewhat usable again.

2:30 AM Pacific on November 22nd, four more drives were marked down and out. Enough disks had been lost that a large portion of the storage pool was experiencing paused disk operations as a protection against data loss. This event took a large portion of our web infrastructure down and left only a few systems able to function in a degraded state. Our monitoring systems did not produce data that suggested these disks were unhealthy; however, operating system logs pointed to problems with the XFS partitions. Further investigation showed that the disk controllers had marked these four drives as critical, and that one of the controllers had its battery-backed cache die. The four failed drives were manually marked down and out, and we headed to the data center to build up a new storage pool node. This node was to take the place of the failed drives, allow the cluster to start healing, and unpause disk operations. We also planned to immediately replace the controller with the failed battery-backed cache.

6:30 AM Pacific on November 22nd, while we were replacing the failed battery-backed cache, the power cord for one of the active and healthy nodes was accidentally pulled. The node was immediately brought back into service, but the sudden power loss resulted in two journal partitions becoming corrupt and the loss of the object storage daemons backing them. This brought the lost-drive count to nine of fifty-five and furthered the degraded state of the pool. The battery-backed cache was properly replaced, and the new storage node was added by approximately 9:30 AM Pacific. The rebalance was able to continue, but at a very slow rate; we estimated it would take thirty-six hours before the pool was in a usable state. All available production resources had been consumed by the Logos 6 launch, so the decision was made to pull all resources from our on-premise lab and build out a parallel cloud deployment. This would allow us to quickly replace affected virtual machines while the storage pool recovered. In the meantime, virtual machines hosted by the affected storage pool were shut down to prevent them from servicing live requests when they were periodically available.

9:00 PM Pacific on November 22nd, gear was obtained, racked and provisioned at our Bellingham data center. Proclaim and Commerce related sites and services were chosen as first recipients of the new deployment.

11:00 PM Pacific on November 22nd, all of Proclaim and its related sites and services were functional. Commerce related virtual machines were provisioned and awaiting final configuration and code deployment. Other sites and services were provisioned and deployed as new hardware became available in the following days.

Between November 24th and November 25th, functionality had been restored to all but our Exchange deployment. We did not want to restore Exchange from backup onto the alternative deployment, because it meant losing some email. Our efforts turned entirely to the recovery of the storage pool.

The storage pool rebalance had essentially finished, but writes were still paused. The pool had five incomplete and stuck placement groups, and hundreds of slow requests. Hope of a normal recovery was gone and we began working through documentation for troubleshooting slow requests and incomplete placement groups.

The documentation pointed us at four possible causes: a bad disk, a file system or kernel bug, an overloaded cluster, or an object storage daemon bug. It also proposed four possible resolutions: shut down virtual machines to reduce load, upgrade the kernel, upgrade Ceph, or restart the object storage daemons with slow requests. Disks were replaced, virtual machines were already shut off, and Ceph was upgraded. Upgrading the kernel was not an appealing option because restarts would be required, and restarts meant either letting a rebalance happen while the drives went away or placing the cluster in a no-recover state. Further rebalancing would put more stress on the disks and put us at risk of losing more drives, while putting the cluster in a no-recover state, even momentarily, seemed inappropriate. Since it appeared that the five incomplete placement groups were causing the paused writes, the decision was made to mark the placement groups lost and deal with any potential data loss. Unfortunately, the cluster refused to respect marking these placement groups as lost. At this point we worked on the assumption that we’d hit a bug in Ceph and engaged the Ceph IRC channel, which proved unhelpful.
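For reference, the kind of Ceph commands involved in the troubleshooting above look roughly like this; the OSD and placement group IDs are placeholders, these require a live cluster, and this is a sketch of the commands discussed, not a runbook.

```
# Manually mark a misbehaving OSD down and out so it stops serving data
ceph osd down osd.12
ceph osd out osd.12

# List stuck/incomplete placement groups, then inspect one of them
ceph pg dump_stuck inactive
ceph pg 3.7f query

# Declare a dead OSD permanently lost so recovery can proceed
ceph osd lost 12 --yes-i-really-mean-it
```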

We felt as if our options consisted of digging into the Ceph source code or engaging Inktank support, and we made engaging Inktank support the first step. We were lucky enough to get six hours of free support from Inktank while they set up our newly purchased support contract. Their engineer walked through many of the same steps we had, and we were able to provide output and logs that accelerated the troubleshooting. The Inktank engineer determined that we had hit a bug in Ceph, and potentially an XFS bug in the particular Linux kernel used on this storage pool. The five placement groups in question were not assigned to any storage pools, which is a state that should never happen. After talking with the Ceph developers, the Inktank engineer provided us with steps to work around the bug.

Unfortunately, the resolution included losing the data stored on the five placement groups. The data loss materialized as lost sectors on virtual machines, which meant running fsck/chkdsk on hundreds of them. The other fallout was that the Exchange databases needed a lot of repair.

How we’re changing

Try as they may to be redundant, OpenStack and Ceph architecturally force non-obvious single points of failure. Ceph is a nice transition away from traditional storage, but at the end of the day it is just a different implementation of the same thing: SANs and software-defined storage are all single points of failure when used for virtual machine storage. OpenStack enabled us to scale massively with commodity hardware, but proved unsustainable, operationally speaking.

Starting with our emergency cloud deployment, we’ve moved away from OpenStack and centralized storage. Instead, we’ve gone with Joyent’s SmartDataCenter 7. SmartDataCenter 7 has made some key architectural decisions that better fit with our infrastructure philosophies. Simply put, each physical host in SmartDataCenter 7 is capable of surviving on its own as long as power and network are available.

Even great products like SmartDataCenter 7 can’t run if our data center suffers a power, cooling, or connectivity failure, which is why we’ve been working hard the last few months to get our brand new Seattle-area data center online. Not only will we have redundant hardware in different geographic locations, we’ll also have far more Internet connectivity in Seattle. This will result in reduced latency for our customers and the ability to withstand routing failures at the Internet Service Provider level.

Over the last year, the Development and Operations departments at Faithlife have gone through a large cultural shift, which includes increased collaboration, shared responsibility, the removal of artificial boundaries that create “not my problem” scenarios, and making tooling, automation, and alerting first-class products. Still, we have a lot of room for cultural growth. Admittedly, we knew our storage was running at or above its capability before disaster struck. However, because of a cultural tension between old and new, there was a real fear that changing anything at this critical time was riskier than just leaving things alone. This is a fallacy, and Baron Schwartz points it out better than I can in his blog post Why Deployment Freezes Don’t Prevent Outages.

Our Operations team went through a full year of being stretched mentally and physically. Not only is that unhealthy for Faithlife’s employees, it reduces our quality of work and decision making. So we’re adding more people to the teams that support Faithlife’s infrastructure and making a proper work/life balance one of the most important goals for 2015.

I can’t say enough about the amazing team we have here at Faithlife. Operations and Development came together and worked an insane amount of hours to mitigate and solve this massive problem in a very short amount of time.

Thank you

Many of our customers left encouraging feedback in our forums during this outage, and I want to thank you all for that. The encouraging feedback was an uplift during a very trying time. Furthermore, thank you all for your business and understanding.