Hardware – Part II (Compute)

Compute hardware

Faithlife compute has gone through quite a few iterations in recent years. The transformation has been a critical piece of our success and ability to scale at costs that make sense to the business. Each iteration moves our deployment closer to being aligned with our overall philosophy.

Humble beginnings

Our first attempt at becoming more nimble and reducing costs compared to our aging, expensive IBM physical servers was VMware vCenter on an IBM BladeCenter backed by an IBM DS3500 SAN. Yes, you read that correctly, and yes, we may not have thought that decision through entirely.


Plenty of flexibility was gained through virtualization, but the cost of the BladeCenter, blades, SAN, and VMware licensing meant that even the smallest incremental addition to the infrastructure represented a dollar amount that needed lots of discussion before approval. These factors led to projects being put on hold, developers not having the resources they needed, and Operations constantly battling an infrastructure running at or above capacity.

Commodity hardware, take one

Realizing that we were hamstrung by expensive hardware and licensing, we took to the basement and started a skunkworks project.

After a couple of weeks, and one hundred dollars, we emerged victorious from the basement. We had assembled twelve Dell OptiPlex 960 workstations as OpenStack compute nodes, three APC “PDUs”, six Cisco desktop switches, and some really awesome 1Gb Ethernet, all on a Costco rack. Believe it or not, we actually replaced a few of our aging development servers with this setup for quite a few months. We may have taken commodity hardware a bit too seriously, though; our datacenter wouldn’t allow us to deploy it in our cage.


Commodity hardware, take two

Having prototyped OpenStack, and shown that it had the potential to both run on commodity hardware and replace our current virtualization stack, we moved forward with a small production deployment to help deal with some of our capacity issues.

Our initial production OpenStack deployment consisted of three controllers, three compute nodes, and eight Ceph nodes. This was also the beginning of our servers becoming multi-purpose Lego Bricks: we used Dell R610 1U servers, in one of three configurations, for everything. Additionally, we started keeping spare memory, disks, CPUs, and R610 chassis on hand. Since we had spare parts and a single kind of server, we could easily fix or replace any piece of our hardware infrastructure.


The relatively low cost of the Dell R610 1U servers combined with free and open source virtualization meant we could finally remove the dam that was holding back additional gear. It took less than four months to go from the initial nine servers to one and a half racks full of gear.

During the build-out we realized that our initial SAS-based Ceph nodes did not have sufficient performance for database volumes and were too expensive for general purpose OS volumes. The solution was to add two new types of servers: Dell R620s filled with SSDs (for database volumes) and Dell R510s filled with SATA drives (for general purpose OS volumes).
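
In Ceph, steering different volume types onto different hardware like this is typically done by giving each class of node its own CRUSH hierarchy and pointing a pool at it. The commands below are only a rough sketch of that approach (the bucket, host, and pool names are hypothetical, and exact syntax varies between Ceph releases), not our production configuration:

  # Hypothetical CRUSH roots for the two storage tiers
  ceph osd crush add-bucket ssd root
  ceph osd crush add-bucket sata root
  # Place each host under the matching root (host names are made up)
  ceph osd crush move r620-ssd-01 root=ssd
  ceph osd crush move r510-sata-01 root=sata
  # One placement rule per tier, replicating across hosts
  ceph osd crush rule create-simple ssd-rule ssd host
  ceph osd crush rule create-simple sata-rule sata host
  # Separate pools for database volumes and general purpose OS volumes
  ceph osd pool create db-volumes 1024
  ceph osd pool create os-volumes 1024
  ceph osd pool set db-volumes crush_ruleset 1   # ruleset IDs come from "ceph osd crush rule dump"
  ceph osd pool set os-volumes crush_ruleset 2

The block storage layer (Cinder, in OpenStack’s case) can then expose each pool as its own volume type.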


When OpenStack and Ceph went into a death spiral and we transitioned to Joyent’s SmartDataCenter, we were able to reuse this same hardware for the emergency deployment with minor configuration changes and on-hand parts (just one more reason treating hardware like Lego Bricks is so important).

Commodity hardware, take three (Joyent SmartDataCenter / current day)

Shortly before we transitioned to Joyent SmartDataCenter, we acquired space in a brand new datacenter. This gave us a nice green field to apply the last few years’ worth of hard-earned lessons and to build specifically for SmartDataCenter. Lucky for us, the great people at Joyent open sourced their bill of materials, which gave us a higher degree of confidence that our new build would be successful (after all, Joyent had already proven these builds in private and public clouds).

We really liked the Tenderloin-A/256 build based on its price, disk performance, and density. Unfortunately, the Tenderloin-A/256 build uses SuperMicro parts, and we’re more comfortable with Dell servers; we also have a great relationship with Redapt, a Dell partner we purchase most of our hardware through. With that in mind, we worked with Redapt and Joyent to create a Dell build that is very close to Joyent’s Tenderloin-A/256 build.


Faithlife’s SmartDataCenter compute node bill of materials

  • 1 x Dell R720 Chassis
  • 2 x Intel Xeon E5-2650 v2
  • 1 x iDRAC7 Enterprise
  • 1 x Intel X520 DP 10Gb DA/SFP+ + I350 DP 1Gb Ethernet Daughter Card
  • 1 x Intel / Dell SR SFP+ Optical Transceiver
  • 16 x Dell 16GB RDIMM 1866MT/s (256GB total)
  • 2 x 750W Power Supply
  • 1 x 200GB Intel DC S3700 SSD
  • 1 x Kingston 16GB USB stick
  • 1 x SuperMicro AOC-S2308L-L8E SAS controller
  • 15 x HGST Ultrastar C10K900 2.5” 10K 600GB SAS

We’ve been running SmartDataCenter on this build with hundreds of VMs for a while now. The performance is outstanding; in fact, some of our VMs that previously needed dedicated SSD are just as happy on this SAS-based configuration, thanks to SmartOS zones and ZFS.
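
A big part of that is almost certainly the ZFS ARC: with 256GB of RAM in each compute node, most hot data is served from memory rather than from the SAS spindles, and zones add essentially no virtualization overhead on top. As a minimal example (commands run from the global zone on a SmartOS node; what counts as a healthy hit rate depends entirely on the workload), you can sanity check the cache behavior like this:

  # ARC size and hit/miss counters via the illumos kstat interface
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:hits zfs:0:arcstats:misses
  # Watch how much I/O actually reaches the disks (the default pool is named "zones")
  zpool iostat zones 5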

 

Hardware – Part I (Network)

The hardware powering Faithlife has seen a massive transformation in the last eighteen months. We’re really excited about all the cool new changes and the measurable impact they’ve had on our employees, our customers, and the products and features we’re able to offer. Given that, we thought sharing our hardware configuration would be a fun way to live our values and showcase what we think is pretty cool.

Philosophy

At Faithlife we value smart, versatile learners and automation over expensive vendor solutions. Smart, versatile learners don’t lose value when technology changes or the company changes direction; vendor solutions often do. If we can use commodity hardware and free open source software to replace expensive vendor solutions, we do.

Commodity hardware is generally re-configurable and reusable, which lets us treat our hardware like Lego Bricks. Free open source software allows us to see behind the curtain and work more easily with our other existing tools. We’re empowered to fix our own issues by utilizing the talent we already employ, rather than sitting on our hands waiting for a vendor support engineer to help us out (though we do like to keep that option available when possible). Additionally, by combining commodity hardware with automation tools like Puppet, we’re able to be nimble.

By being nimble and leveraging in-house talent, Lego Brick-ish hardware, and free open source software, we’re able to save a considerable amount of cash. Saving cash on operational expenses enables us to make business decisions that would have otherwise been cost prohibitive. At Faithlife we have large-company problems with a small-company budget.

Network hardware

Not long ago we were exhausting a variety of Cisco and F5 1Gb network gear. Bottlenecks were popping up left and right, packet loss was high, retransmits were through the roof, and changes to network hardware happened at a glacial pace. We were beyond the limits of 1Gb, our topology was problematic, and shortcuts were continually being taken in order to keep up with the demand of our sites and services. At the same time, we had just begun the process of moving to Puppet and automating our server deployments, which meant we could easily outpace network changes. Additionally, the gear did not fit our hardware philosophy.

Fast forward to today: our current data center topology is a modified spine and leaf, or “folded Clos”, design. We use OSPF to route traffic between cabinets, and a pair of leaf switches is placed in each cabinet. Each leaf switch pair represents a layer 2 boundary and allows us to MLAG our servers to maintain switch redundancy within that boundary. In addition, a pair of spine switches is placed in an end-of-row networking cabinet. Multiple edge routers and firewalls connect to an area border router via OSPF, and the edge routers connect to our ISPs via BGP.
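
To make the routed piece of that concrete, here is a minimal sketch of what the OSPF configuration might look like in Quagga on one of the Cumulus Linux leaf switches. The router ID, prefixes, and interface names are placeholders for illustration, not our production configuration:

  ! Uplinks to the spine switches are routed point-to-point links
  interface swp51
   ip ospf network point-to-point
  interface swp52
   ip ospf network point-to-point
  router ospf
   ospf router-id 10.255.0.11
   network 10.0.0.0/16 area 0.0.0.0
   ! Keep the server-facing layer 2 side passive
   passive-interface vlan100

The MLAG piece (CLAG, in Cumulus terms) lives separately in the switch’s interface configuration and stays inside the layer 2 boundary; only the uplinks participate in OSPF.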

Spine

Dell S6000-ON and Penguin Arctica 3200XL — both run Cumulus Linux

  • 32 Ports of 40Gb QSFP+

Leaf / Area Border Router

Dell S4810-ON and Penguin Arctica 4804X — both run Cumulus Linux

  • 48 Ports of 10Gb SFP+ plus 4 ports of 40Gb QSFP+

Management

Penguin Arctica 4804i — running Cumulus Linux

  • 48 Ports of 1Gb plus 4 ports of 10Gb SFP+

Edge Router / Firewall

Dell R610 1U Servers:

  • Dual Intel X520-DA2 NIC with Intel SFP+ optics
  • Dual Intel X5650 CPU
  • 96GB of RAM (Helps with Internet routing tables, IPS, firewall states, etc.)

Routers run Ubuntu Linux with Quagga for OSPF and BGP.
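
As a rough illustration (the ASNs, prefixes, and neighbor addresses below are documentation placeholders, not ours), the Quagga side of an edge router boils down to an OSPF section and a BGP section along these lines:

  ! ospfd: interior routing toward the area border router
  router ospf
   ospf router-id 192.0.2.1
   network 192.0.2.0/24 area 0.0.0.0
  !
  ! bgpd: session with an upstream ISP, announcing a public prefix
  router bgp 64512
   bgp router-id 192.0.2.1
   network 198.51.100.0/24
   neighbor 203.0.113.1 remote-as 64500
   neighbor 203.0.113.1 description isp-a

Each daemon normally reads its own file under /etc/quagga (ospfd.conf and bgpd.conf), or everything can be kept in a single integrated configuration managed through vtysh.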

Firewalls run pfSense (FreeBSD-based) with Quagga for OSPF and Suricata for IPS.

Cables

Amphenol 10Gb SFP+ DAC

Amphenol 40Gb QSFP+ DAC

Amphenol 40Gb QSFP+ Active Optical

FiberStore multi-mode fiber

Transceivers

FiberStore 10Gb SR Optics

Intel 10Gb SR and LR Optics (for compatibility with X520-DA2 cards)

Seattle Data Center Network Cabinet

(please excuse the screwdriver and loose fiber, this was a work in progress at the time)
