Winbind in SmartOS, Part III (Polishing)

You now have a base-32 or base-64 zone running Winbind. What do you need to take it the rest of the way? We will cover a number of topics in this post; pick the ones that are relevant to your environment.

A GitHub repository has been made available for this series, including all relevant configuration files and a setup script that summarizes the commands issued (aside from a few points in this post). The script is not intended for production use, but it is good for a simple test and for those who read code more readily than blog posts.

Creating the Winbind Service

We have manually started Winbind with the “winbindd” command up to this point. Let’s remedy that by creating an SMF service for the Winbind process. First, create a manifest file for the service (we can save it as ~/winbind.xml):

<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='network/winbind' type='service' version='0'>
    <create_default_instance enabled='true'/>
    <single_instance/>
    <dependency name='fs' grouping='require_all' restart_on='none' type='service'>
      <service_fmri value='svc:/system/filesystem/local'/>
    </dependency>
    <dependency name='network' grouping='optional_all' restart_on='none' type='service'>
      <service_fmri value='svc:/milestone/network'/>
    </dependency>
    <exec_method name='refresh' type='method' exec=':kill -HUP' timeout_seconds='60'/>
    <exec_method name='start' type='method' exec='/opt/local/sbin/winbindd' timeout_seconds='60'/>
    <exec_method name='stop' type='method' exec=':kill' timeout_seconds='60'/>
    <template>
      <common_name>
        <loctext xml:lang='C'>Winbind</loctext>
      </common_name>
    </template>
  </service>
</service_bundle>

Stop our current winbind process, then import the manifest using svccfg:

# pkill winbind
# svccfg import ~/winbind.xml
# 

We now have Winbind running as an SMF service:

# svcs winbind
STATE          STIME    FMRI
online         19:04:24 svc:/network/winbind:default
# 

Reducing the Authentication Delay

If you are testing Winbind on a larger domain, you may notice that the getent commands and login take a decent amount of patience. We had the Winbind user and group enumeration options enabled for testing purposes, but they are not strictly necessary for the majority of operations (we haven’t yet encountered a case where they were needed). Remove these lines from /opt/local/etc/samba/smb.conf:

        winbind enum groups = yes
        winbind enum users = yes

Refresh the winbind service that we just created to apply the new configuration (the refresh method in our manifest sends Winbind a HUP signal):

# svcadm refresh winbind
# 

Note that you can still use getent to query for particular users or groups by specifying the user or group name in the command, even if the general getent commands will no longer display all domain users or groups.

Creating Home Directories

If you look back at the end result of the previous posts, you will notice this error on login:

Could not chdir to home directory /home/[user]: No such file or directory

We have the home directory configured, but have nothing set up to create individual user directories.

Troubleshooting

Linux has a PAM module called “pam_mkhomedir”. It seems like a straightforward solution, but it is not provided or supported in SmartOS. Compiling the module in Solaris has been done, but we don’t want to support that kind of work through our configuration management solution (puppet) due to the amount of hacking required.

Winbind PAM has a “mkhomedir” option, but it has a fatal flaw. Here is what login looks like with the option set:

$ ssh [user]@[host]
Password:
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ; Instance (base-64-lts 15.4.0)
                   `-'  https://docs.joyent.com/images/smartos/base

-bash-4.1$ ls -al
total 2
drwx------   2 [user] [group]       2 Feb 26 19:19 .
drwxr-xr-x   4 root     root           4 Feb 26 19:19 ..
-bash-4.1$ echo $PATH
/usr/ccs/bin:/usr/bin:/bin:/usr/sbin:/sbin
-bash-4.1$ 

The home directory is created, but it is completely empty and we are left with bash defaults for everything. The PATH variable is especially unhelpful, as we will be missing out on around 400 commands with the default value (including sudo).

The autofs service was originally intended for auto-mounting of network shares, but it can be tweaked for the purposes of auto-generation of user home directories. Using a modified version of Znogger’s auto_homedir script, we will create a home directory for each user (as needed) based on a copy of the /etc/skel directory (very important, it includes good shell defaults including a good PATH value). Save this script to /etc/auto_home (overwrite the file that is there):

#!/usr/bin/bash

HOMEDIRPATH=/home
PHYSICALDIRPATH=/export/home

# The folder being requested is first argument, we assume at
# this point that it is a user
user=$1
group=$(id -gn "$user")
home_dir=$(getent passwd "$user" | cut -d: -f6)
physical_dir="$PHYSICALDIRPATH/$user"

# Only create a directory if it doesn't exist
if [ ! -d "$physical_dir" ]; then
  # Sanity check, ensure that this is actually a user and that
  # his home directory is in the expected location
  if [[ "$home_dir" != $HOMEDIRPATH/* ]]; then
    exit
  fi

  mkdir -p "$PHYSICALDIRPATH"
  # Use the shell defaults in /etc/skel to populate the initial
  # home directory
  cp -r /etc/skel "$physical_dir"
  chown -R "$user":"$group" "$physical_dir"
fi

echo "localhost:$physical_dir"

Make the script executable:

# chmod +x /etc/auto_home
# 

Enable the bind and autofs services (autofs needs bind to be running):

# svcadm enable bind
# svcadm enable autofs
# 

Now try logging in:

$ ssh [user]@[host]
Password:
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ; Instance (base-64-lts 15.4.0)
                   `-'  https://docs.joyent.com/images/smartos/base

[user@host ~]$ ll -a
total 12
drwxr-xr-x 3 26467 27831  11 Feb 26 19:47 ./
dr-xr-xr-x 2 root  root    2 Feb 26 19:45 ../
-rw-r--r-- 1 26467 27831  76 Feb 26 19:47 .bash_profile
-rw-r--r-- 1 26467 27831 304 Feb 26 19:47 .bashrc
-rw-r--r-- 1 26467 27831 151 Feb 26 19:47 .cshrc
-rw-r--r-- 1 26467 27831  38 Feb 26 19:47 .curlrc
-rw-r--r-- 1 26467 27831 240 Feb 26 19:47 .irbrc
-rw-r--r-- 1 26467 27831  40 Feb 26 19:47 .login
-rw-r--r-- 1 26467 27831 690 Feb 26 19:47 .profile
drwxr-xr-x 2 26467 27831   3 Feb 26 19:47 .ssh/
-rw-r--r-- 1 26467 27831 846 Feb 26 19:47 .vimrc
[user@host ~]$ echo $PATH
/usr/local/sbin:/usr/local/bin:/opt/local/sbin:/opt/local/bin:/usr/sbin:/usr/bin:/sbin
[user@host ~]$ 

Much better!

How did that work?

You can see Znogger’s post for a more detailed explanation of the autofs process. We don’t need step 1 (modifying the /etc/auto_master file) in his instructions, since /etc/auto_home is configured by default in SmartOS to be called for /home. If you wanted to have several auto-mounted sources for /home, or use a different home directory, you will need to store the script in a different location and modify /etc/auto_master or the existing /etc/auto_home file instead.
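To make the contract between autofs and the script more concrete, here is a minimal, side-effect-free sketch. It assumes that autofs passes the requested key (the user name) as the first argument and reads a map entry from standard output, which is how the script above behaves. The function name auto_home_entry is hypothetical, and the home directory is passed in as a second argument instead of being looked up with getent, so it can be tried on any machine.

```shell
#!/bin/sh
# Dry-run stand-in for /etc/auto_home: prints the map entry that autofs
# would receive, without creating or copying anything.
# $1 = key requested by autofs (the user name)
# $2 = that user's home directory as reported by the passwd database
auto_home_entry() {
  case "$2" in
    /home/*)
      # Same shape as the final echo in the real script
      printf 'localhost:/export/home/%s\n' "$1"
      ;;
    *)
      # Home directory is not under /home: print no entry at all
      return 0
      ;;
  esac
}
```

For example, auto_home_entry alice /home/alice prints localhost:/export/home/alice, while a user whose home directory is /root produces no output and therefore no mount.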

Limiting Access

For the needs of the Faithlife environment, we want to limit SSH access based on Active Directory group membership. Setting that up in Winbind is straightforward, but let’s lay out some groundwork by moving pam_winbind settings to a separate settings file before it gets too long. Remove “use_first_pass” from pam.conf:

# sed -i 's/ use_first_pass//g' /etc/pam.conf
# 

Then create our initial pam_winbind configuration at /etc/security/pam_winbind.conf with this as the content:

[global]

try_first_pass = yes

We can now add a membership limitation in that file (get the group name or guid from the getent group listing, and comma-separate the entries if permitting multiple groups):

# echo 'require_membership_of = [group]' >> /etc/security/pam_winbind.conf
# 
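As a rough sketch of what the finished file looks like, the helper below writes the complete pam_winbind.conf in one step. The function name write_pam_winbind_conf and the group names in the usage note are hypothetical; the two settings are the ones used in this post.

```shell
#!/bin/sh
# Write a complete pam_winbind.conf with the two settings from this post.
# $1 = destination file
# $2 = comma-separated list of permitted groups
write_pam_winbind_conf() {
  cat > "$1" <<EOF
[global]

try_first_pass = yes
require_membership_of = $2
EOF
}
```

Calling write_pam_winbind_conf /tmp/pam_winbind.conf 'linux-admins,web-team' produces a file you can review before copying it to /etc/security/pam_winbind.conf.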

Granting Sudo Access

Winbind access does not allow us to use the more granular features provided by the Solaris RBAC system, so we will need to go to the old admin standby: sudo. Nothing here is out of the ordinary for Winbind; we just need to add any groups or users to the sudoers file so that they can take administrative actions.

We’ll just append to the sudoers file here, but using the visudo command is recommended when manually editing sudoers. For a group (escape spaces and backslashes with backslashes):

# echo '%[group] ALL=(ALL) ALL' >> /opt/local/etc/sudoers
# 

For a user (also escaped if necessary):

# echo '[user] ALL=(ALL) ALL' >> /opt/local/etc/sudoers
# 

With that change, we now have sudo access:

[user@host ~]$ sudo su -
Password:
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ; Instance (base-32-lts 15.4.0)
                   `-'  https://docs.joyent.com/images/smartos/base

[root@host ~]# 
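The escaping rule mentioned above (spaces and backslashes are escaped with backslashes) is easy to get wrong by hand, so here is a small sketch that applies it with sed. The function names sudoers_escape and sudoers_group_line are hypothetical, and the group names in the usage note are made up for illustration.

```shell
#!/bin/sh
# Escape spaces and backslashes in a user or group name for sudoers.
sudoers_escape() {
  printf '%s\n' "$1" | sed 's/[\\ ]/\\&/g'
}

# Build a full-access sudoers line for a group (note the leading %).
sudoers_group_line() {
  printf '%%%s ALL=(ALL) ALL\n' "$(sudoers_escape "$1")"
}
```

With this, sudoers_group_line 'domain admins' prints %domain\ admins ALL=(ALL) ALL. Checking the resulting file with visudo -c before relying on it is still a good idea.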

Working with More than 16 Group Memberships

If your Active Directory domain users tend to have a significant number of security-group memberships, you may have already encountered this issue when granting sudo access: SmartOS is based on a version of Solaris that (by default) has a group membership limit of 16, and any AD security group memberships beyond 16 are truncated. The limitation does not affect basic SSH access, since the Winbind PAM doesn’t consult NSS for the membership list, but sudo will be blocked if the relevant group was not included in the 16.

The truncation isn’t realistically controllable. Increasing the group limit is possible, but it requires grub-level changes and a reboot. That change becomes even more difficult in an SDC or Triton environment, and impossible in the JPC environment.

Our workaround hinges on one predictable aspect of the truncation: the primary user group is never cut. We ensure that the primary AD group for each AD user is the one that is most significant for SSH access.
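To find out which accounts are affected, one option is to count a user's memberships as Winbind itself reports them and compare that with the limit. The sketch below assumes the list arrives on standard input with one group id per line, which is my understanding of what wbinfo -r [user] prints; the function name exceeds_group_limit is hypothetical.

```shell
#!/bin/sh
# Succeeds (exit status 0) when stdin holds more group ids than the limit.
# $1 = limit, defaulting to the 16 described above
exceeds_group_limit() {
  limit=${1:-16}
  # wc -l may pad its output with spaces on some platforms
  count=$(wc -l | tr -d ' ')
  [ "$count" -gt "$limit" ]
}

# Usage on a zone joined to the domain:
#   wbinfo -r [user] | exceeds_group_limit && echo "memberships will be truncated"
```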

Eliminating the SSH Key Loophole for Disabled Accounts

Using RSA keys to access SmartOS zones based on domain accounts is possible and a great boost to productivity while working with batches of servers. Unfortunately, use of RSA keys reveals a surprising limitation between PAM and SSH: if an RSA key is used for access, some of the PAM sections are not consulted. As a result, a disabled account will not be denied access by Winbind if an RSA key is utilized that was previously added to ~/.ssh/authorized_keys for that user.

Our workaround for that security hole is fairly simple: we use an Active Directory group for “Disabled Users” and block members of that group through the SSH service configuration. Do that by setting the DenyGroups value in sshd_config:

# echo 'DenyGroups "disabled users"' >> /etc/ssh/sshd_config
# svcadm refresh ssh
#

Note that the 16-group-membership limit applies here, so if you leave group memberships applied to the disabled account you should set the “Disabled Users” group as the primary security group.

Updating Passwords

Winbind PAM does support password changes via passwd and during login, but under SmartOS we don’t have that luxury:

[user@host ~]$ passwd
passwd: Changing password for [user]
passwd: Unsupported nsswitch entry for "passwd:". Use "-r repository ".
Unexpected failure. Password file/table unchanged.

Solaris, and by lineage SmartOS, only supports a few configurations for password changes (a hard-coded limitation):

Only five passwd configurations are permitted:

  • passwd: files
  • passwd: files nis
  • passwd: files nisplus
  • passwd: compat
  • passwd: compat passwd_compat: nisplus

There is no real fix for this limitation, but kpasswd (the Kerberos equivalent of the passwd command) can be used to change the domain password for a user. The kpasswd command does not work for password changes on login.

Working with Long Hostnames

Since we are joining an Active Directory domain, a familiar limitation comes into play for hosts with longer hostnames:

# net join -k -U [user]
Our netbios name can be at most 15 chars long, "FAIRLY-LONG-HOSTNAME" is 20 chars long
Invalid configuration.  Exiting....
Failed to join domain: The format of the specified computer name is invalid.
ADS join did not work, falling back to RPC...
Our netbios name can be at most 15 chars long, "FAIRLY-LONG-HOSTNAME" is 20 chars long

We can’t go changing Microsoft now, so we need to change the name under which the host registers when it joins the domain. We can do that via the “netbios name” setting in /opt/local/etc/samba/smb.conf:

# echo '        netbios name = shorter-name' >> /opt/local/etc/samba/smb.conf
# net join -k -U [user]
Enter [user]'s password:
Using short domain name -- LRSCORP
Joined 'SHORTER-NAME' to dns domain 'lrscorp.net'
# 

The kind of truncation that you can do on the hostname depends on your environment; in ours, we removed dashes to get all our zone hostnames under the limit.
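The dash-removal approach can be sketched as a small function that also refuses names that are still too long, so a bad name is caught before the join is attempted. The function name netbios_name and the example hostname in the usage note are hypothetical.

```shell
#!/bin/sh
# Derive a NetBIOS-safe name from a hostname by removing dashes.
# Prints the shortened name, or fails if it is still over 15 characters.
netbios_name() {
  name=$(printf '%s' "$1" | tr -d '-')
  if [ "${#name}" -gt 15 ]; then
    echo "netbios_name: '$name' is ${#name} chars long, limit is 15" >&2
    return 1
  fi
  printf '%s\n' "$name"
}
```

For example, netbios_name db-primary-east-01 prints dbprimaryeast01 (exactly 15 characters), while fairly-long-hostname is rejected because it is still 18 characters long without its dashes.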

Allow Normal Users to Query Users and Groups

After testing things out as a normal user, you may notice that the user can’t see his own user or group names:

$ whoami
whoami: cannot find name for user ID 23456
$ id
uid=23456 gid=65432 groups=65432,76543

Troubleshooting

Truss is once again a key to figuring out why things aren’t working. Let’s use our truss-and-grep method from part 1:

$ truss whoami 2>&1 >/dev/null | grep winbind
stat("/lib/64/nss_winbind.so.1", 0xFFFFFD7FFFDFF090) Err#2 ENOENT
stat("/usr/lib/64/nss_winbind.so.1", 0xFFFFFD7FFFDFF090) Err#2 ENOENT
stat("/opt/local/lib/nss_winbind.so.1", 0xFFFFFD7FFFDFF090) = 0
resolvepath("/opt/local/lib/nss_winbind.so.1", "/opt/local/lib/libnss_winbind.so", 1023) = 32
open("/opt/local/lib/nss_winbind.so.1", O_RDONLY) = 4
stat("/opt/local/lib//libwinbind-client-samba4.so", 0xFFFFFD7FFFDFEF60) Err#2 ENOENT
stat("/opt/local/gcc49/x86_64-sun-solaris2.11/lib/amd64/libwinbind-client-samba4.so", 0xFFFFFD7FFFDFEF60) Err#2 ENOENT
stat("/opt/local/gcc49/lib/amd64/libwinbind-client-samba4.so", 0xFFFFFD7FFFDFEF60) Err#2 ENOENT
stat("/opt/local/lib/samba/private/libwinbind-client-samba4.so", 0xFFFFFD7FFFDFEF60) Err#13 EACCES [file_dac_search]
[...]
$

The EACCES error seems fairly self-explanatory. Let’s check the permissions on that library:

$ ll -d /opt/local/lib/samba/private/libwinbind-client-samba4.so
ls: cannot access /opt/local/lib/samba/private/libwinbind-client-samba4.so: Permission denied
$ ll -d /opt/local/lib/samba/private
drwx------ 3 root root 104 Mar 10 20:47 /opt/local/lib/samba/private/
$

Allow regular users to read and execute the “private” winbind libraries (from the root user):

# chmod 755 /opt/local/lib/samba/private
# chmod 755 /opt/i386/opt/local/lib/samba/private
#

User info is now available:

$ whoami
[user]
$ id
uid=23456([user]) gid=65432([primary-group]) groups=65432([primary-group]),76543([other-group])

Winbind in SmartOS, Part II (Running in Base-64)

We left off last time with a very basic (but working) Winbind deployment. In this post, we will focus on getting that same basic Winbind functionality in a base-64 zone. If you want to try polishing first, you can skip this post for now and use your base-32 zone for part 3.

Starting with a base-64 zone, we can follow the steps listed in part 1 of the series. Return here when things start to go sideways, and we’ll work out the steps necessary for the new architecture.

Following NSS configuration, the “getent” commands still aren’t picking up on Active Directory users or groups, even after adding the /opt/local/lib path to ld.config.

Troubleshooting

Following the troubleshooting steps from post 1 (disable name-service-cache, grep truss for winbind) doesn’t reveal anything crazy: getent finds and loads our “nss_winbind.so.1” link we created. Running truss without grep, however, reveals something interesting:

# truss getent passwd 2>&1 >/dev/null | tail
stat64("/usr/lib/nss_winbind.so.1", 0x08047110) Err#2 ENOENT
stat64("/opt/local/lib/nss_winbind.so.1", 0x08047110) = 0
resolvepath("/opt/local/lib/nss_winbind.so.1", "/opt/local/lib/libnss_winbind.so", 1023) = 32
open("/opt/local/lib/nss_winbind.so.1", O_RDONLY) = 3
mmapobj(3, MMOBJ_INTERPRET, 0xFED60790, 0x0804717C, 0x00000000) Err#48 ENOTSUP
mmap(0x00000000, 4096, PROT_READ, MAP_PRIVATE, 3, 0) = 0xFED50000
munmap(0xFED50000, 4096)                        = 0
close(3)                                        = 0
open("/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_SGS.mo", O_RDONLY) Err#2 ENOENT
_exit(0)

An error is produced when getent tries to load nss_winbind (“mmapobj(3, […]) Err#48 ENOTSUP”). Unfortunately, a quick search for that error doesn’t give us anything useful. Time to roll up our sleeves and read the mmapobj documentation:

Errors

The mmapobj() function will fail if:

[…]

ENOTSUP

The current user data model does not match the fd to be interpreted. For example, a 32-bit process that tried to use mmapobj() to interpret a 64-bit object would return ENOTSUP.

The flags argument contains MMOBJ_INTERPRET and the fd argument is a file whose type can not be interpreted.

The ELF header contains an unaligned e_phentsize value.

Are we running a 32-bit process and trying to interpret a 64-bit object?

# which getent
/usr/bin/getent
# file /usr/bin/getent
/usr/bin/getent:        ELF 32-bit LSB executable 80386 Version 1, dynamically linked, not stripped, no debugging information available
# file /opt/local/lib/libnss_winbind.so
/opt/local/lib/libnss_winbind.so:       ELF 64-bit LSB dynamic lib AMD64 Version 1, dynamically linked, not stripped

Sure enough, getent is 32-bit and our libnss_winbind.so is 64-bit. Old Solaris, and by lineage SmartOS, has a mix of 32-bit and 64-bit processes and commands in a 64-bit environment. Pkgsrc doesn’t support a mix of 32-bit and 64-bit in packages, so the 32-bit commands are out of luck in a 64-bit Samba package obtained through pkgsrc.
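If you need to run this check over many binaries and libraries, the word size can also be read straight from the ELF header instead of parsing the output of file. The sketch below relies on the fifth byte of an ELF file (EI_CLASS) being 1 for 32-bit objects and 2 for 64-bit objects; the function name elf_class is hypothetical, and the function does not verify the ELF magic bytes first.

```shell
#!/bin/sh
# Print 32 or 64 depending on the EI_CLASS byte (offset 4) of an ELF file.
# $1 = path to the executable or shared object
elf_class() {
  # Read one byte at offset 4 as an unsigned decimal number
  class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
  case "$class" in
    1) echo 32 ;;
    2) echo 64 ;;
    *) echo unknown; return 1 ;;
  esac
}

# Usage in the zone:
#   elf_class /usr/bin/getent
#   elf_class /opt/local/lib/libnss_winbind.so
```

A mismatch between the two results is the condition that produces the ENOTSUP error seen in the truss output.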

Can we get 32-bit packages?

Long story short, we can! We will use pkg_add, with a few special options to convince it to install 32-bit packages in a separate directory (we chose /opt/i386):

  • Set the PKG_PATH environment variable to the value specified in /opt/local/etc/pkg_install.conf, substituting i386 for x86_64. Use this command to see what you should use: “sed -n 's/x86_64/i386/p' /opt/local/etc/pkg_install.conf”.
  • “-m i386” overrides the architecture; otherwise the process will error out.
  • “-P /opt/i386 -I” installs the packages to an alternate location and doesn’t run scripts, preventing the overwriting of any 64-bit packages.

Install the 32-bit samba package (will also install all 32-bit dependencies):

# sed -n 's/x86_64/i386/p' /opt/local/etc/pkg_install.conf
PKG_PATH=http://pkgsrc.joyent.com/packages/SmartOS/2015Q4/i386/All
# env PKG_PATH=http://pkgsrc.joyent.com/packages/SmartOS/2015Q4/i386/All pkg_add -I -m i386 -P /opt/i386 samba
[...]

Since we are now working with a new directory structure, we need to link libnss_winbind.so again:

# ln -s libnss_winbind.so /opt/i386/opt/local/lib/nss_winbind.so.1
# 

Troubleshooting

Let’s try getent again:

# getent passwd
[no domain users]

Using truss, we see that getent isn’t looking in our new i386 folder. We then need to ensure that 32-bit processes use the 32-bit libraries.

Replace /opt/local/lib with the i386 path in ld.config:

# crle -c /var/ld/ld.config -l /lib:/usr/lib:/opt/i386/opt/local/lib -s /lib/secure:/usr/lib/secure
# 

The base-64 library paths are stored in a different ld.config, so we will add /opt/local/lib there so that the 64-bit processes can find the Winbind-related libraries:

# crle -64 -c /var/ld/64/ld.config -l /opt/local/lib -u
# 

Troubleshooting

Running getent again, we still find no domain users. Back to truss, we discover that it is trying to load /opt/local/lib/samba/private/libwinbind-client-samba4.so, a 64-bit library, but failing with another “ENOTSUP” error.

Add /opt/i386/opt/local/lib/samba/private to ld.config:

# crle -c /var/ld/ld.config -l /opt/i386/opt/local/lib/samba/private -u
# 

Testing again with getent, we now have domain users! Restart the ssh and cron services, and re-enable the name-service-cache service in case you disabled it earlier:

# svcadm restart cron
# svcadm restart ssh
# svcadm enable name-service-cache
#

Now we head back to part 1, continuing with the PAM configuration.

We get to the last step, but SSH access doesn’t work as advertised.

Troubleshooting

Checking in the authentication logs at /var/log/authlog, we see this PAM error message:

2016-02-22T22:10:00+00:00 localhost cron[9433]: [ID 705739 auth.error] open_module[0:/etc/pam.conf]: /opt/local/lib/samba/security/pam_winbind.so failed: ld.so.1: cron: fatal: /opt/local/lib/samba/security/pam_winbind.so: wrong ELF class: ELFCLASS64

So the path we added in pam.conf was for the 64-bit pam_winbind module. Let’s update that path to work for both 32-bit and 64-bit processes requiring PAM.

We need to update pam.conf with a link to the base-32 pam_winbind.so, adding “$ISA” for 64-bit process support:

other auth requisite          pam_authtok_get.so.1
other auth required           pam_dhkeys.so.1
other auth required           pam_unix_cred.so.1
other auth sufficient         /opt/i386/opt/local/lib/samba/security/$ISA/pam_winbind.so use_first_pass
other auth sufficient         pam_unix_auth.so.1

other account sufficient      /opt/i386/opt/local/lib/samba/security/$ISA/pam_winbind.so use_first_pass
other account requisite       pam_roles.so.1
other account required        pam_unix_account.so.1

other session required        pam_unix_session.so.1
other session required        /opt/i386/opt/local/lib/samba/security/$ISA/pam_winbind.so

other password required       pam_dhkeys.so.1
other password sufficient     /opt/i386/opt/local/lib/samba/security/$ISA/pam_winbind.so
other password requisite      pam_authtok_get.so.1
other password requisite      pam_authtok_check.so.1
other password required       pam_authtok_store.so.1

A link to the 64-bit module needs to be inserted into a subdirectory of the 32-bit path, since “$ISA” will be replaced with “64” for 64-bit services utilizing PAM:

# mkdir /opt/i386/opt/local/lib/samba/security/64
# ln -s /opt/local/lib/samba/security/pam_winbind.so /opt/i386/opt/local/lib/samba/security/64/
# 
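To see which file each kind of process ends up loading, the substitution can be simulated. This sketch assumes that “$ISA” collapses to nothing for 32-bit callers and becomes “64” for 64-bit callers, which matches the layout we just created; the function name expand_isa is hypothetical.

```shell
#!/bin/sh
# Show the module path a PAM consumer would resolve for a pam.conf entry.
# $1 = module path as written in pam.conf (containing $ISA)
# $2 = word size of the calling process: 32 or 64
expand_isa() {
  case "$2" in
    64) isa='64/' ;;
    *)  isa='' ;;
  esac
  # Replace the literal text $ISA/ with the architecture directory
  printf '%s\n' "$1" | sed 's|\$ISA/|'"$isa"'|'
}
```

Given the module path from pam.conf above, the 32-bit result is /opt/i386/opt/local/lib/samba/security/pam_winbind.so (the i386 package's own module), and the 64-bit result is the link we just created under security/64/.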

SSH access via domain credentials is now available in our base-64 zone:

$ ssh jeremy.einfeld@10.88.88.148
Password:
Could not chdir to home directory /home/jeremy.einfeld: No such file or directory
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ; Instance (base-64-lts 15.4.0)
                   `-'  https://docs.joyent.com/images/smartos/base

-bash-4.1$

Our next step is to polish our Winbind deployment, getting it ready for a real environment in part 3.

Winbind in SmartOS, Part I (the Basics)

Centralized authentication is a terrific tool for anything more than a handful of servers, and Active Directory is often the go-to for authentication within a datacenter.

After testing several centralized authentication solutions for suitability in our environment, we settled on Winbind, both for its features and because of the issues presented by the other solutions. A future post may detail the investigation and decision, but the biggest reason for us to use Winbind (of several reasons) was the ability to limit SSH access to servers through Active Directory group membership.

Note that these steps apply to local SDC/Triton deployments and the Joyent Public Cloud, in addition to stand-alone SmartOS servers, and no changes should be necessary in Active Directory.

TL;DR: see this setup script for a summarization of steps taken in this series, and this repository for configuration files used. Note that some details from part 3 are not covered in the repository, see the readme for a list.

Troubleshooting Steps

You can ignore these if you just want to get Winbind working, but check them out if you want to see why the prior command didn’t work and how the next commands fix it.

WARNING

Don’t try this in base-64 (yet); it brings additional difficulties that will be handled in part 2. Use base-32 for now. Additionally, image releases prior to 15.2.0 require additional steps that will not be covered in this series, so using 15.2.0+ is recommended.

Configuring Kerberos

Winbind uses a variety of protocols to interact with domain accounts, and the primary one that we need to set up is Kerberos. The configuration is managed in /etc/krb5/krb5.conf, and we can use this minimalistic config to get us started:

[libdefaults]
        default_realm = DOMAIN.NET
        dns_lookup_kdc = true
        default_tkt_enctypes = rc4-hmac des-cbc-crc des-cbc-md5

[realms]
        DOMAIN.NET = {
                kpasswd_protocol = SET_CHANGE
        }

Install the mit-krb5 package, then test Kerberos key initialization:

# pkgin -y install mit-krb5
[...]
# kinit [user]
kinit: Configuration file does not specify default realm when parsing name [user]
#

Something isn’t adding up: we did set default_realm in krb5.conf.

Troubleshooting

Let’s use truss to see where kinit is looking for the config file:

# truss kinit [user] 2>&1 >/dev/null | grep krb5.conf
stat("/opt/local/etc/krb5.conf", 0xFFFFFD7FFFDFF390) Err#2 ENOENT
stat("/etc/krb5.conf", 0xFFFFFD7FFFDFF390) Err#2 ENOENT
#

There’s the issue: the default krb5.conf included by SmartOS isn’t where Kerberos is looking.

Link krb5.conf to /etc/:

# ln -s krb5/krb5.conf /etc/

Test Kerberos again:

# kinit [user]
Password for [user]@DOMAIN.NET:
# klist
Default principal: [user]@DOMAIN.NET

Valid starting     Expires            Service principal
02/18/16 18:31:54  02/19/16 04:32:06  krbtgt/DOMAIN.NET@DOMAIN.NET
        renew until 02/19/16 18:31:54
# 

Success! Kerberos is now ready to start working for Winbind.

Configuring Winbind

Our first step for Winbind itself is to install the necessary package via pkgsrc. Winbind is bundled in the Samba package, so that is the one that we will use here:

# pkgin -y install samba

We then need to configure Winbind. The settings are stored in the Samba config file at /opt/local/etc/samba/smb.conf; here is a bare-bones smb.conf for our “domain.net” access:

[global]
        workgroup = DOMAIN
        realm = DOMAIN.NET
        idmap config * : backend = tdb
        idmap config * : range = 1000 - 9999
        idmap config DOMAIN:backend = rid
        idmap config DOMAIN:range = 10000 - 1073751823
        idmap config DOMAIN:schema_mode = rfc2307
        kerberos method = secrets and keytab
        security = ADS
        winbind enum groups = yes
        winbind enum users = yes
        winbind offline logon = yes
        winbind use default domain = yes
        template homedir = /home/%U
        template shell = /usr/bin/bash

We now domain-bind the instance using the Kerberos ticket we created previously with kinit:

# net join -k
Using short domain name -- DOMAIN
Joined '[HOSTNAME]' to dns domain 'domain.net'
#

Now that we have Winbind configured, we can start the daemon with this command:

# winbindd
#

Don’t worry about starting any of the Samba services, they are outside of the scope of Winbind. Test Winbind with these commands:

# wbinfo -u
[list of domain users]
# wbinfo -g
[list of domain groups]
# wbinfo -i [user]
[passwd entry for user, including full name, id, group id, home directory, and shell]
#

Configuring NSS

User and group data is made available in SmartOS through the Name Service Switch (NSS) facility, and now that Winbind is working we can add it as a source for that information. Update the passwd and group settings in /etc/nsswitch.conf, appending “winbind”:

passwd:     files winbind
group:      files winbind
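If you would rather script this change than edit the file by hand, a sed filter along these lines can append the source without touching other databases. This is only a sketch: the function name add_winbind_source is hypothetical, and it prints the modified file to standard output so the result can be reviewed before it replaces /etc/nsswitch.conf.

```shell
#!/bin/sh
# Print the given nsswitch.conf with "winbind" appended to the passwd and
# group lines. Lines that already mention winbind are left alone, so the
# filter is safe to run more than once.
# $1 = path to an nsswitch.conf file
add_winbind_source() {
  sed -e '/^passwd:/{/winbind/!s/[[:space:]]*$/ winbind/;}' \
      -e '/^group:/{/winbind/!s/[[:space:]]*$/ winbind/;}' "$1"
}

# Usage:
#   add_winbind_source /etc/nsswitch.conf > /tmp/nsswitch.conf.new
```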

We can then test to see what the system sees as the users, groups, and credentials with these commands:

# getent passwd
[passwd]
# getent group
[groups]

Winbind sees users and groups, but the system isn’t seeing them through NSS.

Troubleshooting

Let’s run getent through truss to see if it is doing anything with the Winbind NSS module:

# truss getent passwd 2>&1 >/dev/null | grep winbind
#

Nothing. A grep for “nsswitch.conf” also returns nothing, so it appears that getent isn’t even using NSS. Looking at the SMF services, there is a name-service-cache service that is keeping user and group data cached (negating the need for getent to check through NSS). Disabling that service, we try getent again:

# svcadm disable name-service-cache
# truss getent passwd 2>&1 >/dev/null | grep winbind
stat64("/lib/nss_winbind.so.1", 0x080471A0)     Err#2 ENOENT
stat64("/usr/lib/nss_winbind.so.1", 0x080471A0) Err#2 ENOENT
#

Now we’re talking. So where is nss_winbind.so.1, if not there? A few searches later, we find that there is no “nss_winbind.so.1”: the Samba package loads a libnss_winbind.so into /opt/local/lib/. We need to link libnss_winbind into a location that is checked for NSS modules:

# ln -s /opt/local/lib/libnss_winbind.so /lib/nss_winbind.so.1
ln: failed to create symbolic link '/lib/nss_winbind.so.1': Read-only file system
# ln -s /opt/local/lib/libnss_winbind.so /usr/lib/nss_winbind.so.1
ln: failed to create symbolic link '/usr/lib/nss_winbind.so.1': Read-only file system
#

That didn’t exactly work, since /usr and /lib are read-only. With that knowledge, we need to change the search locations by modifying (indirectly) ld.config. We can see the current configuration by running this command:

# crle

Configuration file [version 4]: /var/ld/ld.config
  Platform:     32-bit LSB 80386
  Default Library Path (ELF):   /lib:/usr/lib
  Trusted Directories (ELF):    /lib/secure:/usr/lib/secure

Command line:
  crle -c /var/ld/ld.config -l /lib:/usr/lib -s /lib/secure:/usr/lib/secure

#

It even gives us the command to use to set it to replicate the current settings! Don’t forget to re-enable the name-service-cache service.

Add /opt/local/lib to ld.config, restart the services that need the new library path, and link libnss_winbind.so as nss_winbind.so.1:

# crle -c /var/ld/ld.config -l /opt/local/lib -u
# svcadm restart ssh
# svcadm restart cron
# ln -s libnss_winbind.so /opt/local/lib/nss_winbind.so.1
# svcadm restart name-service-cache
# getent passwd
[passwd including domain users]
# getent group
[groups including domain groups]

Configuring PAM

Our last step for basic Winbind auth is to configure the Pluggable Authentication Module (PAM) to use the pam_winbind module. We need to look for these sections in /etc/pam.conf and add the module into them (order does matter within each section):

WARNING

Adjusting the PAM configuration is a delicate operation: make any mistakes and authentication will no longer work. Always leave a session open when working with pam.conf, since a mistake won’t affect sessions that are already active. There is an additional safety net available in SmartOS: you can edit pam.conf from the SmartOS global zone via /zones/[uuid]/root/etc/pam.conf.

other auth requisite          pam_authtok_get.so.1
other auth required           pam_dhkeys.so.1
other auth required           pam_unix_cred.so.1
other auth sufficient         /opt/local/lib/samba/security/pam_winbind.so use_first_pass
other auth sufficient         pam_unix_auth.so.1

other account sufficient      /opt/local/lib/samba/security/pam_winbind.so use_first_pass
other account requisite       pam_roles.so.1
other account required        pam_unix_account.so.1

other session required        pam_unix_session.so.1
other session required        /opt/local/lib/samba/security/pam_winbind.so

other password required       pam_dhkeys.so.1
other password sufficient     /opt/local/lib/samba/security/pam_winbind.so
other password requisite      pam_authtok_get.so.1
other password requisite      pam_authtok_check.so.1
other password required       pam_authtok_store.so.1
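Because a mistake in pam.conf can lock everyone out, it may be worth confirming that every stack references pam_winbind.so before closing your open session. This is a rough sketch under stated assumptions: check_pam_winbind is a hypothetical helper, it only inspects the "other" service, and it checks that the module is present, not that it is in the right position.

```shell
# check_pam_winbind FILE: hypothetical helper for illustration only.
# Prints each module type of the "other" service that has no pam_winbind.so
# entry; the exit status is non-zero if any type is missing.
check_pam_winbind() {
  file=$1
  missing=0
  for want in auth account session password; do
    found=0
    # pam.conf fields: service, module type, control flag, module path, options.
    # Commented lines never match because their first field starts with "#".
    while read -r service mtype flag module opts || [ -n "$service" ]; do
      [ "$service" = other ] && [ "$mtype" = "$want" ] || continue
      case $module in
        *pam_winbind.so) found=1 ;;
      esac
    done < "$file"
    if [ "$found" -eq 0 ]; then
      echo "other $want: pam_winbind.so not found"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_pam_winbind /etc/pam.conf
```

From the global zone, the same function can be pointed at /zones/[uuid]/root/etc/pam.conf.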

With that configured, we can finally get SSH access via domain credentials:

$ ssh [user]@[host]
Password:
Could not chdir to home directory /home/[user]: No such file or directory
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ; Instance (base-32-lts 15.4.0)
                   `-'  https://docs.joyent.com/images/smartos/base

mail: Cannot open file '/var/mail/' for output
-bash-4.1$

Not perfect (the missing home directory and mail spool cause the two errors above), but we have Winbind access! Upcoming posts will detail how to get Winbind working in base-64 (part 2) and how to polish it up for a real environment (part 3).

Faithlife’s sdc-portal

Today we’re pleased to announce that our developer- and customer-facing portal for Joyent’s SmartDataCenter 7 has been open sourced.

After transitioning from OpenStack to SDC 7, the only thing we were left wanting was a portal for developers and non-operations staff (SDC already has a portal for Admin and Operations engineers). Luckily, SDC has a fantastic set of APIs that we’ve leveraged to create sdc-portal. The portal started out of necessity because our developers were used to having at least the ability to start, stop and reboot their VMs, but it is growing into much more. Today we want to give back to the open source community that has helped us immensely and invite others to help us make sdc-portal even better.

The portal is in its infancy, and we’ll be iterating on the documentation and feature set rapidly in the next few weeks. However, we’ve been encouraged by Joyent and a few other organizations to make the code available now due to the high demand and the large set of people that are eager to contribute.

Today the portal supports the following features:

  • OAuth sign-in
  • Integration with SmartDataCenter 7 and Joyent’s public cloud
  • Start, stop and reboot VMs
  • Get the current status and information about VMs

In the next few days and weeks we’ll be adding the following features:

  • Generic authentication provider support
  • VM provisioning
  • SSH key management
  • Things the community comes up with…

We welcome your feedback in our sdc-portal group, in the #smartos IRC channel, or in the form of GitHub issues. We also welcome your pull requests!

Hardware – Part II (Compute)

Compute hardware

Faithlife compute has gone through quite a few iterations in recent years. The transformation has been a critical piece of our success and ability to scale at costs that make sense to the business. Each iteration moves our deployment closer to being aligned with our overall philosophy.

Humble beginnings

Our first attempt at being more nimble, and reducing costs over our aging and expensive IBM physical server deployments, was VMware vCenter on an IBM Bladecenter backed by an IBM DS3500 SAN. Yes, you read that correctly, and yes, we may not have thought that decision through entirely.

Plenty of flexibility was gained by virtualization, but the cost of the Bladecenter, Blades, SAN, and VMware licensing meant that even the smallest incremental addition to the infrastructure represented a dollar amount that needed lots of discussion before approval. These factors led to projects being put on hold, developers not having the resources they needed, and Operations constantly battling an infrastructure running at or above capacity.

Commodity hardware, take one

Realizing that we were hamstrung by expensive hardware and licensing, we took to the basement and started a skunkworks project.

After a couple of weeks, and one hundred dollars, we emerged victorious from the basement. We assembled twelve Dell Optiplex 960 workstations as OpenStack compute nodes, three APC “PDUs”, six Cisco desktop switches, and some really awesome 1Gb Ethernet, all on a Costco rack. Believe it or not, we actually replaced a few of our aging development servers with this for quite a few months. I think we took commodity hardware a bit too seriously, though, and our datacenter wouldn’t allow us to deploy it in our cage.

Commodity hardware, take two

Having prototyped OpenStack, and shown that it had the potential to both run on commodity hardware and replace our current virtualization stack, we moved forward with a small production deployment to help deal with some of our capacity issues.

Our initial production OpenStack deployment consisted of three controllers, three compute nodes, and eight Ceph nodes. This was also the beginning of our servers becoming multi-purpose Lego Bricks. We used Dell R610 1U servers in one of three different configurations for all things. Additionally, we started keeping some spare memory, disk, CPU, and R610 chassis on hand. Since we had spare parts and a single server kind, we could easily fix or replace any piece of our hardware infrastructure.

The relatively low cost of the Dell R610 1U servers combined with free and open source virtualization meant we could finally remove the dam that was holding back additional gear. It took less than four months to go from the initial nine servers to one and a half racks full of gear.

During the build out we realized that our initial SAS based Ceph nodes did not have sufficient performance for database volumes and were too expensive for general purpose OS volumes. The solution was to add two new types of servers: Dell R620 filled with SSD, and Dell R510 filled with SATA.

When OpenStack and Ceph went into a death spiral and we transitioned to Joyent’s SmartDataCenter, we were able to reuse this same hardware for the emergency deployment with minor configuration changes and on-hand parts (just one more reason Lego Bricks for hardware is so important).

Commodity hardware, take three (Joyent SmartDataCenter / current day)

Shortly before we transitioned to Joyent SmartDataCenter, we acquired space in a brand new datacenter. This gave us a nice green field to apply the last few years’ worth of hard earned lessons and also build specifically for SmartDataCenter. Lucky for us, the great people at Joyent open sourced their bill of materials, which gave us a higher degree of confidence that our new build would be successful (after all, Joyent has already proven these builds in private and public clouds).

We really liked the Tenderloin-A/256 build based on price, disk performance, and density. Unfortunately the Tenderloin-A/256 build is based on SuperMicro parts, and we’re more comfortable with Dell servers; we have a great relationship with Redapt, a Dell partner who we purchase most of our hardware through. In that light, we worked with Redapt and Joyent to create a Dell build that is very close to the Tenderloin-A/256 Joyent build.

Faithlife’s SmartDataCenter compute node bill of materials

  • 1 x Dell R720 Chassis
  • 2 x Intel Xeon E5-2650 v2
  • 1 x iDRAC7 Enterprise
  • 1 x Intel X520 DP 10Gb DA/SFP+, + I350 DP 1Gb Ethernet Daughter Card
  • 1 x Intel / Dell SR SFP+ Optical Transceiver
  • 16 x Dell 16GB RDIMM 1866MT/s (256GB total)
  • 2 x 750W Power Supply
  • 1 x 200GB Intel DC S3700 SSD
  • 1 x Kingston 16GB USB stick
  • 1 x SuperMicro AOC-S2308L-L8E SAS controller
  • 15 x C10K900 HGST 2.5” 10K 600GB SAS

We’ve been running SmartDataCenter on this build with hundreds of VMs for a while now. The performance is outstanding; in fact, some of our VMs that previously needed dedicated SSD are just as happy on this SAS based configuration thanks to SmartOS zones and ZFS.

 

Service Interruption 1/27/15

Between 9:50PM PST and 10:28PM PST on January 27th 2015, most Faithlife sites and services were unable to talk to the public Internet. We’re sorry for the interruption this caused and we’re taking steps to reduce the likelihood of this happening again.

Cause

Our edge routers needed to be patched for the “Ghost” glibc vulnerability (CVE-2015-0235). The patching process on our primary edge router froze while updating Quagga, the daemon responsible for BGP and OSPF. The frozen patch process was subsequently killed, which unexpectedly killed the active Quagga daemon. When the Quagga daemon stops, that node is no longer able to advertise our ASN and public subnet. Normally, this should result in a very small interruption, because our secondary edge router should start advertising our public subnet over its already established BGP session with a different ISP. Unfortunately, we were in the process of making large changes to our secondary, and a few of the more important routes were misconfigured. This rendered the secondary edge router mostly unusable. Because the patch was being applied remotely over a VPN connection that relies on OSPF to talk to the router, the primary edge router was inaccessible. We drove to the data center immediately, physically connected to the machine, completed the patching, and restarted the Quagga daemon.

What We’re Doing

We’re currently going through a re-configuration of our edge routers and firewalls which will enable us to advertise our ASN and public subnet from multiple geographically diverse locations with different Internet Service Providers. This is actually a project that we hoped to have completed before going into 2015, but contracts and difficulties with the physical layer proved tougher than expected. Once this is complete, an issue like this should only cause a very small interruption of service for a subset of our users. Additionally, we’ll be adding console switches with multiple out-of-band connectivity options so that we shouldn’t have to worry about burning the time it takes to run to the datacenter or create a remote hands ticket.

Hardware – Part I (Network)

The hardware powering Faithlife has seen a massive transformation in the last eighteen months. We’re really excited about all the cool new changes, and the measurable impact they’ve had on our employees, customers, and the products / features we’re able to offer. Given that, we thought that sharing our hardware configuration was a fun way to live our values and showcase what we think is pretty cool.

Philosophy

At Faithlife we value smart, versatile learners, and automation, over expensive vendor solutions. Smart, versatile learners don’t lose value when technology changes or the company changes direction; vendor solutions often do. If we can use commodity hardware and free open source software to replace expensive vendor solutions, we do.

Commodity hardware is generally re-configurable and reusable, and lets us treat our hardware like Lego Bricks. Free open source software allows us to see behind the curtain, and more easily work with other existing tools. We’re empowered to fix our own issues by utilizing the talent we already employ, not just sit on our hands waiting for a vendor support engineer to help us out (though we do like to keep that option available when possible). Additionally, combining commodity hardware with automation tools like Puppet, we’re able to be nimble.

By being nimble, leveraging in house talent, Lego Brick-ish hardware, and free open source software, we’re able to save a considerable amount of cash. Saving cash on operational expenses enables us to make business decisions that would have otherwise been cost prohibitive. At Faithlife we have large company problems, with a small company budget.

Network hardware

Not long ago we were exhausting a variety of Cisco and F5 1Gb network gear. Bottlenecks were popping up left and right, packet loss was high, retransmits were through the roof, and changes to network hardware happened at a glacial pace. We were beyond the limits of 1Gb, our topology was problematic, and shortcuts were continually being taken in order to keep up with the demand of our sites and services. At the same time, we had just begun the process of moving to Puppet and automating our server deployments, which meant we could easily outpace network changes. Additionally, the gear did not fit our hardware philosophy.

Fast forward to today: our current data center topology is a modified spine and leaf, or “folded Clos”, design. We use OSPF to route traffic between cabinets, and a pair of leaf switches is placed in each cabinet. The leaf switch pairs represent a layer 2 boundary and allow us to MLAG our servers to maintain switch redundancy within the layer 2 boundary. In addition, a pair of spine switches is placed in an end-of-row networking cabinet. We have multiple edge routers and firewalls connected to an area border router via OSPF. Furthermore, the edge routers are connected to ISPs via BGP.

Spine

Dell S6000-ON and Penguin Arctica 3200XL — both run Cumulus Linux

  • 32 Ports of 40Gb QSFP+

Leaf / Area Border Router

Dell S4810-ON and Penguin Arctica 4804X — both run Cumulus Linux

  • 48 Ports of 10Gb SFP+ plus 4 ports of 40Gb QSFP+
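As a back-of-the-envelope illustration of what those port counts mean for a leaf, the sketch below computes the ratio of server-facing bandwidth to uplink bandwidth. The helper name oversubscription is hypothetical, and treating all four QSFP+ ports as spine uplinks is an assumption; the post does not say how the 40Gb ports are split between uplinks and MLAG peer links.

```shell
# oversubscription DOWN_PORTS DOWN_GB UP_PORTS UP_GB
# Hypothetical helper: prints server-facing vs. uplink bandwidth for one leaf
# and their ratio, truncated to one decimal place (integer arithmetic only).
oversubscription() {
  down=$(( $1 * $2 ))
  up=$(( $3 * $4 ))
  tenths=$(( down * 10 / up ))
  echo "${down}Gb down / ${up}Gb up = $(( tenths / 10 )).$(( tenths % 10 )):1"
}

# Port counts from the leaf spec above; using all four QSFP+ ports as uplinks
# is an assumption, not something the post states.
oversubscription 48 10 4 40
```

Under that assumption the script prints "480Gb down / 160Gb up = 3.0:1"; reserving two of the QSFP+ ports for something else would double the ratio.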

Management

Penguin Arctica 4804i — running Cumulus Linux

  • 48 Ports of 1Gb plus 4 ports of 10Gb SFP+

Edge Router / Firewall

Dell R610 1U Servers:

  • Dual Intel X520-DA2 NIC with Intel SFP+ optics
  • Dual Intel X5650 CPU
  • 96GB of RAM (Helps with Internet routing tables, IPS, firewall states, etc.)

Routers run Ubuntu Linux with Quagga for OSPF and BGP.

Firewalls run pfSense (FreeBSD based) with Quagga for OSPF, and Suricata for IPS.

Cables

Amphenol 10Gb SFP+ DAC

Amphenol 40Gb QSFP+ DAC

Amphenol 40Gb QSFP+ Active Optical

FiberStore multi-mode fiber

Transceivers

FiberStore 10Gb SR Optics

Intel 10Gb SR and LR Optics (for compatibility with X520-DA2 cards)

Seattle Data Center Network Cabinet

(please excuse the screwdriver and loose fiber, this was a work in progress at the time)

SeattleNetworkRack

SATApocalypse

Storage unavailability Friday November 21st – 26th, 2014

I’d like to apologize for the trouble you undoubtedly had accessing Faithlife products and services between November 21st and 26th. The reliability and availability of Faithlife products and services is critical to your success and ours. Understanding what happened is a necessary step towards reducing the probability of this type of event happening again.

Summary of events (All times approximate)

4:00 PM Pacific on November 21st, a storage pool in our Bellingham data center had three of fifty-five drives marked down and out due to a failure to respond within five minutes to the rest of the cluster. Since our storage pools are configured to be triple-redundant, the cluster began a rebalance of its data to ensure the triple-redundant guarantee. Normally, a three drive failure and rebalance would be a minor inconvenience. Unfortunately, so many virtual machines had been provisioned on this pool during the Logos 6 launch that IOPS demands on the pool were already at or above the pool’s capability. The result was slow, but available disk. The three problematic disks were identified, but our logs and monitoring software did not point to an actual disk failure. The problem disks were manually marked down and out to prevent them from coming back in the cluster. Since there was plenty of redundant disk and things were functional, the plan was to replace the problem disks the next morning.

10:45 PM Pacific on November 21st, the rebalance stalled and disk operations were extremely degraded. Stalled object storage daemons were re-started one at a time. The rebalance continued and storage was somewhat usable again.

2:30 AM Pacific on November 22nd, four more drives were marked down and out. Enough disks had been lost that a large portion of the storage pool was experiencing paused disk operations as a protection against data loss. This event took a large portion of our web infrastructure down and left only a few systems able to function in a degraded state. Our monitoring systems did not produce data that suggested these disks were unhealthy. However, operating system logs pointed to problems with the XFS partitions. Further investigation showed that the disk controllers marked these four drives as critical and that one of the controllers had its battery backed cache die. The four failed drives were manually marked down and out, and we headed to the data center to build up a new storage pool node. This node was to take the place of the failed drives, allow the cluster to start healing, and unpause disk operations. We also planned to immediately replace the controller with the failed battery backed cache.

6:30 AM Pacific on November 22nd, when replacing the failed battery backed cache, the power cord for one of the active and healthy nodes was accidentally pulled. The node was immediately brought back into service, but the sudden power loss resulted in two journal partitions becoming corrupt and the loss of the object storage daemons backing them. This brought the lost drive count to nine of fifty-five and furthered the degraded state of the pool. The battery backed cache was properly replaced, and the new storage node was added by approximately 9:30 AM Pacific. The rebalance was able to continue, but at a very slow rate. We estimated it would take thirty-six hours before the pool was in a usable state. All available production resources had been consumed by the Logos 6 launch, and the decision was made to pull all resources from our on premise lab and build out a parallel cloud deployment. This would allow us to quickly replace affected virtual machines while the storage pool recovered. In the meantime, virtual machines hosted by the affected storage pool were shut down to prevent them from servicing live requests when they were periodically available.

9:00 PM Pacific on November 22nd, gear was obtained, racked and provisioned at our Bellingham data center. Proclaim and Commerce related sites and services were chosen as first recipients of the new deployment.

11:00 PM Pacific on November 22nd, all of Proclaim and its related sites and services were functional. Commerce related virtual machines were provisioned and awaiting final configuration and code deployment. Other sites and services were provisioned and deployed as new hardware became available in the following days.

Between November 24th and November 25th, functionality had been restored to all but our Exchange deployment. We did not want to restore Exchange from backup onto the alternative deployment because it meant losing some email. Our efforts turned entirely to successful recovery of the storage pool.

The storage pool rebalance had essentially finished, but writes were still paused. The pool had five incomplete and stuck placement groups, and hundreds of slow requests. Hope of a normal recovery was gone and we began working through documentation for troubleshooting slow requests and incomplete placement groups.

The documentation pointed us at four possible causes: a bad disk, file system/kernel bug, overloaded cluster, or an object storage daemon bug. It also proposed four possible resolutions: shut down virtual machines to reduce load, upgrade the kernel, upgrade Ceph, or restart object storage daemons with slow requests. Disks were replaced, virtual machines were already shut off, and Ceph was upgraded. Upgrading the kernel was not an appealing option because restarts would be required. Restarts meant either letting a rebalance happen while the drives went away, or placing the cluster in a no-recover state. Further rebalancing would put more stress on disks and put us at risk of losing more drives. Putting the cluster in a no-recover state, even momentarily, seemed inappropriate. Since it appeared that the five incomplete placement groups were causing the paused writes, the decision was made to mark the placement groups lost and deal with any potential data loss. Unfortunately, the cluster refused to respect marking these placement groups as lost. At this point we worked on the assumption that we’d hit a bug in Ceph and engaged the Ceph IRC channel, which proved unhelpful.

We felt as if our options consisted of digging into Ceph source code, or engaging InkTank support. We felt it necessary to make engaging InkTank support the first step. We were lucky enough to get six hours of free support from InkTank while they set up our newly purchased support contract. Their engineer walked through many of the same steps we had, and we were able to provide them with output and logs that accelerated their troubleshooting. It was decided by the InkTank engineer that we had hit a bug in Ceph and potentially an XFS bug in the particular Linux kernel used on this storage pool. The five placement groups in question were not assigned to any storage pools, which is a state that should never happen. After talking with Ceph developers, the InkTank engineer provided us with steps to work around the bug.

Unfortunately, the resolution included losing the data stored on the five placement groups. The data loss materialized as lost sectors to virtual machines, which meant running fsck/chkdsk on hundreds of virtual machines. The other fallout is that the Exchange databases needed a lot of repair.

How we’re changing

Try as they may to be redundant, OpenStack and Ceph architecturally force non-obvious single points of failure. Ceph is a nice transition away from traditional storage, but at the end of the day it is just a different implementation of the same thing. SAN and Software Defined Storage are all single points of failure when used for virtual machine storage. OpenStack enabled us to scale massively with commodity hardware, but proved unsustainable operationally speaking.

Starting with our emergency cloud deployment, we’ve moved away from OpenStack and centralized storage. Instead, we’ve gone with Joyent’s SmartDataCenter 7. SmartDataCenter 7 has made some key architectural decisions that better fit with our infrastructure philosophies. Simply put, each physical host in SmartDataCenter 7 is capable of surviving on its own as long as power and network are available.

Even great products like SmartDataCenter 7 can’t run if our data center suffers a power, cooling, or connectivity failure, which is why we’ve been working hard the last few months to get our brand new Seattle-area data center online. Not only will we have redundant hardware in different geographic locations, we’ll also have far more Internet connectivity in Seattle. This will result in reduced latency for our customers and the ability to withstand routing failures at the Internet Service Provider level.

Over the last year, the Development and Operations departments at Faithlife have had a large cultural shift which includes increased collaboration, shared responsibility, the removal of artificial boundaries that create “not my problem” scenarios, and making tooling, automation and alerting first class products. Still, we have a lot of room for cultural growth. Admittedly, we knew our storage was running at or above its capability before disaster struck. However, because of a cultural tension between old and new, there was a real fear that changing anything at this critical time was more risky than just leaving things alone. This is a fallacy and Baron Schwartz points this out better than I can in his blog post Why Deployment Freezes Don’t Prevent Outages.

Our Operations team went through a full year of being stretched mentally and physically. Not only is that not healthy for Faithlife’s employees, it reduces our quality of work and decision making. So we’re adding more people to the teams that support Faithlife’s infrastructure and making a proper work / life balance one of the most important goals for 2015.

I can’t say enough about the amazing team we have here at Faithlife. Operations and Development came together and worked an insane number of hours to mitigate and solve this massive problem in a very short amount of time.

Thank you

Many of our customers left encouraging feedback in our forums during this outage, and I want to thank you all for that. The encouraging feedback was an uplift during a very trying time. Furthermore, thank you all for your business and understanding.