Using Docker to Manage Erlang Environments for Riak

Basho packages its own fork of Erlang/OTP along with Riak and Riak CS. The fork is typically an older, stable Erlang/OTP release with a few patches applied. Eventually, the patches included in the Basho fork are merged into later official Erlang/OTP releases.

If you’re installing Riak and Riak CS from a package, then all of the hard work that surrounds bundling a custom version of Erlang/OTP has been taken care of for you. On the other hand, if you are installing Riak or Riak CS from source, then you may want to install the forked version of Erlang/OTP as well.
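
If you do go the source route, building the fork looks much like any other Erlang/OTP source build, pointed at Basho’s otp repository. The following is a rough sketch; the tag name is an assumption, so check the repository for the exact tag that matches your Riak version:

$ git clone https://github.com/basho/otp.git
$ cd otp
$ git checkout R16B02_basho5
$ ./otp_build autoconf
$ ./configure && make && sudo make install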

Docker

Docker gives us a nice way to set up an isolated environment for installing Erlang/OTP and Riak. More specifically, the docker-basho-otp image simplifies the process by starting you off with an already-built Basho fork of Erlang/OTP. As of this post, the latest custom build of Erlang/OTP is R16B02_basho5, which is meant to be paired with Riak 2.0+.

First, we need to pull down the image that contains R16B02_basho5:

$ docker pull hectcastro/basho-otp

Next, we need to start a container and invoke /bin/bash:

$ docker run -t -i --rm hectcastro/basho-otp /bin/bash

Now, let’s test to make sure that the correct version of Erlang/OTP is available:

$ erl
Erlang R16B02-basho5 (erts-5.10.3) [source] [64-bit] [smp:4:4] [async-threads:10] ...

Eshell V5.10.3  (abort with ^G)
1>

(To get out of this shell, press Control + C and then a for abort, or type q(). at the prompt.)

Riak

Solid Erlang/OTP environment? Check.

Now we need to pull down the Riak 2.0 source code to build what’s referred to as a devrel. A devrel (or development release) automates the creation of 5 separate copies of Riak. After the devrel process is complete, you can start each copy of Riak and join all of the instances into a cluster.

First, let’s clone the Riak repository and checkout the latest Riak 2.0 tag (as of this post, the most recent tag is riak-2.0.0rc1):

$ git clone https://github.com/basho/riak.git
Cloning into 'riak'...
remote: Reusing existing pack: 16251, done.
remote: Counting objects: 6, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 16257 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (16257/16257), 11.90 MiB | 40.00 KiB/s, done.
Resolving deltas: 100% (10241/10241), done.
Checking connectivity... done.
$ cd riak
$ git checkout riak-2.0.0rc1
Note: checking out 'riak-2.0.0rc1'.
HEAD is now at 87b8934... Bump riak to 2.0.0rc1 for internal smoke testing

Next, let’s create the devrel (this step will take a few minutes):

$ make devrel DEVNODES=5

Almost there. The following steps will start all 5 Riak nodes and join them into a cluster:

$ cd dev
$ for node in `ls`; do $node/bin/riak start; done && \
    for n in {2..5}; do dev$n/bin/riak-admin cluster join dev1@127.0.0.1; done
Success: staged join request for 'dev2@127.0.0.1' to 'dev1@127.0.0.1'
Success: staged join request for 'dev3@127.0.0.1' to 'dev1@127.0.0.1'
Success: staged join request for 'dev4@127.0.0.1' to 'dev1@127.0.0.1'
Success: staged join request for 'dev5@127.0.0.1' to 'dev1@127.0.0.1'
$ dev1/bin/riak-admin cluster plan
=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
join           'dev2@127.0.0.1'
join           'dev3@127.0.0.1'
join           'dev4@127.0.0.1'
join           'dev5@127.0.0.1'
-------------------------------------------------------------------------------


NOTE: Applying these changes will result in 1 cluster transition

###############################################################################
                         After cluster transition 1/1
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid     100.0%     20.3%    'dev1@127.0.0.1'
valid       0.0%     20.3%    'dev2@127.0.0.1'
valid       0.0%     20.3%    'dev3@127.0.0.1'
valid       0.0%     20.3%    'dev4@127.0.0.1'
valid       0.0%     18.8%    'dev5@127.0.0.1'
-------------------------------------------------------------------------------
Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Transfers resulting from cluster changes: 51
  12 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
  13 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  13 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  13 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
$ dev1/bin/riak-admin cluster commit
Cluster changes committed
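
If you’d like to double-check the result, riak-admin can report on cluster membership:

$ dev1/bin/riak-admin member-status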

And…we’re done. Say hello to your very own Riak 2.0 cluster, built on R16B02_basho5.

Bootstrapping Private Subnet Instances In A VPC with Knife

Amazon VPC

Amazon Virtual Private Cloud (VPC) is a service that allows you to define an isolated virtual network within EC2. A common scenario involves a VPC with both public and private subnets. Instances within public subnets can send and receive traffic directly to/from the Internet. On the other hand, instances within private subnets cannot receive traffic directly from the Internet and can only send outbound traffic via a NAT instance.

Bastion Host

Given a VPC setup with both public and private subnets, you’ll want at least one SSH bastion host in the public subnet. This host is needed to communicate with instances in the private subnet from your local machine. The diagram below, taken from Amazon’s documentation, helps illustrate:

SSH Bastion with VPC
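
The Knife example below refers to the bastion by the name bastion-dev. Assuming that name resolves through your SSH configuration, a minimal ~/.ssh/config entry might look like this (the IP address is a placeholder):

Host bastion-dev
  HostName 203.0.113.10
  User ec2-user
  IdentityFile ~/.ec2/orgname.pem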

Knife EC2 Example

Using a combination of Knife and the Knife EC2 plug-in, the following command connects directly to the bastion host specified by the --ssh-gateway option. From there another connection is made to the private subnet instance via its private_ip_address in order to bootstrap Chef:

knife ec2 server create --flavor hi1.4xlarge --image ami-08249861   \
  --security-group-ids [SECURITY_GROUP_ID] --tags Name=node1-dev    \
  --availability-zone us-east-1d --subnet [SUBNET_ID]               \
  --node-name node1-dev --ssh-key orgname --ssh-gateway bastion-dev \
  --server-connect-attribute private_ip_address                     \
  --ssh-user ec2-user --identity-file ~/.ec2/orgname.pem            \
  --environment development --ephemeral '/dev/sdb,/dev/sdc'         \
  --run-list 'role[base],role[solr_ssd_slave]'

Depending on how long it takes your run list to converge on a bare operating system, you should have Chef bootstrapped on an instance within the private subnet of a VPC after running only one command!

Lessons Learned After Getting Gotten

A Cool Story

Last Saturday morning I was visiting my parents’ house in North Philadelphia. From there I was headed back into the city for a technology conference. I decided to take public transportation, so I drove to the nearest subway station. I parked my car in a lot next to the station and headed to the platform. A train was already waiting for passengers, so I got on the car closest to the steps that lead to the platform.

I wasn’t dressed like a mark, but I also wasn’t dressed to blend in with my environment. I had an Adidas track jacket on with shorts, flip-flops, and a laptop bag. Looking back, I can see why I may have stood out.

As I walked into the subway car, I noticed that it was completely empty. Having my choice of seat, I uncharacteristically sat in the seats next to the sliding doors that separate subway cars. Unfortunately, this put my back toward anyone coming onto the car. After sitting down, I proceeded to mess with my phone — checking Twitter, or e-mail, or Instagram. The next thing I know, I get punched on the side of the head and my phone is taken right out of my hand.

After seeing the kid run off of the subway car, I decided to chase after him. To set the stage, as soon as you get out of the car there are 6-8 steps up, a landing, another 6-8 steps, turnstiles, 6-8 steps down, a landing, and finally 6-8 more steps down — at this point you’re finally back at street level. By the time I was exiting the car, he was just about to start the second set of 6-8 steps up. I watched him slip on a step, which increased my chances of catching him. Unfortunately, as I approached the first landing one of my flip-flops got tangled and came off. I kicked the second one off and laid my laptop bag down.

By the time I made it to the turnstiles, he was almost at street level. Knowing that I just laid my laptop bag down unattended, I had to choose between going back for it or continuing to chase after my attacker. I chose the laptop. Temporarily defeated, I took a moment to assess the situation and asked one of the operators to contact the police.

As police officers began to show up at the station, it occurred to me that I could possibly track the attacker via Apple’s Find my iPhone feature. I opened my laptop in hopes of using an unprotected WiFi access point, but no cigar. I told one of the officers about the feature and he said that there were people at the station with experience using it. He immediately got on the phone and tried relaying my password to the operator — this took a lot longer than expected.

As soon as the police pinpointed the location of my phone, the officers at the subway station mobilized and took me along in the back seat of a cruiser. The signal led us to a driveway between two rows of row homes. I told the police that the signal was usually not 100% accurate, so the kid could be in any of the surrounding houses. As they surrounded houses, I began looking through nearby trash cans, hoping the kid had decided to toss the phone.

At this point the remote operator offered up a Find my iPhone feature that forces the device to emit a loud noise. The loud noise would give police probable cause to enter a house — otherwise a search warrant would have been required. I was against the idea of forcing the device to make noise, knowing that if we didn’t hear it immediately, the kid would turn the phone off — significantly reducing the chances of recovering it. About three minutes after they triggered the noise, the signal went dark.

Since there was not much else to do, the officers offered to drive me back to my car parked at the subway station. I didn’t feel like attending the conference anymore, so I drove back home instead. As soon as I got there, I began the process of revoking passwords and disabling the SIM card. I had officially gotten got.

Lessons Learned

I like to consider myself pretty street smart. Looking back, there were many things I could have done to make myself less of a target. At the same time, there were a few things I did that made the scenario go a lot smoother than it could have gone.

Pros

  • I had Google two-factor authentication enabled. This allowed me to easily revoke the application-specific passwords created for services on the device.
  • I had Find my iPhone enabled via iCloud. This at least gave me a chance to catch the person who stole my phone. It also allowed me to request a remote lock of the device after it was stolen.
  • I have a different password for every account. This allowed me to give my password to the police over the phone without worrying that anyone who overheard it could compromise any of my other accounts.
  • I kept calm after everything happened, which allowed me to recall the Find my iPhone feature existed.

Cons

  • I had flip-flops on while in a moderately dangerous area. I’m convinced this prevented me from catching my attacker when he slipped on the steps.
  • I got on the subway car closest to the steps. This allowed the attacker to snatch my phone and make a speedy escape. From now on I will only board the first or last subway cars.
  • I had my back to the subway car entrance and was only paying attention to my phone. From now on I’ll never sit in the seats next to the doors that separate subway cars and I’ll limit the amount of time I use my phone on the subway.
  • I did not yell as soon as my phone was taken. This neighborhood is not one known for snitching, but it’s possible that if I had yelled, someone would have done something to slow down the attacker.

Conclusion

My replacement SIM card has arrived and now resides in a Google Galaxy Nexus. I rewired two-factor authentication to it and enabled a passcode lock for the phone itself. The phone’s file system is also encrypted (not the default on Android 4.1 Jelly Bean). Last but not least, I purchased a new pair of sneakers. My days of riding on the subway in flip-flops are over:

New Kicks

Upstart and Changing Users

Occasionally there is a need to start a daemon process as an unprivileged user. If you’re running Ubuntu, a popular tool for the job is Upstart. Before Upstart 1.4, executing processes as an unprivileged user wasn’t directly supported — utilities like start-stop-daemon or elaborate su tricks were required:

exec start-stop-daemon --start --chuid user --exec command

Or without start-stop-daemon:

exec su -s /bin/sh -c command user

Beginning with Upstart 1.4 these techniques are no longer necessary. Enter setuid:

description "Tasseo"
author "Chef"

start on (filesystem and net-device-up)
stop on runlevel [!2345]

respawn
respawn limit 5 30

chdir /opt/tasseo
setuid tasseo

script
  export GRAPHITE_URL="http://graphite.example.com"

  exec ./bin/rackup -p 8080 >> /opt/tasseo/log/tasseo.log 2>&1
end script

emits tasseo-running
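
With the job definition saved as /etc/init/tasseo.conf, Upstart’s tooling takes care of the rest (the process ID below is illustrative):

$ sudo start tasseo
tasseo start/running, process 2345
$ sudo status tasseo
tasseo start/running, process 2345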

EBS, Software RAID, and Speeding Up Resyncs

I’ve been working to ship a product for $WORK that deals with large Apache Solr indexes and Elastic Block Store (EBS). I know what you’re thinking: Why EBS? A lot of people are pretty sour on EBS, and rightfully so — the service has a lot of obvious weaknesses. Despite those weaknesses, I decided to use it because I think my use case mitigates some of its flaws and highlights some of its strengths.

Index Flow

It all starts with a definitive Solr index that lives in an on-premises data center. Changes to it are made directly once a day or once a week, and after those changes are blessed, the index gets transported via Solr replication to a single EC2 node (in Solr terminology, a repeater) backed by a four-volume EBS RAID10 array. Once replication to the repeater is complete, snapshots of the volumes are taken, using xfs_freeze to keep everything consistent.
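
For the curious, the snapshot step looks roughly like the sketch below. The mount point and volume IDs are placeholders, and the EC2 API tools are assumed to be installed:

$ sudo xfs_freeze -f /srv/solr
$ for volume in vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444; do \
    ec2-create-snapshot $volume -d solr-repeater-index; done
$ sudo xfs_freeze -u /srv/solr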

In the end, the index gets served to users from one of many Solr slaves. Slaves are fronted by an Elastic Load Balancer and are members of an Auto Scaling group. The Auto Scaling group spans multiple availability zones and ensures that at least two slaves are always operational.
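
Sketched out with the Auto Scaling command line tools of the era (every name, the AMI, and the instance type are placeholders):

$ as-create-launch-config solr-slave-lc --image-id ami-12345678 \
    --instance-type m1.xlarge --key orgname
$ as-create-auto-scaling-group solr-slave-asg --launch-configuration solr-slave-lc \
    --availability-zones us-east-1a,us-east-1d --min-size 2 --max-size 4 \
    --load-balancers solr-slave-elb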

As Auto Scaling brings slaves up, Chef and Chef Server configure them. Repeater snapshot references are discovered via Chef search and new EBS volumes for slaves are created based on them. As these volumes get reassembled into a RAID10 configuration, a software RAID “resync” is triggered. This process can take several hours and is meant to help confirm that the array is consistent.
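
Creating a slave volume from a discovered snapshot boils down to one EC2 API tools invocation per volume (the snapshot ID and availability zone are placeholders):

$ ec2-create-volume -s snap-1a2b3c4d -z us-east-1d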

Software RAID

Back in college, I used software RAID to protect my personal file server from disk failure. My repurposed desktop computer didn’t have a RAID controller, so the performance penalties of replicating disks in software were easily overlooked. I’m not sure how much software RAID got used by actual businesses back then, but I get the feeling that anyone doing anything serious was using hardware RAID controllers.

Unfortunately, modern-day cloud computing services don’t expose machines with configurable RAID controllers. Instead, most offer seemingly infinite storage, doled out one volume or bucket at a time. If you want your own controlled layer of redundancy and performance, you have to build it in software. Enter software RAID and mdadm.

Dreadful Resyncs

In a read-heavy scenario like this one, reassessing array consistency isn’t a major concern. Given that, I wanted to find a way to reduce or eliminate the resync process altogether. The solution, according to mdadm’s Wikipedia page, is to enable a write-intent bitmap. Unfortunately, enabling a write-intent bitmap must occur before the target RAID device is created. Even worse, Chef’s mdadm resource has no bitmap attribute!

This series of unfavorable circumstances led to a pull request and the following Chef execute resource snippets (to be used until the next version of Chef is released):

execute "create_md0" do
  command "yes | mdadm --create /dev/md0 --chunk=256 --level=10 --metadata=1.2 --bitmap=internal --raid-devices #{devices.length} #{devices.join(" ")}"
  not_if "mdadm --detail --scan | grep /dev/md0"
end

execute "assemble_md0" do
  command "yes | mdadm --assemble /dev/md0 #{devices.join(" ")}"
  not_if "mdadm --detail --scan | grep /dev/md0"
end
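
Once the array is up, it’s easy to confirm that the bitmap took effect (exact output varies by mdadm version):

$ sudo mdadm --detail /dev/md0 | grep -i bitmap
    Intent Bitmap : Internal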

Conclusion

Non-blocking snapshots and new volume creation based off of existing snapshots are useful EBS features. Combine that with primarily read-only access patterns, and using EBS doesn’t look so bad. Variable I/O performance is still a concern, but RAID10 helps smooth some of that out. In addition, enabling a write-intent bitmap at software RAID creation time makes array reassembly almost instantaneous.

Now, instead of operating in a degraded I/O state for hours, slaves can aggressively service user requests in minutes. What a great competitive advantage! At least until Amazon releases an RDS-style service for full-text search — I can’t imagine that’ll be anytime soon.