the(art).of << fine.code

Technical Blog from the team at Redbubble.

Continuous Delivery at Redbubble

I recently had cause to reflect on the fact that our engineering teams regularly release code changes into production multiple times within a single day and felt it worthwhile to elaborate on what that means and how it has changed the way we work as a business.

This practice is known as Continuous Delivery. Continuous Delivery is a logical extension of the Agile philosophy, especially the practices of Extreme Programming. It requires a high degree of automation and a trivial deployment process. It also requires courage and a willingness not to take the easy way out in your overall processes.

Most deployment mechanisms can actually be run multiple times within a single day - it would be rare for a website deployment to take more than a full day to complete. However, a mechanism that can run that frequently is not necessarily one that gets used that frequently. The challenge is that, to release changes confidently as often as your mechanisms allow, many parts of your process need to be aligned.

I will give a brief walkthrough of our overall process, starting at the tail end and tracking back to the beginning.

Deployment

To release a new version of the application, we run the following command:

cap production deploy

This kicks off a series of operations on the deployment environment:

  • Update the Git repositories on each machine
  • Copy the code tree to a newly created release directory
  • Link in the environmental configuration
  • Bundle gems
  • Compile the assets
  • Make the new release the ‘current’ one
  • Notify New Relic
  • Restart unicorns
  • Restart background worker processes
  • Notify Airbrake
  • Tag the revision
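
As a rough illustration of the "copy to a new release directory, then make it the current one" steps above, here is a minimal Ruby sketch. It is not Capistrano's actual implementation: the `activate_release` helper, the `releases`/`current` layout and the release names are assumptions made for the example. The point it shows is that the `current` symlink is swapped with a rename, so it never points at a half-copied release.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: copy a code tree into its own release directory and
# then repoint the `current` symlink at it.
def activate_release(deploy_root, source_dir, release_name)
  release_dir = File.join(deploy_root, "releases", release_name)
  FileUtils.mkdir_p(release_dir)
  # "source/." copies the contents of the directory, not the directory itself
  FileUtils.cp_r(File.join(source_dir, "."), release_dir)

  # Create the new symlink next to the old one, then rename it into place so
  # that `current` always points at a complete release.
  current = File.join(deploy_root, "current")
  staging_link = "#{current}.new"
  FileUtils.rm_f(staging_link)
  File.symlink(release_dir, staging_link)
  File.rename(staging_link, current)
  release_dir
end

# Usage against a throwaway directory
Dir.mktmpdir do |root|
  checkout = File.join(root, "checkout")
  FileUtils.mkdir_p(checkout)
  File.write(File.join(checkout, "VERSION"), "v2")
  activate_release(root, checkout, "20120101120000")
  puts File.read(File.join(root, "current", "VERSION"))
end
```

Because every release lives in its own directory, going back to the previous version is a matter of pointing `current` at the older directory again.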

If we look in New Relic, we see the release marked on the chart, which lets us check that everything is going okay. New Relic tracks an error rate, so we can see whether errors increase as a result of the release and take appropriate action (this happens rarely in practice). Here is an example of a typical day:

Verification of the release is done by looking at Airbrake and New Relic. Airbrake captures all exceptions that the system raises, collates them and sends an alert email for the first instance of an error. Notifying Airbrake as part of the release clears the known error list, so you can see which errors have occurred since the latest release. By default, Airbrake captures controller errors, and it provides an API for sending errors explicitly, which we use for background processes. As a result, we very rarely need to look in logs for errors. Here are some screenshots of the overall list and what an error looks like:

When a release contains a database schema change, we run the following command:

cap production deploy:migrations

Release preparation

Before we can push out a release, we need to go through a validation process, which looks like this:

  1. Commit changes on the feature branch
  2. Continuous Integration server build succeeds for the feature branch
  3. Create GitHub pull request to stage the change
  4. Test on staging environment
  5. Merge pull request to master
  6. Continuous Integration server build succeeds on master branch
  7. Ready to deploy to production

We use feature branches for every change and keep them very short-lived (days, not weeks); even one-line changes are done in a branch. The isolation is important when you might be coordinating five or more changes in a day, and the overhead of branching is negligible with Git. The actual steps in the preparation are not innovative; the trick is to keep the flow to less than an hour so that even the tiniest change can follow the same approach as bigger ones. That in turn requires a fast build and a fast Continuous Integration server.

Primarily though, it requires you to develop in the smallest increments possible, and this is where you need to avoid taking the easy way out.

Development

In practice, the chunking of development work tends to gravitate over time towards the shortest time between deployments. If your release cycle is four weeks, you’ll end up thinking about work in granularities of 2-4 weeks. If you can reduce the minimum time between deployments to zero, then you end up inverting your thinking. Instead of working out what you can fit into the release, you start to think about how small you can make the releases.

This means for example that you can:

  • perform a refactor and release it before you start the story that required it
  • release a two-line change five minutes after you discovered you need it
  • incrementally performance tune controller actions with real traffic
  • dark launch features so that you can use them with production data or show them to beta users

Dark launching with feature toggles is what helps alleviate the problems of large features and long-lived branches. We isolate the feature behind a switch (or just don’t link to it anywhere) and keep pushing out updates without affecting users or diverging from master.

Here is what the code for the switch looks like in practice:

if experiment_enrolment_for("feature_name").variant?
  # variant code goes here
else
  # normal code goes here
end

We have a screen which shows the features that exist and allows an administrator to turn each one on for themselves. It also contains information about the experiment, which I won’t cover in detail at this point.
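
To make the idea of a per-user switch concrete, here is a minimal in-memory sketch. It only borrows the `experiment_enrolment_for(...).variant?` shape from the snippet above. The `FeatureToggles` class, the `Enrolment` struct, the explicit `user` argument and the feature name are all assumptions for illustration, and a real system would persist the state rather than hold it in memory.

```ruby
require "set"

# Hypothetical stand-in for whatever object the real enrolment lookup returns.
Enrolment = Struct.new(:feature, :enabled) do
  def variant?
    enabled
  end
end

# Hypothetical toggle store: remembers which users have each feature on.
class FeatureToggles
  def initialize
    @enabled = Hash.new { |hash, feature| hash[feature] = Set.new }
  end

  # What an administration screen would call when a user flips the switch
  def turn_on(feature, user)
    @enabled[feature] << user
  end

  def turn_off(feature, user)
    @enabled[feature].delete(user)
  end

  def experiment_enrolment_for(feature, user)
    Enrolment.new(feature, @enabled[feature].include?(user))
  end
end

# The dark-launched code path is only taken for users who switched it on.
def checkout_page(toggles, user)
  if toggles.experiment_enrolment_for("new_checkout", user).variant?
    "new checkout"
  else
    "old checkout"
  end
end

toggles = FeatureToggles.new
toggles.turn_on("new_checkout", "admin@example.com")
puts checkout_page(toggles, "admin@example.com")
puts checkout_page(toggles, "shopper@example.com")
```

Everyone else keeps getting the normal code path, which is what allows unfinished work to ship to production without affecting users.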

The other powerful thing that small releases open up is the realm of multiple-stage releases. What are these? Primarily, they occur around database refactorings when you want zero downtime. Let’s take the example of breaking a table into two (or more) tables, such as might occur when you have too many columns and your model is starting to get bloated. Here are the releases you need to do:

  1. Release migration to create the new table
  2. Release code which double writes to both the new and old tables
  3. Copy the rows written before double-writing began from the old table to the new tables
  4. Release code which only reads from the new tables
  5. Release migration to drop old columns (being mindful of locks on large tables)

Voila! Table refactoring with zero downtime.
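
The sequence above can be sketched with plain hashes standing in for the tables. This is an illustration only, not the code we run: the `SplitTableStore` class and the `bio` column are hypothetical, and in a real system each step would ship as its own release. The detail worth noticing is that the backfill must not overwrite rows that double-writing has already put into the new table.

```ruby
# Hypothetical model of moving a `bio` column out of a wide table.
class SplitTableStore
  attr_reader :old_table, :new_table

  def initialize(old_rows = {})
    @old_table = old_rows   # the existing, bloated table
    @new_table = {}         # step 1: the new table exists but is empty
    @read_from_new = false
  end

  # Step 2: every write goes to both the old and the new table.
  def write(id, bio)
    @old_table[id] = (@old_table[id] || {}).merge(bio: bio)
    @new_table[id] = { bio: bio }
  end

  # Step 3: copy rows that predate double-writing, keeping newer rows intact.
  def backfill
    @old_table.each do |id, row|
      @new_table[id] ||= { bio: row[:bio] }
    end
  end

  # Step 4: switch reads over to the new table.
  def read_from_new!
    @read_from_new = true
  end

  def read(id)
    row = (@read_from_new ? @new_table : @old_table)[id]
    row && row[:bio]
  end
end

store = SplitTableStore.new(1 => { name: "Ann", bio: "painter" })
store.write(2, "illustrator")   # written after double-writing started
store.backfill
store.read_from_new!
puts store.read(1)
puts store.read(2)
```

Once reads come only from the new table, dropping the old columns (step 5) no longer affects any running code.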

Planning

When you have a pipeline to production, your planning complexity drops dramatically. Depending on your user base, you can roll out multi-page changes progressively rather than in one hit. You may decide to be continually releasing behind a feature toggle (dark launching) and have your staff using the features as they develop, in which case the final release becomes a single line change to permanently set the toggle to on. We’ve built up a reasonable set of patterns around dark launching: administration screens for toggling features on and off for staff users, and landing pages with switches so that we can share the beta (or alpha!) feature with trusted users.

Planning begins to focus much more on validating ideas, which inevitably drives how we do product development. Experimentation becomes part and parcel of our regular development cycle. I will cover this in more detail in a subsequent post.

Team Ethos

There is a really important aspect to making this whole thing work that should be clearly stated. We have no operations staff who get handed a release to deploy. There are no testers taking a developer’s code and making sure it works. I haven’t seen a business analyst for nearly four years. Nor project managers wielding timelines and telling people what to do.

What we have is a strong belief that those who release the code should feel ownership of the change that has just gone out. To achieve that, everyone has to feel responsibility for the end-to-end process. An engineer studies a problem, devises a solution and then validates that the problem is now solved. To that end, every member of the team performs all parts of the software development process. Engineers analyse and research user behaviour to look for opportunities.

The outcome of this team dynamic is that very little coordination is required for a release. Teams can be the smallest possible size, allowing for pairing, as every member sees analysis, coding, testing and releasing as their responsibility. The standard we hold ourselves to is that a good idea had in the morning should be able to be in front of users before the day ends.

Measuring a Few Things With Statsd

A little while ago, we realised that having information about what was happening on Redbubble right now would be useful for many things, including tracking and alerting us to the sorts of issues that wouldn’t necessarily be caught by more traditional means (such as Airbrake alerting).

Having come across this post by Etsy, we looked at statsd as a way to collect this information in a quick way without adding much processing overhead.

Essentially, the statsd daemon runs on a host (or a number of hosts), and you can push all sorts of data directly to it via a UDP port (so it’s extra fast). We then have statsd periodically (e.g. every 5 minutes) push its aggregated data out to Circonus, which can graph and alert based on that data.

Calls from Ruby code are extremely simple, for example:

StatsdClient.increment("Some interesting counter")
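
Under the hood, a call like this only has to send one short line of text over UDP. The sketch below shows roughly what such a client could look like. `TinyStatsd` is a hypothetical name and not the `StatsdClient` used above, whose internals are not shown here. It assumes the standard statsd line format of `name:value|c` for counters and `name:value|g` for gauges.

```ruby
require "socket"

# Hypothetical minimal statsd client for illustration only.
class TinyStatsd
  def initialize(host, port)
    @host = host
    @port = port
    @socket = UDPSocket.new
  end

  # Counters are sent as "name:value|c"
  def increment(name, by = 1)
    send_line("#{name}:#{by}|c")
  end

  # Gauges are sent as "name:value|g"
  def gauge(name, value)
    send_line("#{name}:#{value}|g")
  end

  private

  # Fire and forget: a metrics failure should never break the caller.
  def send_line(line)
    @socket.send(line, 0, @host, @port)
  rescue SocketError, SystemCallError
    nil
  end
end

# Usage (the host and port here are placeholders)
stats = TinyStatsd.new("127.0.0.1", 8125)
stats.increment("orders.placed")
```

UDP involves no handshake and no waiting for a reply, which is why pushing a metric adds so little overhead to the code that reports it.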

What if you want to monitor something, and don’t even want to start up a Rails environment to do it? We use RabbitMQ as a messaging system between some of the components at Redbubble, and we wanted to graph the message levels of several queues we have set up. To do this, we wrote a simple bash script:

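#!/bin/bash
# Push the message count of selected RabbitMQ queues to statsd as gauges.

queues=("queue.widgets" "queue.frobbles" "queue.gizmos")
# rabbitmqctl prints one "name count" row per queue
DATA=$(sudo rabbitmqctl list_queues)
STATS=""
for queue in "${queues[@]}"
do
  # Anchor the match so a longer queue name with the same prefix is not picked up
  GAUGE=$(echo "$DATA" | grep "^${queue}[[:space:]]" | awk '{print $2}')
  STATS="$STATS${queue}:$GAUGE|g"$'\n'
done

# Send every gauge to statsd over UDP
echo "$STATS" | nc -w 1 -u statsd.hostname.com 8125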

(The queue names have been changed to protect the innocent…)

What this script does is take the output of the rabbitmqctl command:

Listing queues ...
queue.widgets 1
queue.frobbles    3
queue.gizmos  0

and uses grep and awk to convert it into a format statsd understands:

queue.widgets:1|g
queue.frobbles:3|g
queue.gizmos:0|g

We then just use netcat (nc) to push this to UDP port 8125, where statsd is listening. This script can be run regularly via a cron job, say every minute, with very little overhead or startup time.

Without needing to do any more configuration, these gauges start to show up in Circonus, and we can then show graphs of the data:
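
For comparison, the same text transformation can be written as a small pure function, which makes it easy to test without a running RabbitMQ. This is only a sketch of the conversion step: `queue_gauges` is a hypothetical helper, and it assumes the two-column `name count` output shown above.

```ruby
# Turn `rabbitmqctl list_queues` output into statsd gauge lines for the
# queues we care about. Header and blank lines are skipped because their
# second column is not a number.
def queue_gauges(rabbitmqctl_output, wanted_queues)
  depths = {}
  rabbitmqctl_output.each_line do |line|
    name, count = line.split
    depths[name] = count if count.to_s.match?(/\A\d+\z/)
  end
  wanted_queues.select { |queue| depths.key?(queue) }
               .map { |queue| "#{queue}:#{depths[queue]}|g" }
end

sample = <<~OUTPUT
  Listing queues ...
  queue.widgets 1
  queue.frobbles    3
  queue.gizmos  0
OUTPUT

puts queue_gauges(sample, ["queue.widgets", "queue.frobbles", "queue.gizmos"])
```

Queue names are compared exactly here, so a queue whose name merely starts with one of the wanted names is never reported by mistake.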

Easy!

Redbubble’s Rocky Road to Rails 3.0 and Ruby 1.9

The task

Here at Redbubble we’ve been running Ruby on Rails since day one. We’re a small development team, so keeping up with even the latest stable release has been a struggle. Earlier this year we had a gap in product development and took the opportunity to move our stack forward. We’d been on Ruby 1.8.7 and Rails 2.3 for a year or two. After some investigation we decided we’d first move to Rails 3.0 which would make Ruby 1.9 an easier option. After significant library updates and code compatibility changes, we were ready to release Rails 3.0 on Ruby 1.8.7.

A First Hack Day at Redbubble

Over the past few months the team had been getting enthused over the idea of a Hack Day, and there had been much discussion on how to make it happen. Very few of us had first-hand experience with the concept so, taking the philosophy of walk before you run, we decided to do a small-scale run within the Engineering team before moving on to a bigger event. So we nominated a date, started listing some projects and, come the day, ended up with three teams of two to get going.

The day started with a brief kickoff session in which each team gave an outline of their project and what they hoped to achieve, taking feedback from the rest of the group. Building then commenced and continued until we broke for lunch and a chat, though not for too long, as everyone seemed quite keen to push on and reach a conclusion before the end of the day. The day ended with short demonstrations from each team.

We ended up with three quite diverse projects, some of them quite close to our core product, which was an interesting outcome. I caught up with each team to get a rundown on how the day went for them.

Nokogiri Goes Bump (or Segfaults) in the Night…

Recently we’ve been working on upgrading the version of Ruby the Redbubble site runs on, from 1.8.7 to 1.9.3. We’re doing this for a number of reasons, including improved performance, new language features, and trying to stay relatively current with our tech.

Not for Recycling by Flibble

Things went fairly smoothly until we hit a problem where we could get one of our rspecs to segfault (i.e. actually crash the ruby interpreter) every time we ran it on our (OSX Lion) development machines:

Discovering Ruby 1.9.3 YAML Parser Performance

This is a tale of one of those times where it is actually useful to do code profiling and benchmarking. A bit of background: We’ve been in the process of upgrading our Ruby on Rails stack here at RedBubble, going from Rails 2.3 to 3.0, with an eye to moving forward to 3.2. Through this, to minimise the amount of change happening at once, we’ve stuck with Ruby 1.8.7, instead of jumping right to 1.9.3.

When we rolled out Rails 3, we noticed a significant slowdown in the general performance of the site, even though the benchmarks we had run at the time indicated that it should be okay. It has been widely reported that Ruby 1.9.3 has, in general, far greater performance than 1.8.7, and we wanted to know if moving to the newer language version would bring us the sorts of performance improvements we were looking for.

Clearly some more investigation was required. We chose one of the slower pages on the site; the shop page (e.g. http://redbubble.com/shop/recent+t-shirts) accounts for 47% of all the processing time on RedBubble. This is due to its complexity - it displays multiple “configured” products, multiple images with links, uses solr to return results from the search terms, hits the database, and renders several partials with fairly complex markup.

Solr Tuning at Redbubble

Recently we noticed that one of our Solr slave processes (a piece of software we use to power the search feature at RedBubble) was taking longer to respond to search queries than usual - enough to trigger a warning alert on our monitoring systems.

On first inspection, we saw that the machine was using a lot of memory - enough to go into swap space, but this was sitting around the same level as our other server, which was not experiencing problems.

A little more digging, and we noticed, thanks to the JVM monitoring provided by NewRelic, that the Solr process was spending a lot of time (around 15% on average, but up to 50%) doing garbage collection. This means that 15% of the time, Solr is busy not answering search queries. Point (1) on the diagram shows this: