Channel: Sam Saffron's Blog - Latest posts

Live restarts of a supervised unicorn process

We have all seen this dreaded screen before.

In the Rails case this usually happens during application restarts.

While Discourse is rapidly evolving, we heavily encourage users to upgrade frequently, even weekly. If your site regularly errors out during upgrades, users very quickly lose confidence. Ideally you want zero-downtime deploys, which in turn encourage users to deploy more often.

Unicorn has built-in support for live restarts; however, getting this to play well with a supervisor such as runit is not easy. The underlying pids change and things get complicated fast.

To tackle this I decided to create a simple bash script that acts as a mini-supervisor for unicorn.

However, before any of this I needed some sane way of measuring how well I did.

Measuring uptime during a live restart

Traditionally you would use Apache Bench (ab) for quick and dirty testing; however, it did not fare well for me. Unfortunately ab has no way of "throttling" the number of requests it sends out. To measure uptime we need to send a request to the site every N milliseconds.

I ended up knocking up a quick and dirty apache bench clone that allows me to trickle through requests:

require "optparse"
require "uri"
require "net/http"

duration = 10
per_second = 10

opts = OptionParser.new do |opts|
  opts.banner = "Usage: bench_web [options] url"

  opts.on("-t", "--time TIME", OptionParser::DecimalInteger, "Duration to run the test in seconds (default 10)") do |t|
    duration = t
  end

  opts.on("-p", "--per-second REQUESTS", OptionParser::DecimalInteger, "Max number of requests per second (default 10)") do |t|
    per_second = t.to_f
  end

end

opts.parse!

if ARGV.length != 1
 puts opts.banner
 puts
 exit(1)
end


uri = begin
        URI(ARGV[0])
      rescue
        puts opts.banner
        puts
        puts "Invalid URL"
        puts
        exit(1)
      end

GC.disable

finish_time = Time.now + duration
results = []
while (start=Time.now) < finish_time
  res = Net::HTTP.get_response(uri)
  req_duration = Time.now - start
  results << {duration: req_duration, code: res.code, length: res.body.length}

  GC.enable
  GC.start
  GC.disable

  padding = (1 / per_second.to_f) - (Time.now - start)
  if padding > 0
    sleep padding
  end
  putc "."
end

GC.enable

puts
puts "Results"
puts "Total duration: #{duration} second#{duration==1?"":"s"}"
puts "Total requests: #{results.length}"

summary = results.group_by{|r| r[:code]}.map{|code, array| [code, array.count]}.sort{|a,b| a[1] <=> b[1]}

# Seed the sum with 0 so an empty result set gives 0 failures instead of nil
failures = summary.map{|code, count| code == "200" ? 0 : count}.inject(0, :+)

if failures > 0
 puts "Estimated downtime: #{((failures.to_f * (1.to_f / per_second)) * 1000).to_i}ms"
end

puts
puts "By status code: #{summary.map{|code,count| "[#{code}]x#{count} "}.join}"

puts ""

puts "Percentage of the successful requests served within a certain time (ms)"

good_requests = results.find_all{|r| r[:code] == "200"}.map{|r| r[:duration]}.sort

if good_requests.length > 0
  [25,50,66,75,80,90,95,98,99,100].each{ |percentile|
    time = good_requests[((percentile.to_f / 100.0) * (good_requests.length-1)).to_i]
    puts "  #{percentile}%\t\t#{(time * 1000).to_i}"
  }
end

For example, if I run it against a site that is restarting without any fancy help I can see:

$ ruby ./bench_web.rb -t 10 http://l.discourse/
Total duration: 10 seconds
Total requests: 82
Estimated downtime: 3700ms

By status code: [502]x37 [200]x45 

Percentage of the successful requests served within a certain time (ms)
  25%		16
  50%		18
  66%		19
  75%		20
  80%		20
  90%		21
  95%		21
  98%		58
  99%		58
  100%		1854

Not too good: that is an estimated 3.7 seconds of downtime while restarting the process.
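The downtime figure is simple arithmetic: each failed request is assumed to stand for one full request interval of downtime. Here is a minimal sketch of that calculation. The helper name `estimated_downtime_ms` is hypothetical and not part of the script above, and it rounds where the script truncates.

```ruby
# Hypothetical helper mirroring the estimate printed by the benchmark script.
# Assumption: every non-200 response represents one request interval
# (1 / per_second seconds) during which the site was unavailable.
def estimated_downtime_ms(failures, per_second)
  interval_seconds = 1.0 / per_second
  (failures * interval_seconds * 1000).round
end

# 37 failed requests at the default rate of 10 requests per second
puts "Estimated downtime: #{estimated_downtime_ms(37, 10)}ms"
```

The number is only an estimate. The run above completed 82 requests in 10 seconds rather than 100, so some requests took longer than the 100ms interval.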

Supervising unicorns

The standard way to do live restarts with unicorn (assuming you are preloading the app) is to send a USR2 signal to the master process, wait for it to launch a new master, and then send a TERM to the old master. However, this plays really badly with supervisors that need pids not to change.

To work around this I created a simple bash script that acts as a proxy. It has a stable pid and takes care of signalling and restarting the unicorn it is running. Send it a USR2 and it will initiate the live restart.

#!/bin/bash

# This is a helper script you can use to supervise unicorn, it allows you to perform a live restart
# by sending it a USR2 signal

LOCAL_WEB="http://127.0.0.1:3000/"

function on_exit()
{
  kill $UNICORN_PID
  echo "exiting"
}

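function on_reload()
{
  echo "Reloading unicorn"
  # USR2 tells the current master to start a new master alongside itself
  kill -s USR2 $UNICORN_PID
  # give the new master time to boot
  sleep 10
  # make one local request before retiring the old master
  curl $LOCAL_WEB &> /dev/null
  # the new master is the child of the old master that is not a worker
  NEW_UNICORN_PID=$(ps -f --ppid $UNICORN_PID | grep unicorn | grep -v worker | awk '{ print $2 }')
  # stop the old master and start tracking the new one
  kill $UNICORN_PID
  echo "Old pid is: $UNICORN_PID New pid is: $NEW_UNICORN_PID"
  UNICORN_PID=$NEW_UNICORN_PID
}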

export UNICORN_SUPERVISOR_PID=$$

trap on_exit EXIT
trap on_reload USR2

unicorn -c $1 &
UNICORN_PID=$!

echo "supervisor pid: $UNICORN_SUPERVISOR_PID unicorn pid: $UNICORN_PID"

while [ -e /proc/$UNICORN_PID ]
do
  sleep 0.1
done

Then, at any point in time, I can perform a coordinated live restart by sending USR2 to the supervisor script's pid:

kill -s USR2 <pid>

Additionally, the script will stop the unicorn it is supervising if the script itself is killed or exits.
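If the restart is triggered from a deploy script rather than by hand, the same signal can be sent from Ruby. This is only a sketch: it assumes the supervisor pid has been written to a pid file, which the bash script above does not do, and the helper name `trigger_live_restart` is hypothetical.

```ruby
# Hypothetical helper: read the supervisor pid from a file and send it USR2.
# Assumption: something else (for example the supervisor's run script)
# wrote the pid of the bash supervisor to pid_file.
def trigger_live_restart(pid_file)
  # Integer() raises ArgumentError if the file does not contain a number
  pid = Integer(File.read(pid_file).strip)
  # Same effect as `kill -s USR2 <pid>`
  Process.kill("USR2", pid)
  pid
end

# Usage (the path is an assumption):
#   trigger_live_restart("/var/run/unicorn_supervisor.pid")
```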

Added bonus, suicide channel

The script passes the supervisor pid to the unicorn process through the UNICORN_SUPERVISOR_PID environment variable. The unicorn master can then regularly check that its supervisor is still running, and terminate itself if for some reason somebody ran kill -9 on the supervisor script.

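# unicorn conf

# before_fork runs once per worker, so remember whether the watchdog
# thread has already been started in the master
initialized = false

before_fork do |server, worker|
  unless initialized
    initialized = true

    supervisor = ENV['UNICORN_SUPERVISOR_PID'].to_i
    if supervisor > 0
      Thread.new do
        while true
          # the supervisor's /proc entry disappears when it dies
          unless File.exist?("/proc/#{supervisor}")
            puts "Kill self supervisor is gone"
            Process.kill "TERM", Process.pid
          end
          sleep 2
        end
      end
    end
  end
end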
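The /proc check above is Linux specific. As a hedged alternative, here is a minimal sketch of the same liveness test using signal 0, which asks the kernel whether a process exists without actually delivering a signal. The helper name `process_alive?` is hypothetical; it is not part of unicorn or of the config above.

```ruby
# Hypothetical helper: returns true if a process with the given pid exists.
# Signal 0 performs the existence and permission checks only.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true  # the process exists but is owned by another user
end

# The current process is always alive from its own point of view
puts process_alive?(Process.pid)
```

One caveat: a zombie process that has not been reaped yet still counts as alive with this check.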

Results

The results of this method are quite fantastic, zero downtime during live restarts:

Total duration: 40 seconds
Total requests: 396

By status code: [200]x396 

Percentage of the successful requests served within a certain time (ms)
  25%		6
  50%		16
  66%		17
  75%		18
  80%		18
  90%		19
  95%		20
  98%		23
  99%		40
  100%		136

I will be rolling this into my Docker image for added robustness.

