
Do you smoke test?


I broke Discourse yesterday.

Not in the “it's just a little bit broken” sense. Instead, in the “absolutely everything is broken and every page returns a 500 error code” sense.

We already do so much to ensure Discourse always works, in true Eric Ries lean startup, Rails best practice way.

We have 1,800 specs that exercise a huge portion of the Ruby code.

We have 80 or so tests that exercise some of the JavaScript (we need more, lots more).

We constantly deploy to our test servers as soon as a commit happens. Only after this succeeds will we allow deploys to production.

Once a deploy to test happens we get a group chat message informing us it succeeded with a clear link to click (that takes us to our staging environment).

Nonetheless, somehow I managed to mess it all up and deploy a junk build.

What happened?

The Rails asset pipeline is a bit of a Rubik’s Cube. For example, if in your master layout you have <%= asset_path "my_asset" %> it may work fine in your dev and test environments. However, if you forget to set a magic switch to pre-compile the asset, well… everything breaks in production. In my case, I had it all mostly wired up and working in dev, but a missing .js extension meant that I just was not close enough.
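For reference, the kind of "magic switch" I am talking about looks something like this; the asset name here is only an example, not the actual file that bit me:

# config/environments/production.rb
# any asset referenced on its own (not bundled into application.js / application.css)
# must be listed here, or asset_path will 500 in production
config.assets.precompile += %w(my_asset.js)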

I clicked build, everything passed, and I forgot to check the test site.

This is my fault; it's 100% my fault.

Often when we hit these kinds of issues, as developers, we love assigning blame. Blame points my way … don’t do it again … move along, nothing more to see.

That is not good enough.

What is good enough?

I would like to follow a simple practice at Discourse. If you break production for any reason, we should make sure an automated system catches that kind of break next time along.

If you break production, the only thing you are allowed to work on should be the system that stops that kind of break in future.

What kind of system can avoid a totally broken build from getting out there?

The trivial thing to do is simply make an HTTP request to staging (our non-customer-facing production clone) and ensure it comes back with a 200 code. Trivial to add.
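As a sketch, the whole check could be a handful of lines of Ruby (the URL is a placeholder):

require "net/http"

res = Net::HTTP.get_response(URI("http://staging.example.com/"))
abort("Smoke test failed, got #{res.code}") unless res.code == "200"
puts "Staging is up"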

However, that is not really good enough.

I would like to know that the pages are all rendered properly; at least 3 key pages to start off with: the home page, a topic page and a user page.

What makes Discourse particularly tricky is that it is an Ember.js app. You only get to see the “real” page after a pile of JavaScript work happens. Simply downloading the content and testing it is not going to cut it.

Back in the old days we would use Selenium for these kinds of tests; trouble is, it's not really easy to automate and involves a fairly complex setup.

These days people mostly use PhantomJS, a headless WebKit browser.

Now, if you are planning on using PhantomJS I would strongly recommend leaning on a framework like CasperJS. It does a lot of the messy work for you. For my initial humble test I decided to write it all by hand, for a few reasons:

  • I wanted to know how the underlying APIs work.
  • I needed a bunch of special hacks to get it to test in a particular way, with special magic delays.
  • I did not want to bring another complex install process into the open source project.

I ended up with this test:

/*global phantom:true */

console.log('Starting Smoke Test');
var system = require('system');

if(system.args.length !== 2) {
  console.log("expecting phantomjs {smoke_test.js} {base_url}");
  phantom.exit(1);
}

var page = require('webpage').create();

page.waitFor = function(desc, fn, timeout, after) {
  var check,start;

  start = +new Date();
  check = function() {
    var r;

    try {
      r = page.evaluate(fn);
    } 
    catch(err) {
      // next time
    }

    var diff = (+new Date()) - start;

    if(r) {
      console.log("PASSED: " + desc + " " + diff + "ms" );
      after(true);
    } else {
      if(diff > timeout) {
        console.log("FAILED: " + desc + " " + diff + "ms");
        after(false);
      } else {
        setTimeout(check, 50);
      }
    }
  };

  check();
};


var actions = [];

var test = function(desc, fn) {
  actions.push({test: fn, desc: desc});
};

var navigate = function(desc, fn) {
  actions.push({navigate: fn, desc: desc});
};

var run = function(){
  var allPassed = true;
  var done = function() {
    if(allPassed) {
      console.log("ALL PASSED");
    } else {
      console.log("SMOKE TEST FAILED");
    }
    phantom.exit();
  };

  var performNextAction = function(){
    if(actions.length === 0) {
      done();
    }
    else{
      var action = actions[0];
      actions = actions.splice(1);
      if(action.test) {
        page.waitFor(action.desc, action.test, 10000, function(success){
          allPassed = allPassed && success;
          performNextAction();
        });
      } 
      else if(action.navigate) {
        console.log("NAVIGATE: " + action.desc);
        page.evaluate(action.navigate);
        performNextAction();
      }
    }
  };

  performNextAction();
};

page.runTests = function(){

  test("more than one topic shows up", function() {
    return jQuery('#topic-list tbody tr').length > 0;
  });

  test("expect a log in button", function(){
    return jQuery('.current-username .btn').text() === 'Log In';
  });

  navigate("navigate to first topic", function(){
    Em.run.later(function(){
      jQuery('.main-link a:first').click();
    }, 500);
  });

  test("at least one post body", function(){
    return jQuery('.topic-post').length > 0;
  });

  navigate("navigate to first user", function(){
    // for whatever reason the clicks do not respond at the beginning
    Em.run.later(function(){
      jQuery('.topic-meta-data a:first').focus().click();
    },500);
  });

  test("has about me section",function(){
    return jQuery('.about-me').length === 1;
  });

  run();
};

page.open(system.args[1], function (status) {
    console.log("Opened " + system.args[1]);
    page.runTests();
});

Now… after we deploy staging we run rake smoke:test URL=http://staging.server and get a result.
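The rake task itself is just a thin wrapper around PhantomJS. A minimal sketch, assuming the script lives at spec/phantom_js/smoke_test.js and prints ALL PASSED on success (both assumptions, not necessarily exactly what Discourse ships):

# lib/tasks/smoke_test.rake
desc "run the smoke test against a running site"
task "smoke:test" do
  url = ENV["URL"] || "http://localhost:3000"
  output = `phantomjs spec/phantom_js/smoke_test.js #{url}`
  puts output
  abort("Smoke test failed") unless output.include?("ALL PASSED")
end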

Amazingly, less than a day after I wrote it, it already caught another junk build.

This is a start; I imagine that in a few months we will have a much more extensive smoke test process.

That said, if you do not have any kind of smoke test process I would strongly recommend exploring PhantomJS. Getting something basic up is a matter of hours.


Flame graphs in Ruby MiniProfiler


Ruby 2.0 is just out the door. My favorite new feature in Ruby 2.0 is the new highly efficient mechanism for gathering stack traces.

In Ruby 1.9.3 I would struggle really hard to gather more than 5 stack traces per 100 ms of execution. In Ruby 2.0 I am able to gather a stack trace every millisecond using the brand new caller_locations API.

This increased efficiency makes it possible to write “sampling” profilers in pure Ruby and even run them, safely, in production.
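To give a feel for why this matters, here is a minimal sketch of a pure Ruby sampling profiler. It uses Thread#backtrace_locations, the cross-thread sibling of caller_locations that also landed in 2.0; the MD5 loop is just a stand-in workload:

require 'digest'

target  = Thread.current
samples = []

sampler = Thread.new do
  loop do
    stack = target.backtrace_locations
    samples << stack.map(&:to_s) if stack
    sleep 0.001 # roughly one sample per millisecond
  end
end

100_000.times { |i| Digest::MD5.hexdigest(i.to_s) } # stand-in workload
sampler.kill

# frames that show up in the most samples are the hot spots
counts = samples.flatten.group_by { |frame| frame }.map { |frame, list| [frame, list.size] }
counts.sort_by { |_, count| -count }.first(5).each { |frame, count| puts "#{count} #{frame}" }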

Putting it all together

While researching the brand new DTrace support in Ruby 2.0 I noticed a few people were mentioning this concept of “flame graphs”. The idea is quite simple. Lay all the sampled stack traces side by side, so you can visually identify “hot” areas where the code can be optimised.

Here is an example from Brendan Gregg’s blog.

Unfortunately, getting this to hang together in Ruby, using Ruby stack traces, was fairly impractical:

  • Rails stack traces are massively deep
  • The code was all written in Perl
  • There appears to be no API to gather Ruby call stacks from DTrace
  • Ruby DTrace probes do not even run on Linux; Ruby will not compile with DTrace enabled on Linux.

After I discovered stack traces are super fast in Ruby 2.0, making this all work became a reality. I quickly read up on the excellent d3js library and got going writing something similar, but tailored to Ruby and Rails.

I added a special option to MiniProfiler so you can generate flame graphs for your pages. Just add ?pp=flamegraph to the end of your URL and you will get a flame graph, even in production.

Demo and profiling session

Here is what happened when I ran the flame graph on the Discourse front page:

Interactive demo here

This report is informationally very dense; hovering over areas allows you to zero in on various problems. Let me describe a few things I can glean from this graph.

The report is zoomable, just like Google Maps; zoom in enough and you start seeing method names.

Creating slugs is slow

12.75% of all the samples we collected have the preceding call in the stack trace.

The code is quite short.

def slug
   Slug.for(title).presence || "topic"
end

This slug method is triggering the transliterate method on the inflector; method calls there are not cheap. If we really want, we can get an exact cost by wrapping the code with a counter.

def slug
   Rack::MiniProfiler.counter("slug") do
     Slug.for(title).presence || "topic"
   end
end

To resolve this we can either improve the performance of the underlying transliterator, cache slugs in the DB, or cache lookups in memory.
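The simplest of those options, caching the lookup in memory, could look something like this (a sketch only, not the fix we ended up shipping; it would also need invalidation if the title changes):

def slug
  # memoize the expensive transliteration per topic instance
  @slug ||= (Slug.for(title).presence || "topic")
end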

The inflector overall is slow

The preceding image describes a fairly extensive area of the execution, where active model serializers are turning in-memory data into json. The surprising fact there is that a large amount of time is spent pluralizing.

We can again add MiniProfiler instrumentation to see the actual impact:

8% of our time is spent pluralizing stuff; there are a ton of things we can do to improve this:

  • A big culprit is the include! method in AM::Serializers, we can easily improve that
  • We could look at adding an internal cache inside the pluralizer.

Since we are good citizens and AM::Serializers is getting a lot of attention at the moment, let's patch it up.
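As an illustration of the "internal cache" idea, a naive memoization wrapper over the inflector might look like this; purely a sketch, and a real patch would have to cope with inflection rules changing at runtime:

# e.g. in an initializer
module CachedPluralization
  CACHE = {}

  def pluralize(*args)
    # identical (word, locale) pairs always pluralize the same way
    CACHE[args] ||= super
  end
end

class << ActiveSupport::Inflector
  prepend CachedPluralization
end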

A fair amount of time is spent translating

We can notice that 6% of the samples happen to originate from the localization gem. The majority originates from fallback logic. It probably makes sense to add a cache on top of translations; lookups really should be a no-op.

Object initialization is slow

I noticed this trace once on this page, but repeatedly on our topic page:

The code in question is:

def add_trust_level
  # there is a possibility we did not load the trust_level column, skip it
  return unless attributes.key? :trust_level
  self.trust_level ||= SiteSetting.default_trust_level
end

Funny thing is, I wrote this code to correct a more serious perf issue. Originally this code was raising exceptions and handling them. In Rails we have massive call stacks; exceptions during normal operation are a huge pitfall.

However, it appears that attributes is being materialized during this call. A very expensive operation.

Our trivial fix is:

def add_trust_level
  # there is a possibility we did not load the trust_level column, skip it
  return unless has_attribute? :trust_level
  self.trust_level ||= SiteSetting.default_trust_level
end

Call stacks are really deep

I love Rack. However, one side effect of using it is that call stacks get really, really deep. In fact, we are 90 or so frames deep by the time we even start executing our application code. This makes gathering stack traces slower, makes exceptions slower and complicates execution.

20 or so frames are spent inside omniauth; this may indicate it makes sense to have some sort of router inside omniauth for dispatch, as opposed to the layered Rack approach. We only have a handful of strategies in place now.

Don’t forget that the GC is totally invisible to this particular sampling profiler, at the moment.

We have found that the Ruby GC, untuned, is a huge performance problem with Discourse. I discussed this at length here: http://meta.discourse.org/t/tuning-ruby-and-rails-for-discourse/4126 . With MiniProfiler you can easily just add ?pp=profile-gc-times to the end of any request. If you are noticing multiple GCs for a single request, odds are your GC could do with a fair amount of tuning.

Future work may be integrating the GC execution times into this graph.

How you can diagnose issues

The flame graph tool is very powerful, but also can present you with information overload. My general recommendations would be:

  • Look for repeated visual patterns, stuff like the inflector above. If a pattern repeats over and over, it could indicate a library is inefficient or a call is doing too much work.

  • Look for areas you expected to be short that instead span multiple frames. For example, if you notice a user.fullname call is taking 20 frames, it's possible there is a very inefficient algorithm generating this fullname data.

  • Are there any gems you are spending way more time in, than you expected?

I love this, how can I help?

The code for the flame graph was written yesterday; it is my first foray into d3js and SVG, and I am sure there are many ways to improve and optimise it. MiniProfiler is open source, we would love your help improving this. All the code in the demo here is already checked in.

It is also possible that this technique can be used on other platforms. Maybe there is some ninja technique to allow JavaScript to grab stack traces for the running thread. I bet the code could be shared with the recent Google Go port of MiniProfiler. Unfortunately, I don’t think you can gather stack traces from existing running threads in .NET short of attaching with ICorProfiler or ICorDebugger.

In Summary

I only touched the surface here; much is left to be done, and the rendering JS engine needs a lot of tidying up and refactoring. As always, pull requests are more than welcome.

Ruby has a fair number of options when it comes to profiling: the excellent perftools.rb, the long-running ruby-prof project, and the for-pay New Relic.

MiniProfiler is now adding yet another elegant way to make sense of Ruby execution, in production. Be sure to try it out.

Eliminating my trivial inconveniences building Discourse


Many developers spend a fair amount of the day waiting:

waiting place

  • Waiting for a build to end
  • Waiting for search results to come up
  • Waiting for an editor to launch
  • Waiting for tests to run
  • Waiting for a boss to give instructions

There is also a more insidious form of waiting. The zero value work you tend to do repetitively, also known as trivial inconveniences.

These issues don’t stop you from working, they just make your job slightly more annoying.

For example:

  • Every time you hit the “refresh” button to see how your CSS change modified the site you are working on.
  • Every time you manually run your test suite.
  • Every time you “compile” your project and wait for it to finish.

Computers could do these tasks automatically; instead, like trained monkeys, we often repeat these pointless tasks tens and hundreds of times daily.

While developing Discourse I pay very special attention to waiting and trivial inconveniences. I work tirelessly to clear friction. In fact, I will spend days to eliminate a particularly bothersome bottleneck.

I use Vim

Vim is a tough editor to pick up, especially late in your career. It is notoriously difficult to master.

Fortunately, people getting started today with Vim have some amazing resources such as Practical Vim by Drew Neil, Vimcasts and VimGolf that make the process less painful.

Vim helps me eliminate many trivial inconveniences.

  • It launches instantly, I am never waiting for my editor to start up.
  • I use a fair amount of Vim plugins. Syntastic, rails.vim, NERDTree, surround.vim, tcomment and many others. The plugins I use help meld Vim so it works really well with my workflow.
  • I have some Discourse specific bindings.
    • I bind CTRL-A to touch a restart file, so I can bounce my web server in a keystroke if needed.
    • I bind CTRL-S to dump a message in the Discourse messaging bus that forces all web browsers pointing at local Discourse to refresh.
    • I have a variety of navigation hot keys to cut through the JavaScript files, :Rdmodel t<tab>… takes me to the topic js model.

I am not afraid to switch tools

I have used Ack for a very long time to search through code files; recently I discovered the Silver Searcher. The Silver Searcher is faster than Ack, and to me this makes all the difference. To quote Geoff Greer, the author:

I can already hear someone saying, “Big deal. It’s only a second faster. What does one second matter when searching an entire codebase?” My reply: trivial inconveniences matter.

I am not afraid to spend money on software

I have a Windows 8 desktop (with 2 monitors) and a MacBook Pro Retina on my desk. Initially, I used to reach over to the laptop to type in commands in order to test stuff out. Reaching over to the laptop is not a mammoth effort, but it is very inconvenient. Due to this trivial inconvenience I rarely used my laptop during the day.

I downloaded Synergy and tried sharing my keyboard with my Mac laptop. It worked… sometimes. It crashed and failed a lot and was a monster to configure. Many developers would stop here: free does not work, so go shopping for something else free. I disagree with this attitude.

There are a couple of paid alternatives. ShareMouse worked very well for me. In a blink I spent the 50 or so bucks to eliminate these inconveniences. Why save 50 bucks just to have a $2000 laptop sit on your desk collecting dust?

Live CSS refresh

Whenever I save a CSS file my browsers automatically refresh with the amended visual style. This is accomplished with a very Spartan amount of code.

In our Guardfile:

module ::Guard
  class AutoReload < ::Guard::Guard

    require File.dirname(__FILE__) + '/config/environment'
    def run_on_change(paths)
      paths.map! do |p|
        hash = nil
        fullpath = Rails.root.to_s + "/" + p
        hash = Digest::MD5.hexdigest(File.read(fullpath)) if File.exists? fullpath
        p = p.sub /\.sass\.erb/, ""
        p = p.sub /\.sass/, ""
        p = p.sub /\.scss/, ""
        p = p.sub /^app\/assets\/stylesheets/, "assets"
        {name: p, hash: hash}
      end
      # target dev
      MessageBus::Instance.new.publish "/file-change", paths
    end

    def run_all
    end
  end
end

guard :autoreload do
  watch(/tmp\/refresh_browser/)
  watch(/\.css$/)
  watch(/\.sass$/)
  watch(/\.scss$/)
  watch(/\.sass\.erb$/)
  watch(/\.handlebars$/)
end

In our development JS file it does something like:

return Discourse.MessageBus.subscribe("/file-change", function(data) {
  return data.each(function(me) {
    var js;
    if (me === "refresh") {
      return document.location.reload(true);
    } else {
      return $('link').each(function() {
        if (this.href.match(me.name) && me.hash) {
          if (!$(this).data('orig')) {
            $(this).data('orig', this.href);
          }
          this.href = $(this).data('orig') + "&hash=" + me.hash;
        }
      });
    }
  });
});


There are plenty of solutions out there that achieve the same goal, there is LiveReload. The folks at Live Reload also documented some other options.

I chose to roll our own solution cause I wanted better control. Our code knows how to reload Ember templates and will grow to do more sophisticated things.

I spend time making the development environment fast

Discourse has a very large number of JavaScript and CSS files. This means that during development the web server has to serve out over 370 assets to the clients. This is certainly an edge case for Rails. However, as client side frameworks like Ember and Angular become more popular this is going to be a problem more people are going to face.

I committed a fix to Discourse that cuts the time it takes to refresh a page from 4.5 seconds down to 1 second in development. I also opened an issue to help decide if we want to include the optimisation in Rails or not. Unlike more common attempts, that bundle up all assets in dev to “work around” the problem, this fix is transparent and does not compromise debugging. In the future, when source maps are more pervasive, hacks like this may not be needed. I am a pragmatist though; this solves our problem now. Why wait?

I also committed a fix that dramatically improved sprockets (the Rails Asset Pipeline) in dev mode.

I have an effective Rails development environment

I have a reasonably fast workstation, though I am due for a Haswell upgrade next month. I have been running dedicated SSDs for development for years now. I have a multi-monitor setup; the primary one is a 30 inch Dell.

my desktop

Most Rails developers out there are probably using Ruby 1.9.3 (or 1.8.7), untuned, to work locally. If I start a Rails console for Discourse on an untuned stack it takes 16 seconds to boot up. I run Ruby 2.0 in development with the following environment vars:

export RUBY_GC_MALLOC_LIMIT=1000000000
export RUBY_HEAP_SLOTS_GROWTH_FACTOR=1.25
export RUBY_HEAP_MIN_SLOTS=800000
export RUBY_FREE_MIN=600000
export LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so

This reduces Discourse startup time to just under 4 seconds. That is a 4x improvement just by tuning my dev machine. Additionally, I have experimented with Zeus and Spring (Discourse is pre-configured to work with Spring) with varying degrees of success. When these forking apps work you get 1 second startups.

We use Better Errors, which works many times better than the default Rails error page, and MiniProfiler, which allows me to quickly debug through issues.

I really hate waiting for tests

When I started working with Discourse we had a lightning fast test suite. I could easily work on an area of code and get immediate feedback about my specs.

Fast forward a few months: we now have a test suite that takes 3 minutes to run, and that is after you pull some ninja moves to get it that fast.

I agree it takes too long, I would like to refactor and speed it up. However, this is not the point of this section.

There are currently 2 common spec runners used by the Ruby community: autotest and Guard. Guard is more modern and actively developed.

Trouble is both runners are very troublesome for my workflow.

Both runners insist on running every single spec in the queue before obliging and running the spec you want them to run. So, my typical workflow used to be:

  1. Launch bundle exec guard
  2. Write my broken test
  3. Save
  4. Wait up to 3 minutes for the old run to finish so I can see it really failed
  5. Fix it
  6. Save
  7. See it passed
  8. GOTO (2)

So, a lot of waiting. But it's critical I run the entire test suite often, cause many of our tests unfortunately integrate.

To resolve my large amount of pain I took the time to write a test runner that works really well for me. https://github.com/discourse/discourse/blob/master/lib/autospec/runner.rb .

  • It will interrupt the current testing when it detects I want to run a spec (cause I saved a file)
  • It will resume the old testing after it finishes running the spec I wanted it to run.
  • It will automatically focus on a broken test (only run a single test), something that allows me to sprinkle puts statements around my code to quickly inspect state for the broken spec.
  • It allows me to switch focus to another spec (by saving a spec) even in a failed state.
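To make the idea concrete, here is a highly simplified sketch of the core loop; this is illustrative only, not the actual autospec code, and the spec paths are assumptions:

queue       = ["spec"]   # start by queueing the full suite
current_pid = nil
mtimes      = {}

# returns the first spec file that changed since we last looked
changed_spec = lambda do
  Dir["spec/**/*_spec.rb"].find do |f|
    m, old = File.mtime(f), mtimes[f]
    mtimes[f] = m
    old && m > old
  end
end

loop do
  if (spec = changed_spec.call)
    # interrupt whatever is running and focus on the spec that just changed
    Process.kill("TERM", current_pid) if current_pid
    queue.unshift(spec)
  end

  # reap a finished (or interrupted) run
  if current_pid && Process.waitpid(current_pid, Process::WNOHANG)
    current_pid = nil
    queue.push("spec") if queue.empty? # resume the full suite afterwards
  end

  if current_pid.nil? && (target = queue.shift)
    current_pid = Process.spawn("bundle", "exec", "rspec", target)
  end

  sleep 0.5
end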

For me, having a test runner that operates as I would expect it to has completely transformed the way I work. I no longer approach testing as a drudge, it is fun again.


Eliminating trivial inconveniences makes me a happier developer. Go ahead, find one inconvenience you have and eliminate it. You will not regret it.

Discourse as my blogging platform


This blog has gone through several incarnations. When I started blogging I used Subtext. Then I decided it was time to reinvent wheels and rolled my own, hodge-podge, blogging solution. It worked OK, but I found I had extreme difficulty engaging with the community. People would leave replies and I had no idea if they ever got my response. Furthermore, creating content was becoming a huge pain.

I ended up using the Discourse editor for some posts and then pasting the body of the post here. Crazytown.

So I had this idea: what if I could have Discourse turtles all the way down?

I took some time to extend our plugin story and, using the magic of Rails engines, managed to host my blog inside Discourse.

You can see the full source code of this blog as a Discourse plugin at: https://github.com/SamSaffron/blog

I am hosted on Digital Ocean (a 2GB instance).

I am also experimenting with some less traditional Discourse configs:

  • I am running Rails 4
  • I am running unicorn, with Rack Hijack to handle long polling
  • I am running no bluepill; I tricked unicorn into daemonizing Sidekiq and use an upstart job. Let's see how it goes.

All of this and more deserves a much more detailed blog post.

Before any of this, let me address the elephant in the room. Commenting on this blog just became way more complicated. I plan to address that in a future update; this is a temporary state. I want to bring back the traditional commenting box.

That said, I have email integration set up, so now when I respond to you, you will get an email you can respond to. You will not even need to visit this site to spam my blog. :)

Overall, I am super excited about this. It pushes the limits of what can be done with Discourse and leaves breadcrumbs for others to follow.

More to come, now that blogging became so much easier.

Discourse in a Docker container


Deploying Rails applications is something we all struggle with:

You would think that years on we would have made some progress on this front, but no: deploying Rails is almost as complicated as it was back in the mongrel days.

Yes, we have Passenger; however, getting it installed and working with rvm/rbenv is a bit of a black art, and let us not mention daemonizing Sidekiq or Resque. Or, god forbid, configuring PostgreSQL.

This is why we often outsource the task to application as a service providers.

Last week I decided to spend some time experimenting with Docker

What is Docker?

an open source project to pack, ship and run any application as a lightweight container

The concept is that developers and sysadmins can author simple images, usually using Dockerfiles, that provide a pristine state encapsulating an application. It uses all sorts of trickery to make authoring these images a painless experience, and there is a central repo where users can share images.

Think of it as a VM without the performance penalty of having a VM. Docker containers run in the same kernel as the host unvirtualized.

When a user launches a "container" a private unique IP is provisioned and the process runs isolated. Docker will launch a single process inside the container, however that process may spawn others.

Docker (today: version 0.6.5) is a front end that drives Linux LXC containers and uses a copy-on-write storage engine built on AUFS. It is the "glue" that gives you a simple API to deal with containers and optionally run them in the background, persistently.

Docker is built in golang, and has a very active community.

Restrictions

Docker version 0.6.5 is still not deemed "production ready", the technologies it wraps are considered production ready, however the APIs are changing rapidly with some radical changes to come later on.

There are plans to extract the AUFS support and probably use lvm thin provisioning as the preferred storage backend.

As it stands, the only OS the Docker team recommends running Docker on is Ubuntu LTS 12.04.3 (note: LTS ships with multiple kernels, you need 3.8 at least). I have had luck with Ubuntu 13.04; however, 13.10 does not work with Docker today (since it ships with an incompatible, alpha version of lxc). Additionally you should be aware of a networking issue in VMs that affects 3.8.

The AUFS dependency is the main reason for this tough restriction; however, I feel confident this is going away, since Red Hat are banking on it.

Security

It is very important to read through the LXC security document. Depending on your version of LXC, the root user inside a container may have global root privileges. This may not matter to you, or it may be very critical to you, depending on your application / usage.

Additionally, file mounts are a mess: if you mount a location external to the Docker container using the -v option for docker run, permissions are all a bit crazy. UIDs inside Docker do not match UIDs outside of it, so for example:

View from the outside

View from inside the container.

There are plans to mitigate this problem. It can be worked around with NFS shares, avoiding mounts or synchronizing users and groups between containers and host.

The 42 layer problem in AUFS

AUFS only supports 42 layers. It may seem like a lot, but you hit it very early when building complex images. Dockerfiles make it very easy to reuse work when building images. For example, say I am building an image and decide to add "one more thing". When I add a new RUN command, Docker is smart enough to re-use all my previous work, so building the image is snappy. As a result many Dockerfiles contain lots and lots of RUN commands.

To circumvent this issue our base image is built as a single layer. When I am experimenting with changes I add them at the end of the file, eventually rolling them into the big shell command.

Gotchas developing with Docker

When developing with Docker it is quite easy to accumulate a pile of images you never use, and containers that have long ago stopped and are disposable. It is fairly important to stay vigilant and keep cleaning up. Any complex Docker environment is going to need a very clean process for eliminating unnecessary containers and images.

While developing I found myself running the following quite a lot:

docker rm `docker ps -a  | grep Exit | awk '{ print $1 }'`

remove all containers that exited

This blog is running on Docker

There has been a previous attempt to run Discourse under Docker by srid. However I wanted to take a fresh look at the problem and in a "trial-by-fire" come up with a design that worked for me.

Note, this setup is clearly not something we will be supporting externally or would like made official quite yet; however, it has an enormous amount of appeal and potential. After working through a Docker Discourse setup with our awesome sysadmin supermathie he described it as "20% of the work" he usually does.

This is how you would work through it

  • Install Ubuntu 12.04.03 LTS
  • sudo apt-get install git
  • git clone https://github.com/SamSaffron/discourse_docker.git
  • cd discourse_docker, run ./launcher for instructions on how to install docker
  • Install docker
  • Build the base image: sudo docker build image (note the hash of the image)
  • Tag the image: sudo docker tag [hash of image] samsaffron/discourse
  • Modify the base template to suit your needs (standalone.yml.sample):

# this is the base template, you should not change it
template: "standalone.template.yml"
# which ports to expose?
expose:
  - "80:80"
  - "2222:22"

params:
  # ssh key so you can log in
  ssh_key: YOUR_SSH_KEY
  # git revision to run
  version: HEAD


  # host name, required by Discourse
  database_yml:
    production:
      host_names:
        # your domain name
        - www.example.com


# needed for bootstrapping, lowercase email
env:
  DEVELOPER_EMAILS: 'my_email@email.com'
  • Save it as say, web.yaml
  • Run sudo ./launcher bootstrap web to create an image for your site
  • Run sudo ./launcher start web to start the site

At this point you will have a Discourse site up and running with sshd / nginx / postgresql / redis / unicorn running in a single container, with runit ensuring all the processes keep running (though I still need to build in the monitoring bits).

At no point during this setup did you have to pick the redis and postgres version, or mess around with nginx config files. It was all scripted in a completely reproducible fashion.

This solution is 100% transparent and hackable for other purposes

The launcher shell script has no logic regarding Discourse built in. Nor does pups, the yaml based image bootstrapper inspired by ansible. You can go ahead and adapt this solution to your own purposes and extend as you see fit.

I took it on myself to create the most complex of setup first, however this can easily be adapted to run separate applications per container using the single base image. You may prefer to run PostgreSQL and Redis in a single container and the web in another, for example. The base image has all the programs needed, copy-on-write makes storage cheap.

I elected to keep all persistent data outside of the container, that way I can always throw away a container and start again from scratch, easily.

The importance of the sshd backdoor into the container

During my work with Docker I really wanted to be able to quickly log on to a container and mess about a bit. I am not alone.

A common technique to allow users direct access into a system container is to run a separate sshd inside the container. Users then connect to that sshd directly. In this way, you can treat the container just like you treat a full virtual machine where you grant external access. If you give the container a routable address, then users can reach it without using ssh tunneling.

One process per container

Docker will only launch a single process per container; it is your responsibility to launch any other processes you need and take care of monitoring. This is why I picked runit as the ideal process supervisor for this task:

compare that to the 105000 VSZ and 18700 RSS memory bluepill takes

VSZ and RSS numbers this low are probably very foreign to today's programmers; this is perfect for this task and makes orchestrating a container internally very simple. It takes care of dependencies so, for example, unicorn will not launch until Postgres and Redis are running.

The upgrade problem

Docker opens a bunch of new options when it comes to application upgrades. For example, you can bootstrap a new container with a new version, stop your old container and start the new one.

You can also enable seamless upgrades on a single machine using 4 containers: a db container, an haproxy container and 2 web containers. Just notify haproxy a web container is going down, pull it out of rotation, upgrade that container and push it back into rotation.

Since we are running sshd in each container we can still use the traditional mechanisms of upgrade as well.

In more "enterprisey" setups you can run your own Docker registry, that way your CI machine can prep the images and the deploy process simply pulls the image on each box shuts down old containers and starts new ones. Distributing images is far more efficient and predictable than copying thousands of file with rsync each time you deploy.

Why yet another ansible?

While working on the process I came up with my own DSL for bootstrapping my Discourse images. I purpose-built it so it solves the main issues I was hitting with a simple shell script: multiline replace is hard in awk and grep, the syntax is scary to some, and merging YAML files is not something you could really do that easily in a shell script.

pups makes these problems quite easy to solve

run:
  - replace:
      filename: "/etc/nginx/conf.d/discourse.conf"
      from: /upstream[^\}]+\}/m
      to: "upstream discourse {
        server 127.0.0.1:3000;
      }"

multiline regex replace for an nginx conf file

The DSL and tool live here: https://github.com/samsaffron/pups. Feel free to use it where you need. I picked it over ansible cause I wanted an exact fit for my problem.

The initial image was simple enough to fit in a Dockerfile; however, the process of bootstrapping Discourse is complex. You need to spin up background processes, do lots of fancy replacing and so on. You can see the template I am using for this site here: https://github.com/SamSaffron/discourse_docker/blob/master/standalone.template.yml

The future

I see a very bright future for Docker; a huge ecosystem is forming, with new Docker-based applications launching monthly. For example, CoreOS, Deis and others are building businesses on top of Docker. OpenStack Havana supports Docker out-of-the-box.

Many of the issues I have raised in this post are being actively resolved. Docker is far more than a pretty front end on the decade old BSD jail concept. It is attempting to provide a standard we can all use, in dev and production regardless of the OS we are running, allowing us to set up environments quickly and cleanly.

Live restarts of a supervised unicorn process


We have all seen this dreaded screen before.

In the Rails case this usually happens during application restarts.

While Discourse is rapidly evolving we are heavily encouraging users to upgrade frequently, even weekly. If your site is regularly erroring out, users very quickly lose confidence. In the ideal case you want zero downtime deploys. This feature heavily encourages users to deploy more rapidly.

Unicorn has built-in support for live restarts; however, getting this to play well with a supervisor like, say, runit is not easy. Underlying pids change and stuff gets complicated fast.

To tackle this I decided to create a simple bash script that acts as a mini-supervisor for unicorn.

However, before any of this I needed some sane way of measuring how well I did.

Measuring uptime during a live restart

Traditionally you would use apache bench for quick and dirty testing; however, it did not fare well for me. Unfortunately ab has no way of "throttling" the number of requests it sends out. To measure uptime we need to perform a request to the site every N milliseconds.

I ended up knocking up a quick and dirty apache bench clone that allows me to trickle through requests:

require "optparse"
require "uri"
require "net/http"

duration = 10
per_second = 10

opts = OptionParser.new do |opts|
  opts.banner = "Usage: bench_web [options] url"

  opts.on("-t", "--time TIME", OptionParser::DecimalInteger, "Duration to run the test in seconds (default 10)") do |t|
    duration = t
  end

  opts.on("-p", "--per-second REQUESTS", OptionParser::DecimalInteger, "Max number of requests per second (default 10)") do |t|
    per_second = t.to_f
  end

end

opts.parse!

if ARGV.length != 1
 puts opts.banner
 puts
 exit(1)
end


uri = begin
        URI(ARGV[0])
      rescue
        puts opts.banner
        puts
        puts "Invalid URL"
        puts
        exit(1)
      end

GC.disable

finish_time = Time.now + duration
results = []
while (start=Time.now) < finish_time
  res = Net::HTTP.get_response(uri)
  req_duration = Time.now - start
  results << {duration: req_duration, code: res.code, length: res.body.length}

  GC.enable
  GC.start
  GC.disable

  padding = (1 / per_second.to_f) - (Time.now - start)
  if padding > 0
    sleep padding
  end
  putc "."
end

GC.enable

puts
puts "Results"
puts "Total duration: #{duration} second#{duration==1?"":"s"}"
puts "Total requests: #{results.length}"

summary = results.group_by{|r| r[:code]}.map{|code, array| [code, array.count]}.sort{|a,b| a[1] <=> b[1]}

failures = summary.map{|code, count| code == "200" ? 0 : count}.inject(:+)

if failures > 0
 puts "Estimated downtime: #{((failures.to_f * (1.to_f / per_second)) * 1000).to_i}ms"
end

puts
puts "By status code: #{summary.map{|code,count| "[#{code}]x#{count} "}.join}"

puts ""

puts "Percentage of the successful requests served within a certain time (ms)"

good_requests = results.find_all{|r| r[:code] == "200"}.map{|r| r[:duration]}.sort

if good_requests.length > 0
  [25,50,66,75,80,90,95,98,99,100].map{ |percentile|
    time = good_requests[((percentile.to_f / 100.0) * (good_requests.length-1)).to_i]
    puts "  #{percentile}%\t\t#{(time * 1000).to_i}"
  }
end

For example, if I run it against a site that is restarting without any fancy help I can see:

$ ruby ./bench_web.rb -t 10 http://l.discourse/
Total duration: 10 seconds
Total requests: 82
Estimated downtime: 3700ms

By status code: [502]x37 [200]x45 

Percentage of the successful requests served within a certain time (ms)
  25%		16
  50%		18
  66%		19
  75%		20
  80%		20
  90%		21
  95%		21
  98%		58
  99%		58
  100%		1854

Not too good, that is 3.7 seconds of downtime while flipping this process.

Supervising unicorns

The standard way to do live restarts with unicorn (assuming you are preloading the app) is to send a USR2 signal to the master process, wait for it to launch a new master and then send a TERM to the old master. However, this plays really badly with supervisors that need pids not to change.
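Spelled out, the manual sequence looks roughly like this (the pid file path is an assumption, adjust it to your unicorn config):

old_master = File.read("tmp/pids/unicorn.pid").to_i

Process.kill("USR2", old_master) # ask the old master to exec a new master
sleep 10                         # give the new master time to boot and fork workers
Process.kill("TERM", old_master) # gracefully shut down the old master and its workers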

To work around this I created a simple bash file that acts as a proxy. It has a stable pid and takes care of signalling and restarting the unicorn it is running. Send it a USR2 and it will initiate the process.

#!/bin/bash

# This is a helper script you can use to supervise unicorn, it allows you to perform a live restart
# by sending it a USR2 signal

LOCAL_WEB="http://127.0.0.1:3000/"

function on_exit()
{
  kill $UNICORN_PID
  echo "exiting"
}

function on_reload()
{
  echo "Reloading unicorn"
  kill -s USR2 $UNICORN_PID
  sleep 10
  curl $LOCAL_WEB &> /dev/null
  NEW_UNICORN_PID=`ps -f --ppid $UNICORN_PID | grep unicorn | grep -v worker | awk '{ print $2 }'`
  kill $UNICORN_PID
  echo "Old pid is: $UNICORN_PID New pid is: $NEW_UNICORN_PID"
  UNICORN_PID=$NEW_UNICORN_PID
}

export UNICORN_SUPERVISOR_PID=$$

trap on_exit EXIT
trap on_reload USR2

unicorn -c $1 &
UNICORN_PID=$!

echo "supervisor pid: $UNICORN_SUPERVISOR_PID unicorn pid: $UNICORN_PID"

while [ -e /proc/$UNICORN_PID ]
do
  sleep 0.1
done

Then I can run the following at any point in time to perform a coordinated live restart:

kill -s USR2 <pid>

Additionally, the script will stop the unicorn it is supervising if it is killed or exits.

Added bonus, suicide channel

The script passes the supervisor pid to the unicorn process. With that, the unicorn master can regularly check that its supervisor is running and terminate itself if, for some reason, somebody ran kill -9 on the supervisor script.

#unicorn conf
initialized = false # flag so we only start the watchdog thread once

before_fork do |server, worker|

  unless initialized

    initialized = true

    supervisor = ENV['UNICORN_SUPERVISOR_PID'].to_i
    if supervisor > 0
      Thread.new do
        while true
          unless File.exists?("/proc/#{supervisor}")
            puts "Kill self supervisor is gone"
            Process.kill "TERM", Process.pid
          end
          sleep 2
        end
      end
    end

  end
end

Results

The results of this method are quite fantastic, zero downtime during live restarts:

Total duration: 40 seconds
Total requests: 396

By status code: [200]x396 

Percentage of the successful requests served within a certain time (ms)
  25%		6
  50%		16
  66%		17
  75%		18
  80%		18
  90%		19
  95%		20
  98%		23
  99%		40
  100%		136

I will be rolling this into my Docker image, for added robustness.

Demystifying the Ruby GC


This article is about the Ruby GC. In particular it is about the GC present in Ruby MRI 2.0.

The Ruby GC has been through quite a few iterations: in 1.9.3 we were introduced to the lazy sweeping algorithm, and in 2.0 we were introduced to bitmap marking. Ruby 2.1 is going to introduce many more concepts and is out of scope for this post.

Heaps of heaps

MRI (Matz's Ruby Interpreter) stores objects, aka RVALUEs, in heaps; each heap is approx 16KB. RVALUE structs consume different amounts of memory depending on the machine architecture. On x64 machines they consume 40 bytes, on x32 machines they consume 20 to 24 bytes depending on the sub-architecture (some optimizations shave off a few extra bytes on, say, cygwin using magic pragmas).

An RVALUE is a magical C struct that is a union of various "low level" C representations of Ruby objects. For example, in MRI, an RVALUE can be accessed as an RRegexp or an RString or an RObject and so on. I strongly recommend the excellent Ruby Under a Microscope to get a handle on this, GC algorithms and MRI in general.

Given this, in each heap on an x64 machine we are able to store about 409 Ruby objects, give or take a few for heap alignment and headers.

[1] pry(main)> require 'objspace'
=> true
[2] pry(main)> ObjectSpace.count_objects[:TOTAL] / GC.stat[:heap_used]
=> 406

A typical Rails application (like, say, Discourse) will have about 400 thousand objects in 1100 or so heaps (heaps can get fragmented with empty space); we can see this by running:

$ RAILS_ENV=production rails c
> GC.start
> GC.stat
=> {:count=>102, :heap_used=>1160, :heap_length=>1648, :heap_increment=>488, :heap_live_num=>369669, :heap_free_num=>102447, :heap_final_num=>0, :total_allocated_object=>3365152, :total_freed_object=>2995483}

About GC.stat in Ruby 2.0

GC.stat is a goldmine of information; it is the first place you should go before doing any GC tuning. Here is an overview of what the fields mean. Unfortunately this is not well documented, and some attempts at documenting it are not that accurate, so here is my go at it after reading the GC source:

count: the number of times a GC ran (both full GC and lazy sweep are included)

heap_used: the number of heaps that have more than 0 slots used in them. The larger this number, the slower your GC will be.

heap_length: the total number of heaps allocated in memory. For example 1648 means - about 25.75MB is allocated to Ruby heaps. (1648 * (2 << 13)).to_f / (2 << 19)

heap_increment: The number of extra heaps to be allocated next time Ruby grows the number of heaps (as it does after it runs a GC and discovers it does not have enough free space). This number is updated each GC run to be 1.8 * heap_used. In later versions of Ruby this multiplier is configurable.

heap_live_num: The running number of objects in Ruby heaps; it will change every time you call GC.stat.

heap_free_num: This is a slightly confusing number. It changes after a GC runs and will let you know how many slots were free in the heaps after the GC finished running. So, in this example we had 102447 slots empty after the last GC. (It also increases when objects are recycled internally, which can happen between GCs.)

heap_final_num: The count of objects that were not finalized during the last GC.

total_allocated_object: The running total of allocated objects from the beginning of the process. This number will change every time you allocate objects. Note: in a corner case this value may overflow.

total_freed_object: The number of objects that were freed by the GC from the beginning of the process.

When will the GC run

The GC in Ruby 2.0 comes in 2 different flavors. We have a "full" GC that runs after we allocate more than our malloc_limit and a lazy sweep (partial GC) that will run if we ever run out of free slots in our heaps.

The lazy sweep takes less time than a full GC; however, it only performs a partial GC. Its goal is to perform a short GC more frequently, thus increasing overall throughput. The world stops, but for less time.

The malloc_limit is set to 8MB out of the box; you can raise it by setting RUBY_GC_MALLOC_LIMIT higher.

Why a malloc limit?

Discourse at boot only takes up 25MB of heap space; however, when we look at the RSS for the process we can see it is consuming way more, over 134MB. Where is all this extra memory?

sam@ubuntu:~/Source/discourse$ RAILS_ENV=production rails c
irb(main):008:0> `ps -o rss= -p #{Process.pid}`.to_i
=> 134036
irb(main):009:0> (GC.stat[:heap_length] * (2 << 13)).to_f / (2 << 19)
=> 26.15625

The Ruby heaps store RVALUE objects, which can store at most 40 bytes. For Strings, Arrays and Hashes this means that small objects can fit in the heap, but as soon as they reach a threshold, Ruby will malloc extra memory outside of the Ruby heaps. We can see an example here:

sam@ubuntu:~/Source/discourse$ irb
irb(main):001:0> require 'objspace'
=> true
irb(main):002:0> ObjectSpace.memsize_of("a")
=> 0
irb(main):005:0> ObjectSpace.memsize_of("a"*23)
=> 0
irb(main):006:0> ObjectSpace.memsize_of("a"*24)
=> 24
# peace comes at a high cost
irb(main):017:0> ObjectSpace.memsize_of("☮"*8)
=> 24

Turns out that for Rails apps the vast majority of the RSS consumption is not the Ruby heaps but the information attached to objects, allocated outside of the Ruby heap, plus general memory fragmentation.

$ RAILS_ENV=production rails c
irb(main):005:0> size=0; ObjectSpace.each_object{|o| size += ObjectSpace.memsize_of(o) }; puts size/1024
67265

This fact puts a bit of a damper on the GC bitmap marking algorithm introduced in Ruby 2.0. For a large Rails app, at best, it is optimising reuse of 20% or so; furthermore, this 20% can get fragmented, which makes stuff worse.

We can explore the default malloc limit (it is 8MB out of the box). If we allocate 8 objects that are 1MB each we can trigger a GC:

$ irb
irb(main):001:0> GC.start
=> nil
irb(main):002:0> GC.count
=> 22
irb(main):003:0> 8.times { Array.new(1_000_000/8) } ; puts
=> nil
irb(main):004:0> GC.count
=> 23
irb(main):005:0> require 'objspace'
=> true
irb(main):006:0> ObjectSpace.memsize_of(Array.new(1_000_000/8))
=> 1000000
irb(main):007:0> 

Ruby protects your processes from using up all the available memory on your computer when making throw away copies of large objects.

However, this setting is very outdated; it was introduced many years ago by Matz when memory was scarce.

For an added bonus, using very nasty hacks we can even raise this number at runtime.

sam@ubuntu:~/Source/discourse$ irb
irb(main):001:0> 15.times { Array.new(16_000_000/8) }; puts
=> nil
irb(main):002:0> GC.start; GC.count
=> 38
irb(main):003:0> 15.times { Array.new(1_000_000/8) }; puts 
=> nil
irb(main):004:0> GC.count
=> 38

MRI will raise the GC limit if it is over-exhausted (by a percentage each time). However, in the real world, in a real Rails app, the GC limit is very unlikely to grow much during runtime; you just don't allocate huge objects regularly. So, we usually use the environment variable RUBY_GC_MALLOC_LIMIT to push this number up.

Every Rails app should have a higher malloc limit. The default is too small; this tiny default means that many Rails apps in the wild are getting zero benefit from the faster "lazy sweep" algorithm implemented in Ruby 1.9.3. Furthermore, low malloc limits mean that the GC runs way too often. Typical Rails requests will regularly allocate a couple of megabytes of RAM.

What should you set it to? It totally depends on the app. For Discourse we recommend 50MB. The downside of setting this too high is that you are increasing general memory fragmentation.

How much memory is a page view allocating?

rack-mini-profiler (in master) contains a very handy report to get a handle on memory use in your various pages. Just append ?pp=profiler-gc to the end of your URL:

Overview
------------------------------------
Initial state: object count - 377099 , memory allocated outside heap (bytes) 76765247

GC Stats: count : 114, heap_used : 4283, heap_length : 4312, heap_increment : 0, heap_live_num : 459148, heap_free_num : 1283203, heap_final_num : 0, total_allocated_object : 6292870, total_freed_object : 5833722

New bytes allocated outside of Ruby heaps: 1458308
New objects: 38363

ObjectSpace delta caused by request:
--------------------------------------------
String : 18638
Array : 10053
Hash : 3229
ActiveRecord::AttributeMethods::TimeZoneConversion::Type : 1297
Rational : 790
Time : 615
MatchData : 364
RubyVM::Env : 330

Here we can see that the front page is causing about 1.45MB to be allocated, so out of the box, without any malloc tuning, we can only handle 5 or so requests before the 8MB malloc limit triggers a GC. Those 5 requests only generate 190k or so objects in the heap, which is way below heap_free_num.

We spent a lot of time tuning Rails 4 to cut down on allocations, before we started tuning this we were easily allocating double the amount for a front page request.

note: running this report is unavoidably likely to cause your Ruby heaps to grow, due to iteration through ObjectSpace with the GC disabled. It is recommended you cycle your processes in production after an analysis session.

The trouble with the heap growth algorithm

Ruby heaps will grow by a factor of 1.8 (times the used heap size post GC) every time heap space hits a threshold. This is rather problematic for real world apps. The number of heaps available may increase during an app's lifecycle, but it will never decrease. Say you have 1,000 heaps in play; next time heaps grow you will jump to 1,800 heaps. However, your app may have optimal performance with 1,400 heaps. Remember, the more used heaps you have, the longer a GC will take to run.

note: the Ruby heap growth factor is configurable and adaptable in Ruby 2.1.

We have some control over the heap count using RUBY_HEAP_MIN_SLOTS; we can tell Ruby to pre-allocate heap space. Unfortunately, in Ruby 2.0 p247 this is a bit buggy and will result in over-allocation; for example, here we ask for 1,000 heaps' worth of slots but get 1,803 heaps in Ruby 2.0:

sam@ubuntu:~/Source$ rbenv shell ruby-head
sam@ubuntu:~/Source$ RUBY_HEAP_MIN_SLOTS=$(( 408*1000  )) ruby -e "puts GC.stat[:heap_length]"
1000

sam@ubuntu:~/Source$ rbenv shell 2.0.0-p247
sam@ubuntu:~/Source$ RUBY_HEAP_MIN_SLOTS=$(( 408*1000  )) ruby -e "puts GC.stat[:heap_length]"
1803

So, you can use this setting, but be careful with it; it will over-commit heap space, meaning slower GC times. See also: http://bugs.ruby-lang.org/issues/9134

We can also attempt to control heap space with RUBY_FREE_MIN. Unfortunately this setting does not work as expected.

sam@ubuntu:~/Source$ RUBY_FREE_MIN=$(( 408*10000  )) ruby -e " GC.start; p GC.stat[:heap_length]"
81
sam@ubuntu:~/Source$ RUBY_FREE_MIN=$(( 408*20000  )) ruby -e " GC.start; p GC.stat[:heap_length]"
81

All this setting does is force Ruby to evaluate, more aggressively, whether it needs to grow the heap.

Out of the box this is how the algorithm works, more or less:

  1. GC sweep runs
  2. Ruby checks if the free_num (the number of free objects in the used heaps) is smaller than free_min aka (RUBY_FREE_MIN)
  3. Ruby runs set_heaps_increment and heaps_increment
  4. set_heaps_increment checks to see if heaps_used * 1.8 is larger than heaps_length ... if it is, it will grow the heap by 0.8 * heaps_used.

The key here is that all free_num does is trigger a check. Out of the box free_min is dynamically adjusted to 20% of heaps_used. I can not think of any reason you would really play with this setting.

The implementation is much more intuitive in Ruby 2.1; see: http://bugs.ruby-lang.org/issues/9137

The holy grail of an out-of-band GC

A full GC can take a long time; in fact, on a droplet at Digital Ocean, this blog can spend upwards of 100ms performing a GC.

This GC stops the world and "stalls" your customers. In an ideal world you would be able to control the GC and run it between requests. As long as you have enough worker processes, this stall will be invisible to your customers.

The problem though is that it is very hard to predict when a GC will run, cause malloc information is totally invisible in Ruby 2.0. We are hoping to expose more information in Ruby 2.1.

This means that if RUBY_GC_MALLOC_LIMIT is set too low, you have no way of predicting when a GC will run.

There have been two attempts at an out-of-band-gc made public.

  1. Unicorn OOBGC http://unicorn.bogomips.org/Unicorn/OobGC.html
  2. Passenger OOBGC https://github.com/phusion/passenger/blob/master/lib/phusion_passenger/rack/out_of_band_gc.rb

Both attempts are severely flawed. In modern web apps the amount of data a page can allocate varies wildly. Some pages may allocate a tiny amount of memory and objects, others lots.

You cannot deterministically guess when it's best to run the GC based on request count alone. This means these attempts often run the GC way too often.

Worse still, they often attempt to run GC.disable, which has an extreme possibility of creating rogue Ruby processes with massive heaps. Once you disable the GC all bets are off. A simple loop can create a very problematic process.

irb(main):008:0> GC.disable
=> false
irb(main):009:0> 100_000_000.times{ "" } ; p
=> nil
irb(main):010:0> GC.enable
=> true
irb(main):011:0> GC.stat
=> {:count=>4472, :heap_used=>246240, :heap_length=>286126, :heap_increment=>39886, :heap_live_num=>100082676, :heap_free_num=>42424, :heap_final_num=>0, :total_allocated_object=>289369670, :total_freed_object=>189286994}
irb(main):014:0> t=Time.now; GC.start; puts (Time.now - t)
0.15620451

There, we now have a process that takes 156ms to run the GC on bleeding edge hardware.

And let's not forget the obscene memory usage

sam@ubuntu:~/Source/discourse$ smem
  PID User     Command                         Swap      USS      PSS      RSS 
 8906 sam      irb                                0  3982736  3983692  3985700

Even with all the missing information, we can do better than a simple, flawed request count. At Discourse I have been working on an out-of-band GC that works quite successfully in production. First, we need to make sure the malloc limit rarely affects us; we do so by raising it to 40MB.

Source is here: https://github.com/discourse/discourse/blob/master/lib/middleware/unicorn_oobgc.rb

It attempts to keep a running estimate of the live object count that will trigger a GC using:

# the closer this is to the GC run the more accurate it is
  def estimate_live_num_at_gc(stat)
    stat[:heap_live_num] + stat[:heap_free_num]
  end

This is extremely conservative and not that accurate. We can also experiment with:

# base on heap length
  def estimate_live_num_at_gc(stat)
    stat[:heap_length] * 408 # objects per slot 
  end

The algorithm then tries to leave room for 2 "big" requests; if it notices there is not enough room, it will preempt a GC.
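
Stripped down to its essence, the idea looks something like the following Rack middleware. This is a minimal sketch, not the actual Discourse implementation; MAX_DELTA is an assumed stand-in for the running per-request allocation estimate the real code maintains.

class SimpleOobGc
  MAX_DELTA = 50_000 # assumed worst case: objects a single request may allocate

  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  ensure
    stat = GC.stat
    # conservative estimate of the live object count that will trigger the next GC
    expect_gc_at = stat[:heap_live_num] + stat[:heap_free_num]
    # not enough headroom left for two "big" requests? GC now, between requests
    GC.start if expect_gc_at - stat[:heap_live_num] < 2 * MAX_DELTA
  end
end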

This worked very successfully for us at http://discourse.ubuntu.com as can be seen when running this in verbose mode.

OobGC hit pid: 28701 req: 56 max delta: 111782 expect at: 893328 67ms saved
OobGC hit pid: 28680 req: 57 max delta: 50000 expect at: 893328 64ms saved
OobGC hit pid: 28728 req: 45 max delta: 112105 expect at: 893328 61ms saved
OobGC hit pid: 28687 req: 49 max delta: 50000 expect at: 949063 74ms saved
OobGC hit pid: 28707 req: 66 max delta: 50000 expect at: 893328 71ms saved
OobGC hit pid: 28695 req: 89 max delta: 50000 expect at: 893328 67ms saved
OobGC hit pid: 28728 req: 20 max delta: 71807 expect at: 893328 61ms saved
OobGC hit pid: 28680 req: 43 max delta: 62992 expect at: 893328 68ms saved
OobGC hit pid: 28701 req: 75 max delta: 50000 expect at: 893328 73ms saved
OobGC hit pid: 28707 req: 52 max delta: 50000 expect at: 893328 68ms saved
OobGC hit pid: 28695 req: 34 max delta: 81301 expect at: 893328 61ms saved
OobGC hit pid: 28687 req: 68 max delta: 50000 expect at: 949063 74ms saved
OobGC hit pid: 28728 req: 69 max delta: 50000 expect at: 893358 69ms saved
OobGC hit pid: 28701 req: 39 max delta: 73273 expect at: 893328 61ms saved
OobGC hit pid: 28695 req: 47 max delta: 115067 expect at: 893328 65ms saved
OobGC hit pid: 28707 req: 48 max delta: 185909 expect at: 893328 68ms saved
OobGC hit pid: 28680 req: 85 max delta: 50000 expect at: 893328 68ms saved
OobGC hit pid: 28695 req: 20 max delta: 52118 expect at: 893328 62ms saved
OobGC hit pid: 28687 req: 63 max delta: 50000 expect at: 949063 73ms saved
OobGC hit pid: 28728 req: 42 max delta: 64944 expect at: 893328 63ms saved
OobGC hit pid: 28680 req: 41 max delta: 138184 expect at: 893328 65ms saved
OobGC hit pid: 28701 req: 50 max delta: 50000 expect at: 893328 70ms saved
OobGC miss pid: 28707 reqs: 50 max delta: 50000

Once in a while you get a miss, because it is impossible to predict malloc and potentially massive requests; however, in general it helps a lot. You can see the out-of-band GC kicking in at different request counts, sometimes we can handle 20 requests between GCs, other times 80. As an added bonus, you don't need to run unicorn_killers and the risk is very low.

Keep exploring

Given the built in tooling and Mini Profiler, you are not running blind, you can do quite a lot to investigate and understand your GC behavior.

Try running these snippets and tools, try exploring.

Many very exciting changes both to GC algorithms and tooling are forthcoming in Ruby 2.1 thanks to work by Koichi Sasada, Aman Gupta and others. I hope to blog about it.

Special thank you to Koichi for reviewing this article.

Commenting powered by Discourse


Leaving comments on this blog requires a certain amount of commitment. You have to jump to another site to log in.

Compare this to the "least amount of friction possible".

A lot of work.

When I made the move to Discourse I thought of this state-of-affairs as a temporary situation. I would add the "traditional" comment box at the bottom and make it super easy to add comments, I would transparently create accounts and all that jazz.

A couple of months in, I am not so sure.

When I think about comments on my blog these are my priorities.

  1. Give users room to type in interesting and insightful comments.
  2. Provide great support for followup (reply by email, email notifications)
  3. Rich markdown editor with edit and preview support.
  4. Comment format must be markdown, anything else and I am risking complex conversion later on.
  5. Zero spam
  6. Comments / emails / content is unconditionally hosted on my server and under my control. Not hosted by some third party under their rules with their advertising injected and my readers tracked.
  7. Trivial for me to moderate.

You may notice that, "Make it super easy for anybody on the Internet to contribute a random unfiltered opinion" is surprisingly missing from this list.

What does Discourse score in my 7 point dream list? A solid 7 out of 7. The extra friction completely eradicated spam and enables me to have rich conversations with my readers. I have their emails, I can communicate with them.

About Spam

@kellabyte: Trying out disqus cuz I just got 600 more spam comments on my blog since earlier. Gotta end this madness.

I eliminated the vast majority of spam on this blog a while back. I blogged about it 2 years ago. Critics claimed that this approach was doomed to fail if it ever got popular.

Discourse is popular. Yet I get zero spam. By zero I mean that in the last 55 days I got no spam on this blog, nor did I have to delete any spam from this blog.

http://discuss.samsaffron.com, the site that takes comments for this blog, is excluded from Google using a robots.txt rule. Coupled with the built-in immune system Discourse already has, this leaves spamming software very confused. Not only do they need to run a full PhantomJS like engine, they also need to register accounts and know about the tie discuss.samsaffron.com has to this blog.

Too much work, so no Rolex for me. This sucks because I really want a v8gra branded Rolex watch.

Do I care about all the missing comments I am not getting?

My priority has shifted strongly, I would prefer to engage in interesting conversations as opposed to collecting a massive collage of "great job +1" comments. My previous blog post is a great example. I have room to expand points, add code samples and so on. I have the confidence that my replies will be read by the people who asked for the extra info.

I often hear people say "just disable comments on your blog" as a general solution for low quality and spam. I feel I have found a different way here, and I love it.

I am not sure I will bring back the trivial "add comment" text box.


Call to Action: Long running Ruby benchmark


I would love a long running Ruby and Rails set of benchmarks. I talked about this at GoGaRuCo and would like to follow up.

For a very long time Python has had the pypy speed center: http://speed.pypy.org

Recently, golang has added its own: http://goperfd.appspot.com/perf

Why is this so important?

Writing fast software requires data. We need to know right away when our framework or platform is getting slower or faster. This information can be fed directly to the team informing them of big wins and losses. Often small changes can lead to unexpected gains or losses.

Finding out about regressions months into the development cycle can often incur a massive cost; fixing bugs early on is cheap. Usually the longer we wait the more expensive it is to fix.


Source: NASA

Imagine if for every Rails major release the team could announce not only that it is N% faster in certain areas but also attribute the improvements to particular commits.

Imagine if we could have a wide picture of the performance gains and losses of new Ruby versions, with full context on why something slowed down or sped up.

What do we have today?

We have a fair amount

The Discourse benchmark can be used to benchmark a "real world" app, it integrates the entire system. The other small microbenchmarks can be used to bench specific features and areas.

The importance of stable dedicated hardware

The server provisioned by Ninefold is a bare metal server, once we ensure power management is disabled it can be used to provide consistent results, unlike virtual hosts which are often dependent on external factors. We need to produce results and reproduce them on multiple sets of hardware to ensure we did not mess up our harness.

Tracking all metrics

There is a sea of metrics we can gather: GC times, Rails bootup times, memory usage, page load times, RSS, requests per second and so on. We don't need to be shoehorned into tiny micro benches. When tracking performance we need to focus on both the narrow and the wide.

Often performance is a game of trade-offs, you make some areas slower so some other, more important areas, become faster.

Raw execution speed, memory usage and disk usage all matter. Our performance can depend on memory allocators (like jemalloc or tcmalloc) and GC tuning, we can measure some specific best practice environments when gathering stats.

Graphing some imperfect data

Koichi Sasada, Aman Gupta and others have been tirelessly working to improve the Ruby GC in the last few months.

I ran the Discourse bench and a few other tests on 15 or so builds. The data is imperfect and needs to be reproduced; that said, this is a starting point.

Here we can see a graph showing how the "long living" object count has reduced from build to build, somewhere between the end of November and December there was a huge decrease.

Here I am graphing the median time for a homepage request on Discourse over a bunch of builds

There are two interesting jumps, the first was a big slowdown when the RGenGC was introduced. Later in mid November we recovered the performance but it has regressed since.

Here it is clear to see the massive improvement the generational GC provided to the 75th percentile. Ruby 2.1 is going to be a massive improvement for those not doing any GC tuning.

Similarly Rails boot is much improved.

What do we need?

A long term benchmarking project will take quite a while to build; personally I can not afford to dedicate more than a few hours per week.

Foremost, we need people. Developers to build a UI, integrate existing tools and add benchmarks. Designers to make a nice site.

Some contributions to related projects can heavily improve the "benchmarking" experience, faster builds and faster gem installs would make a tremendous difference.

The project can be split quite cleanly into 2 sub-projects. First is information gathering, writing the scripts needed to collect all the historical data into a database of sorts. The second part is a web UI to present the results potentially reusing or extending https://github.com/tobami/codespeed .

More hardware and a nice domain name would also be awesome.

If you want to work on this, help build UI or frameworks for running long term benchmarks, contact me (either here or at sam.saffron<at>gmail.com)

Building a long term benchmark is critical for longevity and health of Ruby and Rails. I really hope we can get started on this soon. I wish I had more time to invest in this.

Vintage JavaScript begone


The Problem

These days all the cool kids are using Ember.js or Angular or Meteor or some other single page web application framework. If you deploy often, like we do at Discourse, you have a problem.

How can you get everyone to run the latest version of your JavaScript and CSS bundle? Since people do not reload full pages and just navigate around accumulating small JSON payloads, there is a strong possibility people end up on old versions and experience weird and wonderful bugs.

The message bus

One BIG criticism I have heard of Rails and Django lately is the lack of "realtime" support. This is an issue we foresaw over a year ago at Discourse.

Traditionally, people add more components to a Rails system to support "realtime" notifications. Be it Ruby-built systems like faye, non-Ruby systems like Node.js with socket.io, or outsourced systems like Pusher. For Discourse none of these were an option. We could not afford to complicate the setup process or outsource this stuff to a third party.

I built the message_bus gem to provide us with an engine for realtime updates:

https://github.com/SamSaffron/message_bus

At its core, message_bus gives you a very simple API to publish messages on the server and subscribe to them on the client:

# in ruby
MessageBus.publish('/my_channel', 'hello')

<!-- client side -->
<script src="message-bus.js" type="text/javascript"></script>
<script>
MessageBus.subscribe('/my_channel', function(data){
  alert(data);
});
</script>
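
You can also subscribe from the Ruby side, say in a background process. The basic shape looks like this (see the gem's README for the authoritative API):

# in ruby, e.g. a background worker
MessageBus.subscribe('/my_channel') do |msg|
  puts msg.data # => "hello"
end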

Behind this trivial API hides a fairly huge amount of feature goodness:

  1. This thing scales really well, clients "pull" information from a reliable pub sub channel, with minimal per-client housekeeping.
  2. Built in security (send messages to user or groups only)
  3. Built on rack hijack and thin async, so we support passenger, thin, unicorn and puma.
  4. Uses long polling, with an event machine event loop, can easily service thousands of clients from a single web server.
  5. Built in multi-site support (for hosting sub domains)

We use this system at Discourse to notify you of interesting things, update the topic pages live and so on.

On to the implementation

Given a message_bus, implementing a system for updating assets is fairly straightforward.

On the Rails side we calculate a digest that represents our application version.

def assets_digest
  @assets_digest ||= Digest::MD5.hexdigest(
    ActionView::Base.assets_manifest.assets.values.sort.join
  )
end

def notify_clients_if_needed
  # the global channel is special, it goes to all sites
  channel = "/global/asset-version"
  message = MessageBus.last_message(channel)

  unless message && message.data == assets_digest
    MessageBus.publish channel, assets_digest
  end
end

With every full page we deliver to the clients we include this magic digest:

Discourse.set('assetVersion','<%= Discourse.assets_digest %>');

Then on the client side we listen for version changes:

Discourse.MessageBus.subscribe("/global/asset-version", function(version){
  Discourse.set("assetVersion",version);

  if(Discourse.get("requiresRefresh")) {
    // since we can do this transparently for people browsing the forum
    //  hold back the message a couple of hours
    setTimeout(function() {
      bootbox.confirm(I18n.lookup("assets_changed_confirm"), function(){
        document.location.reload();
      });
    }, 1000 * 60 * 120);
  }
});

Finally, we hook into the transition to new routes to force a refresh if we detected assets have changed:

routeTo: function(path) {

  if(Discourse.get("requiresRefresh")){
    document.location.href = path;
    return;
  }
  //...
}

Since in the majority of spots we render "full links" and pass them through this magic method, this works well. Recently Robin added a second mechanism that allows us to trap every transition; however, it would require a double load which I wanted to avoid:

E.g., the following would also work:

Discourse.PageTracker.current().on('change', function() {
  if(Discourse.get("requiresRefresh")){
    document.location.reload();
    return;
  }
});

Summary

I agree that every single page application needs some sort of messaging bus. You can have one today, on Rails, if you start using message_bus.

Real-time is not holding us back with Rails, it is production ready.

Ruby 2.1 Garbage Collection: ready for production


The article "Ruby Garbage Collection: Still Not Ready for Production" has been making the rounds.

In it we learned that our GC algorithm is flawed and were prescribed some rather drastic and dangerous workarounds.

At the core it had one big demonstration:

Run this on Ruby 2.1.1 and you will be out of memory soon:

while true
  "a" * (1024 ** 2)
end

Malloc limits, Ruby and you

From very early versions of Ruby we always tracked memory allocation. This is why I found FUD comments such as this troubling:

the issue is that the Ruby GC is triggered on total number of objects, and not total amount of used memory

This is a clear misunderstanding of Ruby. In fact, the aforementioned article does not even mention that memory allocation may trigger a GC.

Historically Ruby was quite conservative issuing GCs based on the amount of memory allocated. Ruby keeps track of all memory allocated (using malloc) outside of the Ruby heaps between GCs. In Ruby 2.0, out-of-the-box every 8MB of allocations will result in a full GC. This number is way too small for almost any Rails app, which is why increasing RUBY_GC_MALLOC_LIMIT is one of the most cargo culted settings out there in the wild.

Matz picked this tiny number years ago when it was a reasonable default, however it was not revised till Ruby 2.1 landed.

For Ruby 2.1 Koichi decided to revamp this sub-system. The goal was to have defaults that work well for both scripts and web apps.

Instead of having a single malloc limit for our app, we now have a starting point malloc limit that will dynamically grow every time we trigger a GC by exceeding the limit. To stop unbound growth of the limit we have max values set.

We track memory allocations from 2 points in time:

  • memory allocated outside Ruby heaps since last minor GC
  • memory allocated since last major GC.

At any point in time we can get a snapshot of the current situation with GC.stat:

> GC.stat
=> {:count=>25,
 :heap_used=>263,
 :heap_length=>406,
 :heap_increment=>143,
 :heap_live_slot=>106806,
 :heap_free_slot=>398,
 :heap_final_slot=>0,
 :heap_swept_slot=>25258,
 :heap_eden_page_length=>263,
 :heap_tomb_page_length=>0,
 :total_allocated_object=>620998,
 :total_freed_object=>514192,
 :malloc_increase=>1572992,
 :malloc_limit=>16777216,
 :minor_gc_count=>21,
 :major_gc_count=>4,
 :remembered_shady_object=>1233,
 :remembered_shady_object_limit=>1694,
 :old_object=>65229,
 :old_object_limit=>93260,
 :oldmalloc_increase=>2298872,
 :oldmalloc_limit=>16777216}

malloc_increase denotes the amount of memory we allocated since the last minor GC; oldmalloc_increase the amount since the last major GC.

We can tune our settings, from "Ruby 2.1 Out-of-Band GC":

RUBY_GC_MALLOC_LIMIT: (default: 16MB)
RUBY_GC_MALLOC_LIMIT_MAX: (default: 32MB)
RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR: (default: 1.4x)

and

RUBY_GC_OLDMALLOC_LIMIT: (default: 16MB)
RUBY_GC_OLDMALLOC_LIMIT_MAX: (default: 128MB)
RUBY_GC_OLDMALLOC_LIMIT_GROWTH_FACTOR: (default: 1.2x)
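
Roughly speaking, the limit evolves like this after each GC triggered by exceeding it. This is a sketch of the behaviour described above, not the actual gc.c code:

# assumed simplification: grow by the growth factor, capped at the max
def next_malloc_limit(current, growth_factor, max)
  [(current * growth_factor).to_i, max].min
end

limit = 16 * 1024 * 1024 # RUBY_GC_MALLOC_LIMIT
3.times do
  limit = next_malloc_limit(limit, 1.4, 32 * 1024 * 1024) # RUBY_GC_MALLOC_LIMIT_MAX
  puts limit # grows 1.4x per trigger until it hits the 32MB cap
end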

So, in theory, this unbound memory growth is not possible for the script above. The two MAX values should just cap the growth and force GCs.

However, this is not the case in Ruby 2.1.1

Investigating the issue

We spent a lot of time ensuring we had extensive instrumentation built in to Ruby 2.1, we added memory profiling hooks, we added GC hooks, we exposed a large amount of internal information. This has certainly paid off.

Analyzing the issue raised by this mini script is trivial using the gc_tracer gem. This gem allows us to get a very detailed snapshot of the system every time a GC is triggered and store it in a text file, easily consumable by a spreadsheet.

We simply add this to the rogue script:

require 'gc_tracer'
GC::Tracer.start_logging("log.txt")

And get a very detailed trace back in the text file:

In the snippet above we can see minor GCs being triggered by exceeding malloc limits (where major_by is 0) and major GCs being triggered by exceeding malloc limits. We can see our malloc limit and old malloc limit growing. We can see when GC starts and ends, and lots more.

Trouble is, our limit max for both oldmalloc and malloc grows well beyond the max values we have defined:

So, bottom line is, looks like we have a straight out bug.

https://bugs.ruby-lang.org/issues/9687

It is a one line bug that will be patched in Ruby 2.1.2 and is already fixed in master.

Are you affected by this bug?

It is possible your production app on Ruby 2.1.1 is impacted by this. The simplest way to find out is to issue a GC.stat as soon as memory usage is really high.

The script above is very aggressive and triggers the pathological issue; it is quite possible you are not even pushing against malloc limits. The only way to find out is to measure.
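
For example, a quick check along these lines (run inline here for brevity; on a live process the same snippet can be sent via rbtrace):

s = GC.stat
puts "malloc_increase: #{s[:malloc_increase]} (limit #{s[:malloc_limit]})"
puts "oldmalloc_increase: #{s[:oldmalloc_increase]} (limit #{s[:oldmalloc_limit]})"
# limits that have grown far beyond RUBY_GC_MALLOC_LIMIT_MAX / RUBY_GC_OLDMALLOC_LIMIT_MAX
# are a strong hint you are hitting this bug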

General memory growth under Ruby 2.1.1

A more complicated issue we need to tackle is the more common "memory doubling" issue under Ruby 2.1.1. The general complaint goes something along the lines of "I just upgraded Ruby and now my RSS has doubled".

This issue is described in detail here: https://bugs.ruby-lang.org/issues/9607

Memory usage growth is partly unavoidable when employing a generational GC. A certain section of the heap is getting scanned far less often. It's a performance/memory trade-off. That said, the algorithm used in 2.1 is a bit too simplistic.

If an object ever survives a minor GC it will be flagged as oldgen; these objects will only be scanned during a major GC. This algorithm is particularly problematic for web applications.

Web applications perform a large amount of "medium" lived memory allocations. A large number of objects are needed for the lifetime of a web request. If a minor GC hits in the middle of a web request we will "promote" a bunch of objects to the "long lived" oldgen even though they will no longer be needed at the end of the request.

This has a few bad side effects:

  1. It forces major GC to run more often (growth of oldgen is a trigger for running a major GC)
  2. It forces the oldgen heaps to grow beyond what we need.
  3. A bunch of memory is retained when it is clearly not needed.

.NET and Java employ 3 generations to overcome this issue. Survivors in Gen 0 collections are promoted to Gen 1 and so on.

Koichi is planning on refining the current algorithm to employ a somewhat similar technique of deferred promotion. Instead of promoting objects to oldgen on the first minor GC, an object will have to survive two minor GCs to be promoted. This means that if no more than 1 minor GC runs during a request, our heaps will be able to stay at optimal sizes. This work is already prototyped in Ruby 2.1, see RGENGC_THREEGEN in gc.c (note, the name is likely to change). This is slotted to be released in Ruby 2.2.

We can see this problem in action using this somewhat simplistic test:

@retained = []
@rand = Random.new(999)

MAX_STRING_SIZE = 100

def stress(allocate_count, retain_count, chunk_size)
  chunk = []
  while retain_count > 0 || allocate_count > 0
    if retain_count == 0 || (@rand.rand < 0.5 && allocate_count > 0)
      chunk << " " * (@rand.rand * MAX_STRING_SIZE).to_i
      allocate_count -= 1
      if chunk.length > chunk_size
        chunk = []
      end
    else
      @retained << " " * (@rand.rand * MAX_STRING_SIZE).to_i
      retain_count -= 1
    end
  end
end

start = Time.now
# simulate rails boot, 2M objects allocated 600K retained in memory
stress(2_000_000, 600_000, 200_000)

# simulate 100 requests that allocate 100K objects
stress(10_000_000, 0, 100_000)


puts "Duration: #{(Time.now - start).to_f}"

puts "RSS: #{`ps -eo rss,pid | grep #{Process.pid} | grep -v grep | awk '{ print $1;  }'`}"

In Ruby 2.0 we get:

% ruby stress.rb
Duration: 10.074556277
RSS: 122784

In Ruby 2.1.1 we get:

% ruby stress.rb
Duration: 7.031792076
RSS: 236244

Performance has improved, but memory almost doubled.

To mitigate the current pain point we can use the new RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR environment var.

Out of the box we trigger a major gc if our oldobject count doubles. We can tune this down to say 1.3 times and see a significant improvement memory wise:

% RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.3 ruby stress.rb
Duration: 6.85115156
RSS: 184928

On memory constrained machines we can go even further and disable generational GC altogether.

% RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 ruby stress.rb 
Duration: 6.759709765
RSS: 149728

We can always add jemalloc for good measure to shave off an extra 10% or so:

LD_PRELOAD=/home/sam/Source/jemalloc-3.5.0/lib/libjemalloc.so RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 ruby stress.rb 
Duration: 6.204024629
RSS: 144440

If that is still not enough you can push malloc limits down (and have more GCs run due to hitting it)

% RUBY_GC_MALLOC_LIMIT_MAX=8000000 RUBY_GC_OLDMALLOC_LIMIT_MAX=8000000  LD_PRELOAD=/home/sam/Source/jemalloc-3.5.0/lib/libjemalloc.so RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 ruby stress.rb
Duration: 9.02354988
RSS: 120668

Which is nice, since we are back to Ruby 2.0 memory numbers, though we lost a pile of performance.

Ruby 2.1 is ready for production

Ruby 2.1 has been running in production at GitHub for a few months with great success. The 2.1.0 release was a little rough; 2.1.1 addresses the majority of the big issues it had. 2.1.2 will address the malloc issue, which may or may not affect you.

If you are considering deploying Ruby 2.1 I would strongly urge giving GitHub Ruby a go since it contains a fairly drastic performance boost due to funny-falcon's excellent method cache patch.

Performance has much improved at the cost of memory; that said, you can tune memory as needed and measure the impact of various settings effectively.

Summary

  • If you discover any issues, please report them on https://bugs.ruby-lang.org/
  • Use Ruby 2.1.1 in production, upgrade to 2.1.2 as soon as it is released
  • Be sure to look at jemalloc and GC tuning for memory constrained systems. See also: https://bugs.ruby-lang.org/issues/9113
  • Always be measuring. If you are seeing issues run GC.stat, you can attach to the rogue process using rbtrace a gem you should consider including on production systems.

Resources:

Speeding up Rails 4.2


Recently Godfrey Chan got Discourse working on Rails Master. It was a rather long task that involved some changes to Discourse internals and some changes to Rails internals.

Knowing Rails 4.2 was just around the corner, I decided it seemed like the perfect time to see how performance was looking. Seeing Rails 4.2 contains the Adequate Record patches, I was super eager to see how Discourse fared.

The Discourse benchmark

Answering the question of "how fast is Ruby?" or "how fast is Rails?" is something people usually answer with the following set of steps.

  1. Upgrade production to a certain version of Rails
  2. Look at New Relic reports
  3. Depending on how performance looks, either complain on Twitter and/or blog, or write a few kind words.

The trouble is that performance testing after release is the absolute worst time. Code is in production, fixing stuff is tricky, rolling back often impractical.

At Discourse I developed the Discourse benchmark. It is a simple script that loads up a few thousand users and topics and then proceeds to measure performance of various hot spots using apache bench.

I have found that the results of the benchmark are strikingly similar to real-world performance we see. If the benchmark is faster, it is very likely that it will also be faster in production. We use this benchmark to test Ruby before major Ruby releases. Koichi used this benchmark to help optimise the current GC algorithms in MRI.

It is often tempting to look at micro benchmarks when working on performance issues. Micro benchmarks are a great tool, but MUST be followed with bigger macro benchmarks to see the real impact. A 1000% speedup for a routine that is called once a day in a background job has significantly less impact than a 0.5% improvement to a routine that is called on every web request.

How was Rails 4.2 Looking?

For over a year now, Discourse has had the ability to dual boot. This allowed me to quickly run a simple benchmark to see where we were at (arel, rails):

% RAILS_MASTER=1 ruby script/bench.rb

name: 
  [percentile]: [time]

categories_admin:
  50: 123
  75: 136
  90: 147
  99: 224
home_admin:
  50: 111
  75: 122
  90: 147
  99: 224
topic_admin:
  50: 69
  75: 81
  90: 89
  99: 185
categories:
  50: 81
  75: 126
  90: 138
  99: 211
home:
  50: 53
  75: 63
  90: 100
  99: 187
topic:
  50: 20
  75: 21
  90: 23
  99: 77
timings:
  load_rails: 3052
ruby-version: 2.1.2-p95
rss_kb: 241604

VS.

categories_admin:
  50: 62
  75: 66
  90: 77
  99: 193
home_admin:
  50: 51
  75: 53
  90: 55
  99: 175
topic_admin:
  50: 27
  75: 27
  90: 28
  99: 87
categories:
  50: 53
  75: 55
  90: 98
  99: 173
home:
  50: 35
  75: 37
  90: 55
  99: 154
topic:
  50: 12
  75: 12
  90: 13
  99: 66
timings:
  load_rails: 3047
ruby-version: 2.1.2-p95
rss_kb: 263948

Across the board we were running at around half speed. Pages that were taking 12ms today were taking 20ms on Rails 4.2.

Given these results I decided to take some time off my regular work and try to isolate what went wrong. Contribute a few patches and help the Rails team correct the issue prior to the impending RC release.

Before I continue I think it is worth mentioning that the lion's share of the performance optimisations we ended up adding were authored by Sean Griffin so a big huge thanks to him.

Cracking open the black box

I find flame graphs are instrumental at quickly discovering what is going on. Flame graphs in Ruby 2.1 now have better fidelity than they did in 2.0 thanks to feeding directly off stackprof.

My methodology is quite simple.

  1. Go to a page in Rails 4.1.8 mode, reload a few times, then tack on ?pp=flamegraph to see a flamegraph
  2. Repeat the process on Rails 4.2
  3. Zoom in on similar area and look for performance regressions.

For example here is the before graph on Rails 4.1.8 (home page)

rails_baseline.html (214.9 KB)

Compared to the after graph on Rails Master (home page)

rails_master_v1.html (253.6 KB)

The first point of interest is the sample count:

Rails 4.1.8

vs.

Master

Flamegraph attempts to grab a sample every 0.5ms of wall time; the reality is that due to depth of stack, timers and so on you end up getting a full sample every 0.5ms - 2ms. That said, we can use the sample counts as a good rough estimate of execution time.

It looks like Rails master is taking up 108 samples where current Rails was taking 74, a 46% increase in the above example.

We can see a huge amount of the additional overhead is spent in arel.

Next we zoom in to particular similar areas and compare:

Rails 4.1.8

Master

We can see expensive calls to reorder_bind_params that we were not seeing in Rails 4.1.8.

Clicking on the frame reveals:

So 14.5% of our cost is in this method.

Additionally we see other new areas of overhead:

Like a method missing being invoked when casting DB values

We repeat the process for various pages in our app, and discover that 3.2% of the time on the categories page is spent in ActiveRecord::AttributeMethods::PrimaryKey::ClassMethods#primary_key and a pile of other little cuts.

With this information we can start working on various fixes:

There are a lot of little cuts and some big cuts, here are some examples

The list goes on, slowly but surely we regain performance.

Memory profiling

Another very important tool in our arsenal is memory profiler. If we include it in our Gemfile, rack-mini-profiler will wire it up for us, then appending pp=profile-gc-ruby-head to any page gives us a memory report (note this requires Ruby 2.1 as it provides the new APIs needed):
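
In Gemfile terms that is simply (version constraints omitted):

# Gemfile
gem 'rack-mini-profiler'
gem 'memory_profiler'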

It reveals a huge allocation increase in Rails master, which was allocating 131314 objects on the front page vs 17651 objects in Rails 4.1.8

We can dig down and discover a huge addition of allocations in arel and activerecord

arel/lib x 107607
activerecord/lib x 12778

We can see the vast majority is being allocated at:

arel/lib/arel/visitors/visitor.rb x 106435

We can track down the 3 particular lines:

arel/lib/arel/visitors/visitor.rb:11 x 74432
lib/arel/visitors/visitor.rb:13 x 22880
lib/arel/visitors/visitor.rb:12 x 7040

This kind of analysis helps us fix issues like:

and

It is very important to be able to prove you have a real issue before applying optimisations blindly. Memory profiler allows us to pinpoint specifics and demonstrate the actual gain.

It is fascinating to see which particular strings are allocated where. For example the empty string is allocated 702 times during a front page load:

"" x 702
    activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb:433 x 263
    activerecord/lib/active_record/type/boolean.rb:11 x 153
    omniauth-1.2.2/lib/omniauth/strategy.rb:396 x 84

We can work through this information, and paper cut by paper cut eliminate a large number of allocations resulting in better performance.
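
To give a flavour of what one of these paper cuts looks like, here is an illustrative before/after (my own example, not one of the actual Rails patches); reusing a single frozen string avoids an allocation on every call:

# before: allocates a fresh empty string on every call
def display_name(raw)
  raw || ""
end

# after: reuses one frozen string for the lifetime of the process
EMPTY_STRING = "".freeze

def display_name(raw)
  raw || EMPTY_STRING
end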

Where are we with Rails 4.2 RC1

categories_admin:
  50: 69
  75: 74
  90: 86
  99: 165
home_admin:
  50: 55
  75: 57
  90: 66
  99: 133
topic_admin:
  50: 27
  75: 27
  90: 28
  99: 108
categories:
  50: 56
  75: 58
  90: 60
  99: 192
home:
  50: 39
  75: 40
  90: 42
  99: 165
topic:
  50: 13
  75: 13
  90: 14
  99: 75

So it looks like we are more or less 10% slower than 4.1, which is a huge improvement over where we were a couple of weeks ago (considering we were almost 100% slower).

There is still plenty of work left to do, but we are on the right path.

A taste of things to come

Lars Kanis, a maintainer of the pg gem, has been working for a while now on a system for natively casting values in the pg gem. Up till now the pg gem has followed libpq quite closely. When you query the DB using libpq you are always provided with strings. Historically the pg gem just echoes them back to the Ruby program.

So for example:

 ActiveRecord::Base.connection.raw_connection.exec("select 1").values[0][0]
=> "1"

The new pg release allows us to get back a number here, as opposed to a string. This helps us avoid allocating, converting and garbage collecting a string, just to get a number from our database. The cost of these conversions adds up quite a lot. It is much cheaper to do conversions in C code and avoid the allocation of objects that the Ruby GC needs to track.

cnn = ActiveRecord::Base.connection.raw_connection; puts
cnn.type_map_for_results = PG::BasicTypeMapForResults.new cnn; puts
cnn.exec('select 1').values[0][0]
=> 1

The tricky thing though is that people heavily over-query when using Active Record.

# no need to perform conversions for created_at and updated_at
car_color = Car.first.color

# above is more efficiently written as car_color = Car.limit(1).pluck(:color).first

If you perform conversions upfront for columns people never consume you can end up with a severe performance hit. In particular converting times is a very expensive operation which involves string parsing.

In my benchmarking it is showing a 5-10% improvement for the Discourse bench which in almost all cases means we are faster than Rails 4.1.

I think it is fairly likely that we can be 10-20% faster than Rails 4.1 by the time Rails 5 ships provided we watch the benchmarks closely and keep optimising.

Side Effects

A very interesting and rewarding side effect of this work was a series of optimisations I was able to apply to Discourse while looking at all the data.

In particular I built us a distributed cache powered by the message_bus gem. A Hash-like structure that is usable from any process running Discourse (Discourse instances can span multiple machines). Using it I was able to apply caching to 3 places that sorely needed it and avoid expensive database work.

Conclusion

The recent performance work strongly highlights the importance of a long running benchmark. There is no single magical method we can fix that will recover a year of various performance regressions. The earlier we pick up on regressions the easier they are to fix.

It also highlights the "paper cut" nature of this kind of work. It is easy to be flippant and say that 1% here and 0.5% there does not matter. However, repeat this process enough and you have a product that is 100% slower. We need to look at trends to see where we are at, slow down dev and work on performance when performance regresses.

The Rails team did an admirable job at addressing a lot of the perf regressions we found in very short order. Sean did some truly impressive work. Godfrey did a spectacular job getting Discourse running on Rails master.

However, I am very sad that we do not have a long running benchmark for either Ruby or Rails and worry that if we don't we are bound to repeat this story again.

Announcing rubybench.org


A year ago I put out a call to action for long running benchmarks.

I got a great response and lots of people lent a hand trying to get a system built. The project itself was quite slow moving, we had some working bits, but nothing really complete to show.

A few months ago Alan Guo Xiang Tan contacted me about the project. In the past few months Alan has been working tirelessly on the project.

Alan rebooted the project and built a Docker backend for running our performance tests and web frontend.

The first UI tracks performance across stable releases of Ruby. The goal is to provide you with a good gauge on what will happen when you upgrade Ruby.

  • Will it be faster?
  • Will it consume more or less memory?

The second UI tracks Ruby's progress through the various commits.

  • Did a particular commit improve/impact performance?
  • How is memory doing?

Rubybench is already helping ruby

In the very short amount of time it has been running, rubybench was responsible for isolating a regression in the URI library. Our goal is for it to become a tool that the Ruby core team loves using. A tool that automatically finds regressions and allows the core team to feel safer when making performance related changes.

Call for help

We still need lots of help with the project:

  • Design help with home page and logo.
  • Coding help with both the web and the runner.
  • Testing, to confirm all our testing methodology is correct and results reproducible.
  • Come to our forum and make some suggestions or ask some questions.

Call for sponsors

We are in desperate need of a sponsor. We can not run rubybench without a bare metal server. We started a topic to discuss our requirements.

Contact me at sam.saffron at gmail.com, if you would like to help out!

FAQ

What about JRuby / Rubinius and other implementations?

For our initial release we focused on MRI, however if we have enough hardware and help we would love to include other implementations. We would also love to work with various implementers to ensure the bench suite is both accurate and reproducible.

Why all the focus on Docker?

A guiding goal for the rubybench project is to have repeatable easy to generate test results. By using Docker we are able to ensure all our tests are repeatable.

Will you run benchmarks on cloud hosting services?

We will not publish CPU results gathered on virtual hosts where we can not control our CPU allocation. The only results published on rubybench run on production level bare metal servers.


I am very excited about this project and hope it will help us all have a more awesome and faster Ruby!

A huge thank you goes to Alan for bootstrapping this project and getting it to a publishable state!

Debugging memory leaks in Ruby


At some point in the life of every Rails developer you are bound to hit a memory leak. It may be a tiny amount of constant memory growth, or a spurt of growth that hits you on the job queue when certain jobs run.

Sadly, most Ruby devs out there simply employ monit, inspeqtor or unicorn worker killers. This allows you to move along and do more important work, tidily sweeping the problem snugly under the carpet.

Unfortunately, this approach leads to quite a few bad side effects. Besides performance pains, instability and larger memory requirements it also leaves the community with a general lack of confidence in Ruby. Monitoring and restarting processes is a very important tool for your arsenal, but at best it is a stopgap and safeguard. It is not a solution.

We have some amazing tools out there for attacking and resolving memory leaks, especially the easy ones - managed memory leaks.

Are we leaking?

The first and most important step for dealing with memory problems is graphing memory over time. At Discourse we use a combo of Graphite, statsd and Grafana to graph application metrics.

A while back I packaged up a Docker image for this work which is pretty close to what we are running today. If rolling your own is not your thing you could look at New Relic, Datadog or any other cloud based metric provider. The key metric you first need to track is RSS for your key Ruby processes. At Discourse we look at max RSS for Unicorn, our web server, and Sidekiq, our job queue.

Discourse is deployed on multiple machines in multiple Docker containers. We use a custom built Docker container to watch all the other Docker containers. This container is launched with access to the Docker socket so it can interrogate Docker about Docker. It uses docker exec to get all sorts of information about the processes running inside the container.
Note: Discourse uses the unicorn master process to launch multiple workers and job queues; it is impossible to achieve the same setup (which shares memory among forks) in a one container per process world.

With this information at hand we can easily graph RSS for any container on any machine and look at trends:

Long term graphs are critical to all memory leak analysis. They allow us to see when an issue started. They allow us to see the rate of memory growth and shape of memory growth. Is it erratic? Is it correlated to a job running?

When dealing with memory leaks in c-extensions, having this information is critical. Isolating c-extension memory leaks often involves valgrind and custom compiled versions of Ruby that support debugging with valgrind. It is tremendously hard work we only want to deal with as a last resort. It is much simpler to isolate that a trend started after upgrading EventMachine to version 1.0.5.

Managed memory leaks

Unlike unmanaged memory leaks, tackling managed leaks is very straightforward. The new tooling in Ruby 2.1 and up makes debugging these leaks a breeze.

Prior to Ruby 2.1 the best we could do was crawl our object space, grab a snapshot, wait a bit, grab a second snapshot and compare. I have a basic implementation of this shipped with Discourse in MemoryDiagnostics; it is rather tricky to get this to work right. You have to fork your process when gathering the snapshots so you do not interfere with your process, and the information you can glean is fairly basic. We can tell certain objects leaked, but can not tell where they were allocated:

3377 objects have leaked
Summary:
String: 1703
Bignum: 1674

Sample Items:
Bignum: 40 bytes
Bignum: 40 bytes
String: 
Bignum: 40 bytes
String: 
Bignum: 40 bytes
String:

If we were lucky we would have leaked a number or string that is revealing and that would be enough to nail it down.

Additionally we have GC.stat that could tell us how many live objects we have and so on.

The information was very limited. We can tell we have a leak quite clearly, but finding the reason can be very difficult.

Note: a very interesting metric to graph is GC.stat[:heap_live_slots]; with that information at hand we can easily tell if we have a managed object leak.
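
A minimal sketch of graphing that metric, assuming the statsd-ruby gem and a locally running statsd (both assumptions; adjust to your own metrics stack):

require 'statsd' # statsd-ruby gem

statsd = Statsd.new('127.0.0.1', 8125)

Thread.new do
  loop do
    # a steadily climbing line here is a strong sign of a managed object leak
    statsd.gauge('ruby.heap_live_slots', GC.stat[:heap_live_slots])
    sleep 60
  end
end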

Managed heap dumping

Ruby 2.1 introduced heap dumping; if you also enable allocation tracing you have some tremendously interesting information.

The process for collecting a useful heap dump is quite simple:

Turn on allocation tracing

require 'objspace'
ObjectSpace.trace_object_allocations_start

This will slow down your process significantly and cause your process to consume more memory. However, it is key to collecting good information and can be turned off later. For my analysis I will usually run it directly after boot.
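
Turning it off later is just the matching call; you can also drop the collected per-object metadata once you are done with it:

ObjectSpace.trace_object_allocations_stop
ObjectSpace.trace_object_allocations_clear # frees the allocation info gathered so far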

When debugging my latest round of memory issues with Sidekiq at Discourse I deployed an extra docker image to a spare server. This allowed me extreme freedom without impacting SLA.

Next, play the waiting game

After memory has clearly leaked (you can look at GC.stat or simply watch RSS to determine this has happened), run:

io=File.open("/tmp/my_dump", "w")
ObjectSpace.dump_all(output: io); 
io.close

Running Ruby in an already running process

To make this process work we need to run Ruby inside a process that has already started.

Luckily, the rbtrace gem allows us to do that (and much more). Moreover it is safe to run in production.

We can easily force Sidekiq to dump its heap with:

bundle exec rbtrace -p $SIDEKIQ_PID -e 'Thread.new{GC.start;require "objspace";io=File.open("/tmp/ruby-heap.dump", "w"); ObjectSpace.dump_all(output: io); io.close}'

rbtrace runs in a restricted context; a nifty trick is breaking out of the trap context with Thread.new

We can also crack information out of the box with rbtrace, for example:

bundle exec rbtrace -p 6744 -e 'GC.stat'
/usr/local/bin/ruby: warning: RUBY_HEAP_MIN_SLOTS is obsolete. Use RUBY_GC_HEAP_INIT_SLOTS instead.
*** attached to process 6744
>> GC.stat
=> {:count=>49, :heap_allocated_pages=>1960, :heap_sorted_length=>1960, :heap_allocatable_pages=>12, :heap_available_slots=>798894, :heap_live_slots=>591531, :heap_free_slots=>207363, :heap_final_slots=>0, :heap_marked_slots=>335775, :heap_swept_slots=>463124, :heap_eden_pages=>1948, :heap_tomb_pages=>12, :total_allocated_pages=>1960, :total_freed_pages=>0, :total_allocated_objects=>13783631, :total_freed_objects=>13192100, :malloc_increase_bytes=>32568600, :malloc_increase_bytes_limit=>33554432, :minor_gc_count=>41, :major_gc_count=>8, :remembered_wb_unprotected_objects=>12175, :remembered_wb_unprotected_objects_limit=>23418, :old_objects=>309750, :old_objects_limit=>618416, :oldmalloc_increase_bytes=>32783288, :oldmalloc_increase_bytes_limit=>44484250}
*** detached from process 6744

Analyzing the heap dump

With a rich heap dump at hand we can start analysis. A first report to run is looking at the count of objects per GC generation.

When trace object allocation is enabled the runtime will attach rich information next to all allocations. For each object that is allocated while tracing is on we have:

  1. The GC generation it was allocated in
  2. The filename and line number it was allocated in
  3. A truncated value
  4. bytesize
  5. ... and much more

The file is in JSON format and can easily be parsed line by line, e.g.:

{"address":"0x7ffc567fbf98", "type":"STRING", "class":"0x7ffc565c4ea0", "frozen":true, "embedded":true, "fstring":true, "bytesize":18, "value":"ensure in dispatch", "file":"/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/activesupport-4.1.9/lib/active_support/dependencies.rb", "line":247, "method":"require", "generation":7, "memsize":40, "flags":{"wb_protected":true, "old":true, "long_lived":true, "marked":true}}

A simple report that shows how many objects are left around from which GC generation is a great start for looking at memory leaks. It's a timeline of object leakage.

require 'json'
class Analyzer
  def initialize(filename)
    @filename = filename
  end

  def analyze
    data = []
    File.open(@filename) do |f|
        f.each_line do |line|
          data << (parsed=JSON.parse(line))
        end
    end

    data.group_by{|row| row["generation"]}
        .sort{|a,b| a[0].to_i <=> b[0].to_i}
        .each do |k,v|
          puts "generation #{k} objects #{v.count}"
        end
  end
end

Analyzer.new(ARGV[0]).analyze

For example this is what I started with:

generation  objects 334181
generation 7 objects 6629
generation 8 objects 38383
generation 9 objects 2220
generation 10 objects 208
generation 11 objects 110
generation 12 objects 489
generation 13 objects 505
generation 14 objects 1297
generation 15 objects 638
generation 16 objects 748
generation 17 objects 1023
generation 18 objects 805
generation 19 objects 407
generation 20 objects 126
generation 21 objects 1708
generation 22 objects 369
...

We expect a large number of objects to be retained after boot and sporadically when requiring new dependencies. However we do not expect a consistent amount of objects to be allocated and never cleaned up. So let's zoom into a particular generation:

require 'json'
class Analyzer
  def initialize(filename)
    @filename = filename
  end

  def analyze
    data = []
    File.open(@filename) do |f|
        f.each_line do |line|
          parsed=JSON.parse(line)
          data << parsed if parsed["generation"] == 19
        end
    end
    data.group_by{|row| "#{row["file"]}:#{row["line"]}"}
        .sort{|a,b| b[1].count <=> a[1].count}
        .each do |k,v|
          puts "#{k} * #{v.count}"
        end
  end
end

Analyzer.new(ARGV[0]).analyze

generation 19 objects 407
/usr/local/lib/ruby/2.2.0/weakref.rb:87 * 144
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:21 * 72
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:42 * 72
/var/www/discourse/lib/freedom_patches/translate_accelerator.rb:65 * 15
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/i18n-0.7.0/lib/i18n/interpolate/ruby.rb:21 * 15
/var/www/discourse/lib/email/message_builder.rb:85 * 9
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/actionview-4.1.9/lib/action_view/template.rb:297 * 6
/var/www/discourse/lib/email/message_builder.rb:36 * 6
/var/www/discourse/lib/email/message_builder.rb:89 * 6
/var/www/discourse/lib/email/message_builder.rb:46 * 6
/var/www/discourse/lib/email/message_builder.rb:66 * 6
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/activerecord-4.1.9/lib/active_record/connection_adapters/postgresql_adapter.rb:515 * 5

Furthermore, we can chase down the reference path to see who is holding the references to the various objects and rebuild object graphs.
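
For example, each line of the dump carries a references array, so a crude way of finding who is holding on to a suspect object is to scan the dump for its address (the address below is just a placeholder):

require 'json'

target = "0x7ffc567fbf98" # address of a leaked object, taken from an earlier pass over the dump

File.foreach("/tmp/ruby-heap.dump") do |line|
  row = JSON.parse(line)
  next unless (row["references"] || []).include?(target)
  puts "referenced by #{row['type']} at #{row['address']} (#{row['file']}:#{row['line']})"
end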

The first thing I attacked in this particular case was code I wrote, which is a monkey patch to Rails localization.

Why do we monkey patch Rails localization?

At Discourse we patch the Rails localization subsystem for 2 reasons:

  1. Early on we discovered it was slow and needed better performance
  2. Recently we started accumulating an enormous amount of translations and needed to ensure we only load translations on demand to keep memory usage lower. (this saves us 20MB of RSS)

Consider the following bench:

ENV['RAILS_ENV'] = 'production'
require 'benchmark/ips'

require File.expand_path("../../config/environment", __FILE__)

Benchmark.ips do |b|
  b.report do |times|
    i = -1
    I18n.t('posts') while (i+=1) < times
  end
end

Before monkey patch:

sam@ubuntu discourse % ruby bench.rb
Calculating -------------------------------------
                         4.518k i/100ms
-------------------------------------------------
                        121.230k (±11.0%) i/s -    600.894k

After monkey patch:

sam@ubuntu discourse % ruby bench.rb
Calculating -------------------------------------
                        22.509k i/100ms
-------------------------------------------------
                        464.295k (±10.4%) i/s -      2.296M

So our localization system is running almost 4 times faster, but ... it is leaking memory.

Reviewing the code I could see that the offending line is:

Turns out, we were sending in a hash from the email message builder that included ActiveRecord objects; this was later used as a key for the cache, and the cache was allowing for 2000 entries. Considering that each entry could involve a large number of AR objects, memory leakage was very high.

To mitigate, I changed the keying strategy, shrunk the cache and completely bypassed it for complex localizations:

One day later when looking at memory graphs we can easily see the impact of this change:

This clearly did not stop the leak but it definitely slowed it down.

therubyracer is leaking

At the top of our list we can see our JavaScript engine therubyracer is leaking lots of objects, in particular we can see the weak references it uses to maintain Ruby to JavaScript mappings are being kept around for way too long.

To maintain performance at Discourse we keep a JavaScript engine context around for turning Markdown into HTML. This engine is rather expensive to boot up, so we keep it in memory plugging in new variables as we bake posts.

Since our code is fairly isolated a repro is trivial. First, let's see how many objects we are leaking using the memory_profiler gem:

ENV['RAILS_ENV'] = 'production'
require 'memory_profiler'
require File.expand_path("../../config/environment", __FILE__)

# warmup
3.times{PrettyText.cook("hello world")}

MemoryProfiler.report do
  50.times{PrettyText.cook("hello world")}
end.pretty_print

At the top of our report we see:

retained objects by location
-----------------------------------
       901  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/2.1.0/weakref.rb:87
       901  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:21
       600  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:42
       250  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/context.rb:97
        50  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/object.rb:8

So we are leaking 54 or so objects each time we cook a post; this adds up really fast. We may also be leaking unmanaged memory here, which could compound the problem.

Since we have a line number it is easy to track down the source of the leak:

require 'weakref'
class Ref
  def initialize(object)
    @ref = ::WeakRef.new(object)
  end
  def object
    @ref.__getobj__
  rescue ::WeakRef::RefError
    nil
  end
end

class WeakValueMap
   def initialize
      @values = {}
   end

   def [](key)
      if ref = @values[key]
        ref.object
      end
   end

   def []=(key, value)
     @values[key] = V8::Weak::Ref.new(value)
   end
end

This WeakValueMap object keeps growing forever and nothing seems to be clearing values from it. The intention with the WeakRef usage was to ensure we allow these objects to be released when no-one holds references. Trouble is that the reference to the wrapper is now kept around for the entire lifetime of the JavaScript context.

A fix is pretty straightforward:

class WeakValueMap
  def initialize
    @values = {}
  end

  def [](key)
    if ref = @values[key]
      ref.object
    end
  end

  def []=(key, value)
    ref = V8::Weak::Ref.new(value)
    ObjectSpace.define_finalizer(value, self.class.ensure_cleanup(@values, key, ref))

    @values[key] = ref
  end

  def self.ensure_cleanup(values,key,ref)
    proc {
      values.delete(key) if values[key] == ref
    }
  end
end

We define a finalizer on the wrapped object that ensures we clean up all these wrapping objects and keep the WeakValueMap small.

The effect is staggering:

ENV['RAILS_ENV'] = 'production'
require 'objspace'
require 'memory_profiler'
require File.expand_path("../../config/environment", __FILE__)

def rss
 `ps -eo pid,rss | grep #{Process.pid} | awk '{print $2}'`.to_i
end

PrettyText.cook("hello world")

# MemoryProfiler has a helper that runs the GC multiple times to make
#  sure all objects that can be freed are freed.
MemoryProfiler::Helpers.full_gc
puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}"

20.times do

  5000.times { |i|
    PrettyText.cook("hello world")
  }
  MemoryProfiler::Helpers.full_gc
  puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}"

end

Before

rss: 137660 live objects 306775
rss: 259888 live objects 570055
rss: 301944 live objects 798467
rss: 332612 live objects 1052167
rss: 349328 live objects 1268447
rss: 411184 live objects 1494003
rss: 454588 live objects 1734071
rss: 451648 live objects 1976027
rss: 467364 live objects 2197295
rss: 536948 live objects 2448667
rss: 600696 live objects 2677959
rss: 613720 live objects 2891671
rss: 622716 live objects 3140339
rss: 624032 live objects 3368979
rss: 640780 live objects 3596883
rss: 641928 live objects 3820451
rss: 640112 live objects 4035747
rss: 722712 live objects 4278779
/home/sam/Source/discourse/lib/pretty_text.rb:185:in `block in markdown': Script Timed Out (PrettyText::JavaScriptError)
	from /home/sam/Source/discourse/lib/pretty_text.rb:350:in `block in protect'
	from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `synchronize'
	from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `protect'
	from /home/sam/Source/discourse/lib/pretty_text.rb:161:in `markdown'
	from /home/sam/Source/discourse/lib/pretty_text.rb:218:in `cook'
	from tmp/mem_leak.rb:30:in `block (2 levels) in <main>'
	from tmp/mem_leak.rb:29:in `times'
	from tmp/mem_leak.rb:29:in `block in <main>'
	from tmp/mem_leak.rb:27:in `times'
	from tmp/mem_leak.rb:27:in `<main>'

After

rss: 137556 live objects 306646
rss: 259576 live objects 314866
rss: 261052 live objects 336258
rss: 268052 live objects 333226
rss: 269516 live objects 327710
rss: 270436 live objects 338718
rss: 269828 live objects 329114
rss: 269064 live objects 325514
rss: 271112 live objects 337218
rss: 271224 live objects 327934
rss: 273624 live objects 343234
rss: 271752 live objects 333038
rss: 270212 live objects 329618
rss: 272004 live objects 340978
rss: 270160 live objects 333350
rss: 271084 live objects 319266
rss: 272012 live objects 339874
rss: 271564 live objects 331226
rss: 270544 live objects 322366
rss: 268480 live objects 333990
rss: 271676 live objects 330654

Looks like memory stabilizes and live object count stabilizes after the fix.

Summary

The new instrumentation in Ruby offers spectacular visibility into the runtime. The tooling built on top of it is improving, but is still rather rough in spots.

As a .NET developer in a previous lifetime I really missed the excellent memory profiler tooling available there; luckily we now have all the information needed to build similar tools for Ruby going forward.

Good luck hunting memory leaks, I hope this information helps you and please think twice next time you deploy a "unicorn OOM killer".

A huge thank you goes to Koichi Sasada and Aman Gupta for building us all this new memory profiling infrastructure.

Commenting powered by Discourse


Leaving comments on this blog requires a certain amount of commitment. You have to jump to another site to log in.

Compare this to the "least amount of friction possible".

A lot of work.

When I made the move to Discourse I thought of this state-of-affairs as a temporary situation. I would add the "traditional" comment box at the bottom and make it super easy to add comments, I would transparently create accounts and all that jazz.

A couple of months in, I am not so sure.

When I think about comments on my blog these are my priorities.

  1. Give users room to type in interesting and insightful comments.
  2. Provide great support for followup (reply by email, email notifications)
  3. Rich markdown editor with edit and preview support.
  4. Comment format must be markdown, anything else and I am risking complex conversion later on.
  5. Zero spam
  6. Comments / emails / content is unconditionally hosted on my server and under my control. Not hosted by some third party under their rules with their advertising injected and my readers tracked.
  7. Trivial for me to moderate.

You may notice that, "Make it super easy for anybody on the Internet to contribute a random unfiltered opinion" is surprisingly missing from this list.

What does Discourse score in my 7 point dream list? A solid 7 out of 7. The extra friction completely eradicated spam and enables me to have rich conversations with my readers. I have their emails, I can communicate with them.

About Spam

I eliminated the vast majority of spam on this blog a while back. I blogged about it 2 years ago. Critics claimed that this approach was doomed to fail if it ever got popular.

Discourse is popular. Yet I get zero spam. By zero I mean that in the last 55 days I got no spam on this blog, nor did I have to delete any spam from this blog.

http://discuss.samsaffron.com, the site that takes comments for this blog, is excluded from Google using a robots.txt rule. Coupled with the built-in immune system Discourse already has, this leaves spamming software very confused: not only would they need to run a full PhantomJS-like engine, they would also need to register accounts and know about the tie discuss.samsaffron.com has to this blog.

Too much work, so no Rolex for me. This sucks cause I really want a v8gra branded Rolex watch.

Do I care about all the missing comments I am not getting?

My priority has shifted strongly, I would prefer to engage in interesting conversations as opposed to collecting a massive collage of "great job +1" comments. My previous blog post is a great example. I have room to expand points, add code samples and so on. I have the confidence that my replies will be read by the people who asked for the extra info.

I often hear people say "just disable comments on your blog" as a general solution for low quality and spam. I feel I have found a different way here, and I love it.

I am not sure I will bring back the trivial "add comment" text box.


Call to Action: Long running Ruby benchmark


I would love a long running Ruby and Rails set of benchmarks. I talked about this at GoGaRuCo and would like to follow up.

For a very long time Python has had the PyPy speed center: http://speed.pypy.org

Recently, golang has added its own: http://goperfd.appspot.com/perf

Why is this so important?

Writing fast software requires data. We need to know right away when our framework or platform is getting slower or faster. This information can be fed directly to the team informing them of big wins and losses. Often small changes can lead to unexpected gains or losses.

Finding out about regressions months into the development cycle can incur a massive cost; fixing bugs early on is cheap, and the longer we wait the more expensive the fix becomes.


(Chart: the cost of fixing defects rises the later in the development cycle they are found. Source: NASA)

Imagine if for every Rails major release the team could announce not only that it is N% faster in certain areas but also attribute the improvements to particular commits.

Imagine if we could have a wide picture of the performance gains and losses of new Ruby versions, with full context on why something slowed down or sped up.

What do we have today?

We already have a fair amount.

The Discourse benchmark can be used to benchmark a "real world" app, it integrates the entire system. The other small microbenchmarks can be used to bench specific features and areas.

The importance of stable dedicated hardware

The server provisioned by Ninefold is a bare metal server. Once we ensure power management is disabled it can provide consistent results, unlike virtual hosts, which often depend on external factors. We also need to reproduce results on multiple sets of hardware to ensure we did not mess up our harness.
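
On Linux, for example, pinning the CPU frequency governor is one simple way to remove a source of noise (a sketch; the exact tooling varies by distribution and assumes the cpupower utility is installed):

% sudo cpupower frequency-set -g performance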

Tracking all metrics

There is a sea of metrics we can gather: GC times, Rails boot times, memory usage, page load times, RSS, requests per second and so on. We don't need to be shoehorned into tiny micro benchmarks; when tracking performance we need to focus on both the narrow and the wide.

Often performance is a game of trade-offs, you make some areas slower so some other, more important areas, become faster.

Raw execution speed, memory usage and disk usage all matter. Our performance can depend on memory allocators (like jemalloc or tcmalloc) and GC tuning, so we can measure some specific best practice environments when gathering stats.
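
To give a flavour of how cheap some of these numbers are to collect, here is a rough sketch (not the actual harness; it assumes you run it from a Rails app root on Linux, where ps reports RSS in KB):

require 'benchmark'

def rss_kb
  # resident set size of the current process in KB, via ps (Linux)
  `ps -o rss= -p #{Process.pid}`.to_i
end

boot_time = Benchmark.realtime { require './config/environment' }
stat = GC.stat

puts "boot: #{boot_time.round(2)}s rss: #{rss_kb}KB " \
     "minor GCs: #{stat[:minor_gc_count]} major GCs: #{stat[:major_gc_count]}"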

Graphing some imperfect data

Koichi Sasada, Aman Gupta and others have been working tirelessly to improve the Ruby GC over the last few months.

I ran the Discourse bench and a few other tests on 15 or so builds. The data is imperfect and needs to be reproduced; that said, it is a starting point.

Here we can see a graph showing how the "long living" object count has reduced from build to build; somewhere between the end of November and December there was a huge decrease.

Here I am graphing the median time for a homepage request on Discourse over a bunch of builds.

There are two interesting jumps: the first was a big slowdown when RGenGC was introduced. Later, in mid November, we recovered the performance, but it has regressed since.

Here it is clear to see the massive improvement the generational GC provided to the 75th percentile. Ruby 2.1 is going to be a massive improvement for those not doing any GC tuning.

Similarly Rails boot is much improved.

What we need?

A long term benchmarking project will take quite a while to build; personally, I cannot afford to dedicate more than a few hours per week.

Foremost, we need people. Developers to build a UI, integrate existing tools and add benchmarks. Designers to make a nice site.

Some contributions to related projects could heavily improve the "benchmarking" experience: faster builds and faster gem installs would make a tremendous difference.

The project can be split quite cleanly into 2 sub-projects. First is information gathering, writing the scripts needed to collect all the historical data into a database of sorts. The second part is a web UI to present the results potentially reusing or extending https://github.com/tobami/codespeed .

More hardware and a nice domain name would also be awesome.

If you want to work on this, helping build the UI or the frameworks for running long term benchmarks, contact me (either here or at sam.saffron<at>gmail.com).

Building a long term benchmark is critical for longevity and health of Ruby and Rails. I really hope we can get started on this soon. I wish I had more time to invest in this.

Update 17-12-2013

We started gathering info and people at http://community.miniprofiler.com/t/ruby-bench-intros/185; feel free to post an intro there or create ruby-bench topics.

Vintage JavaScript begone


The Problem

These days all the cool kids are building single page web applications with Ember.JS, Angular, Meteor or some other framework. If you deploy often, like we do at Discourse, you have a problem.

How can you get everyone to run the latest version of your JavaScript and CSS bundles? Since people do not reload full pages, and instead navigate around accumulating small JSON payloads, there is a strong possibility they are running old versions and experiencing weird and wonderful bugs.

The message bus

One BIG criticism I have heard of Rails and Django lately is the lack of "realtime" support. This is an issue we foresaw over a year ago at Discourse.

Traditionally, people add more components to a Rails system to support "realtime" notifications: Ruby built systems like faye, non-Ruby systems like Node.JS with socket.io, or outsourced systems like Pusher. For Discourse none of these were an option; we could not afford to complicate the setup process or outsource this stuff to a third party.

I built the message_bus gem to provide us with an engine for realtime updates:

https://github.com/SamSaffron/message_bus

At its core, message_bus gives you a very simple API to publish messages from Ruby and subscribe to them on the client:

# in ruby
MessageBus.publish('/my_channel', 'hello')

<!-- client side -->
<script src="message-bus.js" type="text/javascript"></script>
<script>
MessageBus.subscribe('/my_channel', function(data){
  alert(data);
});
</script>

Behind this trivial API hides a fairly huge amount of feature goodness:

  1. This thing scales really well; clients "pull" information from a reliable pub/sub channel, with minimal per-client housekeeping.
  2. Built in security (send messages to particular users or groups only)
  3. Built on rack hijack and thin async, so we support passenger, thin, unicorn and puma.
  4. Uses long polling with an EventMachine event loop, and can easily service thousands of clients from a single web server.
  5. Built in multi-site support (for hosting sub domains)

We use this system at Discourse to notify you of interesting things, update topic pages live and so on.
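
For example, the targeted delivery mentioned above looks roughly like this (a sketch; the user_ids option is how I recall the message_bus API, check its README for the exact signature):

# only the listed users will have this message delivered to them
MessageBus.publish('/notifications', { unread_count: 5 }, user_ids: [1, 2, 3])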

On to the implementation

Given a message_bus, implementing a system for updating assets is fairly straight forward.

On the Rails side we calculate a digest that represents our application version.

def assets_digest
  @assets_digest ||= Digest::MD5.hexdigest(
    ActionView::Base.assets_manifest.assets.values.sort.join
  )
end

def notify_clients_if_needed
  # global channel is special, it goes to all sites
  channel = "/global/asset-version"
  message = MessageBus.last_message(channel)

  unless message && message.data == assets_digest
    MessageBus.publish channel, assets_digest
  end
end

With every full page we deliver to the clients we include this magic digest:

Discourse.set('assetVersion','<%= Discourse.assets_digest %>');

Then on the client side we listen for version changes:

Discourse.MessageBus.subscribe("/global/asset-version", function(version){
  Discourse.set("assetVersion",version);

  if(Discourse.get("requiresRefresh")) {
    // since we can do this transparently for people browsing the forum
    //  hold back the message a couple of hours
    setTimeout(function() {
      bootbox.confirm(I18n.lookup("assets_changed_confirm"), function(){
        document.location.reload();
      });
    }, 1000 * 60 * 120);
  }
});

Finally, we hook into the transition to new routes to force a refresh if we detected assets have changed:

routeTo: function(path) {

  if(Discourse.get("requiresRefresh")){
    document.location.href = path;
    return;
  }
  //...
}

Since in the majority of spots we render "full links" and pass them through this magic method, this works well. Recently Robin added a second mechanism that allows us to trap every transition; however, it would require a double load, which I wanted to avoid.

E.g., the following would also work:

Discourse.PageTracker.current().on('change', function() {
  if(Discourse.get("requiresRefresh")){
    document.location.reload();
    return;
  }
});

Summary

I agree that every single page application needs some sort of message bus; you can have one today, on Rails, if you start using message_bus.

Real-time is not holding us back with Rails, it is production ready.

Ruby 2.1 Garbage Collection: ready for production


The article "Ruby Garbage Collection: Still Not Ready for Production" has been making the rounds.

In it we learned that our GC algorithm is flawed and were prescribed some rather drastic and dangerous workarounds.

At the core it had one big demonstration:

Run this on Ruby 2.1.1 and you will be out of memory soon:

while true
  "a" * (1024 ** 2)
end

Malloc limits, Ruby and you

From very early versions of Ruby we always tracked memory allocation. This is why I found FUD comments such as this troubling:

the issue is that the Ruby GC is triggered on total number of objects, and not total amount of used memory

This is a clear misunderstanding of how Ruby works. In fact, the aforementioned article fails to mention that memory allocation may trigger a GC at all.

Historically Ruby was quite conservative issuing GCs based on the amount of memory allocated. Ruby keeps track of all memory allocated (using malloc) outside of the Ruby heaps between GCs. In Ruby 2.0, out-of-the-box every 8MB of allocations will result in a full GC. This number is way too small for almost any Rails app, which is why increasing RUBY_GC_MALLOC_LIMIT is one of the most cargo culted settings out there in the wild.

Matz picked this tiny number years ago when it was a reasonable default, however it was not revised till Ruby 2.1 landed.

For Ruby 2.1 Koichi decided to revamp this sub-system. The goal was to have defaults that work well for both scripts and web apps.

Instead of having a single malloc limit for our app, we now have a starting point malloc limit that will dynamically grow every time we trigger a GC by exceeding the limit. To stop unbound growth of the limit we have max values set.
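
As a rough illustration of the growth behaviour using the default values listed a little further down (this is just arithmetic, not the actual MRI implementation):

# sketch: the limit grows 1.4x on each malloc-triggered GC until it hits the max
limit = 16 * 1024 ** 2   # RUBY_GC_MALLOC_LIMIT
max   = 32 * 1024 ** 2   # RUBY_GC_MALLOC_LIMIT_MAX

4.times do
  limit = [(limit * 1.4).to_i, max].min
  puts limit
end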

We track memory allocations from 2 points in time:

  • memory allocated outside Ruby heaps since last minor GC
  • memory allocated since last major GC.

At any point in time we can get a snapshot of the current situation with GC.stat:

> GC.stat
=> {:count=>25,
 :heap_used=>263,
 :heap_length=>406,
 :heap_increment=>143,
 :heap_live_slot=>106806,
 :heap_free_slot=>398,
 :heap_final_slot=>0,
 :heap_swept_slot=>25258,
 :heap_eden_page_length=>263,
 :heap_tomb_page_length=>0,
 :total_allocated_object=>620998,
 :total_freed_object=>514192,
 :malloc_increase=>1572992,
 :malloc_limit=>16777216,
 :minor_gc_count=>21,
 :major_gc_count=>4,
 :remembered_shady_object=>1233,
 :remembered_shady_object_limit=>1694,
 :old_object=>65229,
 :old_object_limit=>93260,
 :oldmalloc_increase=>2298872,
 :oldmalloc_limit=>16777216}

malloc_increase denotes the amount of memory we allocated since the last minor GC. oldmalloc_increase the amount since last major GC.

We can tune our settings, from "Ruby 2.1 Out-of-Band GC":

RUBY_GC_MALLOC_LIMIT: (default: 16MB)
RUBY_GC_MALLOC_LIMIT_MAX: (default: 32MB)
RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR: (default: 1.4x)

and

RUBY_GC_OLDMALLOC_LIMIT: (default: 16MB)
RUBY_GC_OLDMALLOC_LIMIT_MAX: (default: 128MB)
RUBY_GC_OLDMALLOC_LIMIT_GROWTH_FACTOR: (default: 1.2x)

So, in theory, this unbound memory growth is not possible for the script above. The two MAX values should just cap the growth and force GCs.

However, this is not the case in Ruby 2.1.1

Investigating the issue

We spent a lot of time ensuring we had extensive instrumentation built into Ruby 2.1: we added memory profiling hooks, we added GC hooks, and we exposed a large amount of internal information. This has certainly paid off.

Analyzing the issue raised by this mini script is trivial using the gc_tracer gem. This gem allows us to get a very detailed snapshot of the system every time a GC is triggered and store it in a text file, easily consumable by a spreadsheet.

We simply add this to the rogue script:

require 'gc_tracer'
GC::Tracer.start_logging("log.txt")

And get a very detailed trace back in the text file.

In the trace we can see minor GCs being triggered by exceeding the malloc limit (where major_by is 0) and major GCs being triggered by exceeding the oldmalloc limit. We can see our malloc limit and old malloc limit growing. We can see when GC starts and ends, and lots more.

Trouble is, the limits for both oldmalloc and malloc grow well beyond the max values we have defined.

So, bottom line: it looks like we have a straight-out bug.

https://bugs.ruby-lang.org/issues/9687

It is a one line bug that will be patched in Ruby 2.1.2 and is already fixed in master.

Are you affected by this bug?

It is possible your production app on Ruby 2.1.1 is impacted by this. The simplest way to find out is to issue a GC.stat as soon as memory usage is really high.

The script above is very aggressive and triggers the pathological issue; it is quite possible you are not even pushing against the malloc limits. The only way to find out is to measure.
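
For example, something along these lines run in a console attached to the bloated process will show whether the malloc counters are anywhere near (or past) their limits:

stat = GC.stat
puts "malloc_increase:    #{stat[:malloc_increase]} (limit #{stat[:malloc_limit]})"
puts "oldmalloc_increase: #{stat[:oldmalloc_increase]} (limit #{stat[:oldmalloc_limit]})"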

General memory growth under Ruby 2.1.1

A more complicated issue we need to tackle is the more common "memory doubling" issue under Ruby 2.1.1. The general complaint goes something along the lines of "I just upgraded Ruby and now my RSS has doubled".

This issue is described in detail here: https://bugs.ruby-lang.org/issues/9607

Memory usage growth is partly unavoidable when employing a generational GC. A certain section of the heap is getting scanned far less often. It's a performance/memory trade-off. That said, the algorithm used in 2.1 is a bit too simplistic.

If an object survives a minor GC it will be flagged as oldgen; these objects will only be scanned during a major GC. This algorithm is particularly problematic for web applications.

Web applications perform a large amount of "medium" lived memory allocations. A large number of objects are needed for the lifetime of a web request. If a minor GC hits in the middle of a web request we will "promote" a bunch of objects to the "long lived" oldgen even though they will no longer be needed at the end of the request.

This has a few bad side effects,

  1. It forces major GC to run more often (growth of oldgen is a trigger for running a major GC)
  2. It forces the oldgen heaps to grow beyond what we need.
  3. A bunch of memory is retained when it is clearly not needed.

.NET and Java employ 3 generations to overcome this issue. Survivors in Gen 0 collections are promoted to Gen 1 and so on.

Koichi is planning on refining the current algorithm to employ a somewhat similar technique of deferred promotion. Instead of being promoted to oldgen on its first minor GC, an object will have to survive two minor GCs to be promoted. This means that if no more than one minor GC runs during a request, our heaps will be able to stay at optimal sizes. This work is already prototyped in Ruby 2.1; see RGENGC_THREEGEN in gc.c (note, the name is likely to change). It is slotted to be released in Ruby 2.2.

We can see this problem in action using this somewhat simplistic test:

```
@retained = []
@rand = Random.new(999)

MAX_STRING_SIZE = 100

def stress(allocate_count, retain_count, chunk_size)
  chunk = []
  while retain_count > 0 || allocate_count > 0
    if retain_count == 0 || (@rand.rand < 0.5 && allocate_count > 0)
      chunk << " " * (@rand.rand * MAX_STRING_SIZE).to_i
      allocate_count -= 1
      if chunk.length > chunk_size
        chunk = []
      end
    else
      @retained << " " * (@rand.rand * MAX_STRING_SIZE).to_i
      retain_count -= 1
    end
  end
end

start = Time.now
# simulate rails boot, 2M objects allocated 600K retained in memory
stress(2_000_000, 600_000, 200_000)

# simulate 100 requests that allocate 100K objects
stress(10_000_000, 0, 100_000)


puts "Duration: #{(Time.now - start).to_f}"

puts "RSS: #{`ps -eo rss,pid | grep #{Process.pid} | grep -v grep | awk '{ print $1;  }'`}"
```

In Ruby 2.0 we get:

```
% ruby stress.rb
Duration: 10.074556277
RSS: 122784
```

In Ruby 2.1.1 we get:

```
% ruby stress.rb
Duration: 7.031792076
RSS: 236244
```

Performance has improved, but memory almost doubled.

To mitigate the current pain point we can use the new `RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR` environment var.

Out of the box we trigger a major gc if our oldobject count doubles. We can tune this down to say `1.3` times and see a significant improvement memory wise:

```
% RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.3 ruby stress.rb
Duration: 6.85115156
RSS: 184928
```

On memory constrained machines we can go even further and disable generational GC altogether.

```
% RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 ruby stress.rb
Duration: 6.759709765
RSS: 149728
```

We can always add jemalloc for good measure to shave off an extra 10% or so:

```
LD_PRELOAD=/home/sam/Source/jemalloc-3.5.0/lib/libjemalloc.so RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 ruby stress.rb
Duration: 6.204024629
RSS: 144440
```

If that is still not enough you can push malloc limits down (and have more GCs run due to hitting them):

```
% RUBY_GC_MALLOC_LIMIT_MAX=8000000 RUBY_GC_OLDMALLOC_LIMIT_MAX=8000000  LD_PRELOAD=/home/sam/Source/jemalloc-3.5.0/lib/libjemalloc.so RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 ruby stress.rb
Duration: 9.02354988
RSS: 120668
```

Which is nice, since we are back to Ruby 2.0 memory numbers, though we lost a pile of performance.

Ruby 2.1 is ready for production

Ruby 2.1 has been running in production at GitHub for a few months with great success. The 2.1.0 release was a little rough; 2.1.1 addresses the majority of the big issues it had, and 2.1.2 will address the malloc issue, which may or may not affect you.

If you are considering deploying Ruby 2.1 I would strongly urge giving GitHub Ruby a go, since it contains a fairly drastic performance boost due to funny-falcon's excellent method cache patch.

Performance is much improved at the cost of memory; that said, you can tune memory as needed and measure the impact of various settings effectively.

Summary

  • If you discover any issues, please report them on https://bugs.ruby-lang.org/
  • Use Ruby 2.1.1 in production, upgrade to 2.1.2 as soon as it is released
  • Be sure to look at jemalloc and GC tuning for memory constrained systems. See also: https://bugs.ruby-lang.org/issues/9113
  • Always be measuring. If you are seeing issues, run GC.stat; you can attach to the rogue process using rbtrace, a gem you should consider including on production systems (a sketch follows this list).
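
A minimal sketch of the rbtrace approach, assuming the gem is already required in the running process (check the rbtrace README for the exact flags):

% rbtrace -p <pid> -e 'GC.stat'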


Speeding up Rails 4.2


Recently Godfrey Chan got Discourse working on Rails Master. It was a rather long task that involved some changes to Discourse internals and some changes to Rails internals.

Knowing Rails 4.2 was just around the corner, it seemed like the perfect time to see where performance was at. Seeing Rails 4.2 contains the Adequate Record patches, I was super eager to see how Discourse fared.

The Discourse benchmark

Answering the question of "how fast is Ruby?" or "how fast is Rails?" is something people usually answer with the following set of steps.

  1. Upgrade production to a certain version of Rails
  2. Look at New Relic reports
  3. Depending on how performance looks, either complain on Twitter and/or blog, or write a few kind words.

The trouble is that after release is the absolute worst time for performance testing. Code is in production, fixing stuff is tricky, and rolling back is often impractical.

At Discourse I developed the Discourse benchmark. It is a simple script that loads up a few thousand users and topics and then proceeds to measure performance of various hot spots using apache bench.

I have found that the results of the benchmark are strikingly similar to real-world performance we see. If the benchmark is faster, it is very likely that it will also be faster in production. We use this benchmark to test Ruby before major Ruby releases. Koichi used this benchmark to help optimise the current GC algorithms in MRI.

It is often tempting to look at micro benchmarks when working on performance issues. Micro benchmarks are a great tool, but MUST be followed with bigger macro benchmarks to see the real impact. A 1000% speedup for a routine that is called once a day in a background job has significantly less impact than a 0.5% improvement to a routine that is called on every web request.
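
A quick back-of-envelope calculation (with made up but plausible numbers) makes the point:

# a 10 second daily background job sped up 10x saves ~9 seconds a day
daily_job_saving = 10.0 - (10.0 / 10)      # => 9.0 seconds/day

# shaving 0.5% off a 50ms action served 1,000,000 times a day saves ~250 seconds a day
request_saving = 0.050 * 0.005 * 1_000_000 # => 250.0 seconds/day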

How was Rails 4.2 Looking?

For over a year now, Discourse has had the ability to dual boot between Rails versions. This allowed me to quickly run a simple benchmark to see where we were at (arel, rails):

% RAILS_MASTER=1 ruby script/bench.rb

name:
  [percentile]: [time]

categories_admin:
  50: 123
  75: 136
  90: 147
  99: 224
home_admin:
  50: 111
  75: 122
  90: 147
  99: 224
topic_admin:
  50: 69
  75: 81
  90: 89
  99: 185
categories:
  50: 81
  75: 126
  90: 138
  99: 211
home:
  50: 53
  75: 63
  90: 100
  99: 187
topic:
  50: 20
  75: 21
  90: 23
  99: 77
timings:
  load_rails: 3052
ruby-version: 2.1.2-p95
rss_kb: 241604

VS.

categories_admin:
  50: 62
  75: 66
  90: 77
  99: 193
home_admin:
  50: 51
  75: 53
  90: 55
  99: 175
topic_admin:
  50: 27
  75: 27
  90: 28
  99: 87
categories:
  50: 53
  75: 55
  90: 98
  99: 173
home:
  50: 35
  75: 37
  90: 55
  99: 154
topic:
  50: 12
  75: 12
  90: 13
  99: 66
timings:
  load_rails: 3047
ruby-version: 2.1.2-p95
rss_kb: 263948

Across the board we were running at around half speed. Pages that were taking 12ms today were taking 20ms on Rails 4.2.

Given these results I decided to take some time off my regular work to try to isolate what went wrong, contribute a few patches and help the Rails team correct the issues prior to the impending RC release.

Before I continue I think it is worth mentioning that the lion's share of the performance optimisations we ended up adding were authored by Sean Griffin so a big huge thanks to him.

Cracking open the black box

I find flame graphs are instrumental at quickly discovering what is going on. Flame graphs in Ruby 2.1 now have better fidelity than they did in 2.0 thanks to feeding directly off stackprof.

My methodology is quite simple.

  1. Go to a page in Rails 4.1.8 mode, reload a few times, then tack on ?pp=flamegraph to see a flamegraph
  2. Repeat the process on Rails 4.2
  3. Zoom in on similar area and look for performance regressions.

For example here is the before graph on Rails 4.1.8 (home page)

rails_baseline.html (214.9 KB)

Compared to the after graph on Rails Master (home page)

rails_master_v1.html (253.6 KB)

The first point of interest is the sample count in each flamegraph (Rails 4.1.8 vs. master).

Flamegraph attempts to grab a sample every 0.5ms of wall time; the reality is that due to stack depth, timers and so on, you end up getting a full sample every 0.5ms - 2ms. That said, we can use the sample counts as a good rough estimate of execution time.

It looks like Rails master is taking up 108 samples where current Rails was taking 74, a 46% increase in the above example.

We can see a huge amount of the additional overhead is spent in arel.

Next we zoom in on similar areas in each flamegraph (Rails 4.1.8 vs. master) and compare.

We can see expensive calls to reorder_bind_params that we were not seeing in Rails 4.1.8.

Clicking on the frame reveals that 14.5% of our cost is in this method.

Additionally we see other new areas of overhead, like a method_missing being invoked when casting DB values.

We repeat the process for various pages in our app, and discover that 3.2% of the time on the categories page is spent in ActiveRecord::AttributeMethods::PrimaryKey::ClassMethods#primary_key and a pile of other little cuts.

With this information we can start working on various fixes:

There are a lot of little cuts and some big cuts; here are some examples.

The list goes on; slowly but surely we regain performance.

Memory profiling

Another very important tool in our arsenal is memory_profiler. If we include it in our Gemfile, rack-mini-profiler will wire it up for us; appending pp=profile-gc-ruby-head to any page then gives us a memory report (note this requires Ruby 2.1, as it provides the new APIs needed).
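
The wiring amounts to a couple of Gemfile entries (a sketch):

# Gemfile
gem 'rack-mini-profiler'
gem 'memory_profiler'  # enables the pp=profile-gc-ruby-head report (Ruby 2.1+)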

The report reveals a huge allocation increase: Rails master was allocating 131314 objects on the front page vs 17651 objects in Rails 4.1.8.

We can dig down and discover a huge addition of allocations in arel and activerecord

arel/lib x 107607
activerecord/lib x 12778

We can see the vast majority is being allocated at:

arel/lib/arel/visitors/visitor.rb x 106435

We can track it down to 3 particular lines:

arel/lib/arel/visitors/visitor.rb:11 x 74432
arel/lib/arel/visitors/visitor.rb:13 x 22880
arel/lib/arel/visitors/visitor.rb:12 x 7040

This kind of analysis helps us track down and fix specific issues.

It is very important to be able to prove you have a real issue before applying optimisations blindly. Memory profiler allows us to pinpoint specifics and demonstrate the actual gain.

It is fascinating to see which particular strings are allocated where. For example the empty string is allocated 702 times during a front page load:

"" x 702
    activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb:433 x 263
    activerecord/lib/active_record/type/boolean.rb:11 x 153
    omniauth-1.2.2/lib/omniauth/strategy.rb:396 x 84

We can work through this information and, paper cut by paper cut, eliminate a large number of allocations, resulting in better performance.

Where are we with Rails 4.2 RC1

categories_admin:
  50: 69
  75: 74
  90: 86
  99: 165
home_admin:
  50: 55
  75: 57
  90: 66
  99: 133
topic_admin:
  50: 27
  75: 27
  90: 28
  99: 108
categories:
  50: 56
  75: 58
  90: 60
  99: 192
home:
  50: 39
  75: 40
  90: 42
  99: 165
topic:
  50: 13
  75: 13
  90: 14
  99: 75

So it looks like we are more or less 10% slower than 4.1, which is a huge improvement over where we were a couple of weeks ago (considering we were almost 100% slower).

There is still plenty of work left to do, but we are on the right path.

A taste of things to come

Lars Kanis, a maintainer of the pg gem, has been working for a while now on a system for natively casting values in the pg gem. Up till now the pg gem has followed libpq quite closely: when you query the DB using libpq you are always provided with strings, and historically the pg gem just echoed them back to the Ruby program.

So for example:

 ActiveRecord::Base.connection.raw_connection.exec("select 1").values[0][0]
=> "1"

The new pg release allows us to get back a number here, as opposed to a string. This helps us avoid allocating, converting and garbage collecting a string, just to get a number from our database. The cost of these conversions adds up quite a lot. It is much cheaper to do conversions in C code and avoid the allocation of objects that the Ruby GC needs to track.

cnn = ActiveRecord::Base.connection.raw_connection
cnn.type_map_for_results = PG::BasicTypeMapForResults.new(cnn)
cnn.exec('select 1').values[0][0]
=> 1

The tricky thing though is that people heavily over-query when using Active Record.

# no need to perform conversions for created_at and updated_at
car_color = Car.first.color

# the above is more efficiently written as
car_color = Car.limit(1).pluck(:color).first

If you perform conversions upfront for columns people never consume, you can end up with a severe performance hit. In particular, converting timestamps is a very expensive operation which involves string parsing.

In my benchmarking it is showing a 5-10% improvement for the Discourse bench which in almost all cases means we are faster than Rails 4.1.

I think it is fairly likely that we can be 10-20% faster than Rails 4.1 by the time Rails 5 ships provided we watch the benchmarks closely and keep optimising.

Side Effects

A very interesting and rewarding side effect of this work was a series of optimisations I was able to apply to Discourse while looking at all the data.

In particular I built us a distributed cache powered by the message_bus gem: a Hash-like structure that is usable from any process running Discourse (Discourse instances can span multiple machines). Using it I was able to apply caching to 3 places that sorely needed it and avoid expensive database work.
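
Usage is hash-like; something along these lines (a sketch from memory, the exact class name and API in Discourse may differ):

cache = DistributedCache.new("category_slugs")

# writes are broadcast over the message_bus to every Discourse process
cache["slug_map"] = Category.pluck(:slug, :id).to_h

# reads are served from the local process
cache["slug_map"]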

Conclusion

The recent performance work strongly highlights the importance of a long running benchmark. There is no single magical method we can fix that will recover a year of various performance regressions. The earlier we pick up on regressions the easier they are to fix.

It also highlights the "paper cut" nature of this kind of work. It is easy to be flippant and say that 1% here and 0.5% there does not matter; however, repeat this enough times and you have a product that is 100% slower. We need to look at trends to see where we are at, and slow down feature work to focus on performance when it regresses.

The Rails team did an admirable job at addressing a lot of the perf regressions we found in very short order. Sean did some truly impressive work. Godfrey did a spectacular job getting Discourse running on Rails master.

However, I am very sad that we do not have a long running benchmark for either Ruby or Rails, and I worry that until we do we are bound to repeat this story again.

Announcing rubybench.org


A year ago I put out a call to action for long running benchmarks.

I got a great response and lots of people lent a hand trying to get a system built. The project itself was quite slow moving; we had some working bits, but nothing really complete to show.

A few months ago Alan Guo Xiang Tan contacted me about the project. In the past few months Alan has been working tirelessly on the project.

Alan rebooted the project and built a Docker backend for running our performance tests and web frontend.

The first UI tracks performance across stable releases of Ruby. The goal is to provide you with a good gauge on what will happen when you upgrade Ruby.

  • Will it be faster?
  • Will it consume more or less memory?

The second UI tracks Ruby's progress through the various commits.

  • Did a particular commit improve/impact performance?
  • How is memory doing?

Rubybench is already helping Ruby

In a very short amount of time rubybench was responsible for isolating a regression in the URI library. Our goal is for it to become a tool that the Ruby core team loves using: a tool that automatically finds regressions and allows the core team to feel safer when making performance related changes.

Call for help

We still need lots of help with the project:

  • Design help with home page and logo.
  • Coding help with both the web and the runner.
  • Testing, to confirm all our testing methodology is correct and results reproducible.
  • Come to our forum and make some suggestions or ask some questions.

Call for sponsors

We are in desperate need of a sponsor; we cannot run rubybench without a bare metal server. We started a topic to discuss our requirements.

Contact me at sam.saffron at gmail.com, if you would like to help out!

FAQ

What about JRuby / Rubinius and other implementations?

For our initial release we focused on MRI, however if we have enough hardware and help we would love to include other implementations. We would also love to work with various implementers to ensure the bench suite is both accurate and reproducible.

Why all the focus on Docker?

A guiding goal for the rubybench project is to have repeatable, easy to generate test results. By using Docker we are able to ensure all our tests are repeatable.

Will you run benchmarks on cloud hosting services?

We will not publish CPU results gathered on virtual hosts where we can not control our CPU allocation. The only results published on rubybench run on production level bare metal servers.


I am very excited about this project and hope it will help us all have a more awesome and faster Ruby!

A huge thank you goes to Alan for bootstrapping this project and getting it to a publishable state!
