
Interesting answers on Stack Overflow from July


At “my new job”:http://blog.stackoverflow.com/2010/06/welcome-stack-overflow-valued-associate-00008/ I have been looking at better ways of highlighting recent interesting answers.

Here is a list of 10 answers I came across in July that I found interesting. I tried to keep the list language agnostic.

So here is my top 10 list:

h3. “What do the ‘&=’ and ‘=&’ operators do? [PHP]”:http://stackoverflow.com/questions/3177342/what-do-the-and-operators-do-php/3177365#3177365

Why did PHP decide to have an assign-by-reference construct? Why?

h3. “Defeating a Poker Bot”:http://stackoverflow.com/questions/2717599/defeating-a-poker-bot/3168983#3168983

The most upvoted non-wiki answer on Stack Overflow for July. This answer came in months after the question was asked, and it is really comprehensive. I loved the ideas about shifting all the pixels and throwing in glitches, brilliant.

h3. “What is the _snowman param in Rails 3 forms for?”:http://stackoverflow.com/questions/3222013/what-is-the-snowman-param-in-rails-3-forms-for/3348524#3348524

Rails 3 will ship with a hack to work around oddities with … Internet Explorer and unicode. I love the creativity of the solution, and everybody loves a snowman ☃.

h3. “What is the fastest method for selecting descendant elements in jQuery?”:http://stackoverflow.com/questions/3177763/what-is-the-fastest-method-for-selecting-descendant-elements-in-jquery/3177782#3177782

For those writing javascript answers, “jsfiddle”:http://jsfiddle.net/ can make your answer so much more awesome. I love an answer that puts in the extra effort and measures performance.

h3. “Why < is slower than >=”:http://stackoverflow.com/questions/3369304/why-is-slower-than/3369477#3369477

I love the way Python lets you “disassemble your program”:http://docs.python.org/library/dis.html , wonderful feature.

h3. “C#: what is the difference between i++ and ++i?”:http://stackoverflow.com/questions/3346450/c-what-is-the-difference-between-i-and-i/3346729#3346729

“Eric Lippert”:http://blogs.msdn.com/b/ericlippert/ usually has the definitive answer to any intricate question you have about C#; he also happens to have the most comprehensive answers to the most trivial questions.

h3. “Haskell: How is <*> pronounced?”:http://stackoverflow.com/questions/3242361/haskell-how-is-pronounced/3242853#3242853

This is my favorite answer I came across, wow.

h3. “Scala – how to explicitly choose which overloaded method to use when one arg must be null?”:http://stackoverflow.com/questions/3169082/scala-how-to-explicitly-choose-which-overloaded-method-to-use-when-one-arg-must/3169147#3169147

I am no Scala programmer, however I found Scala’s casting syntax to be a bit prettier than C#’s; I like that it is more concise.

h3. “Modelling a permissions system”:http://stackoverflow.com/questions/3177361/modelling-a-permissions-system/3177578#3177578

At some point we all need to design a permissions system; this is a good summary of the options.

h3. “Are .Net switch statements hashed or indexed?”:http://stackoverflow.com/questions/3366376/are-net-switch-statements-hashed-or-indexed/3366497#3366497

You learn something new every day: the C# compiler does some funky magic when it compiles a “switch” statement.


How I learned to stop worrying and write my own ORM


UPDATE: Dapper is now open source.

A few weeks ago we started investigating some performance issues at Stack Overflow.

Our web tier was running hot; it was often hitting 100% CPU. This was caused by a combination of factors: we made a few mistakes here and there. We allowed a few expensive operations to happen a bit too concurrently and … there were some framework issues.

Google’s monster spider machine

spider

Google loves Stack Overflow, it loves it so much that it will sometimes crawl 10 pages a second. This is totally by design; we told Google “crawl us as fast as you can” and Google did its best to comply.

The trouble was that crawling our question pages was expensive business.

We maintain a list of related questions on every question page; this list is rebuilt once a month in an on-demand background process. This process is usually pretty fast, however when it is happening 100,000 times a day in huge bursts … it can get a bit costly.

So, we fixed that up. We changed some stuff so Google triggers less background activity. However, performance problems persisted.

In frameworks we trust

Our question show page does a fair bit of database work. It needs to do a bunch of primary key lookups to pick up all the questions, answers, comments and participating users. This all adds up. I spent a fair bit of time profiling both production and dev; something kept showing up in my CPU analyzer traces:

System.Reflection.Emit.DynamicMethod.CreateDelegate
System.Data.Linq.SqlClient.ObjectReaderCompiler.Compile
System.Data.Linq.SqlClient.SqlProvider.GetReaderFactory
System.Data.Linq.SqlClient.SqlProvider.System.Data.Linq.Provider.IProvider.Compile
System.Data.Linq.CommonDataServices+DeferredSourceFactory`1.ExecuteKeyQuery
System.Data.Linq.CommonDataServices+DeferredSourceFactory`1.Execute
System.Linq.Enumerable.SingleOrDefault
System.Data.Linq.EntityRef`1.get_Entity

Much of our work at Stack Overflow depended on the assumption that LINQ-2-SQL is fast enough. Turns out it was not fast enough for us.

In the trace above you can see that EntityRef<T> is baking a method, which is not a problem, unless it is happening 100s of times a second.

Build your own ORM 101

There are tons of Object-Relational-Mappers for the .Net framework. There is an antique question on SO discussing which is best as well as probably tens if not hundreds of “similar”:http://stackoverflow.com/questions/66156/whose-data-access-layer-do-you-use-for-net questions.

When we started off Stack Overflow we chose LINQ-2-SQL. Both LINQ-2-SQL and the more recent magic unicorn Entity Framework 4.1 are built using a similar pattern.

You give the ORM a query, it builds SQL out of it, then it constructs a DynamicMethod to materialize the data coming back from SQL into your business objects. I am not privy to the exact implementation; it may include more or less work in these dynamic methods.

This implementation is an abstraction, and it leaks. It leaks performance: you give up some control over the queries you can run, and you have to deal with a LEFT JOIN syntax so crazy it can make grown men cry.

The promise of CompiledQuery

Microsoft have been aware that in some cases you do not want your ORM re-generating the same method millions of times just to pull out the same Post objects from your DB. To overcome this there is a mechanism in place that allows you to elect to re-use the method it generates across queries. Much has been written about the performance you can gain.

The trouble is that the syntax is kind of clunky and there are certain places you can not use compiled queries. In particular EntityRef and EntitySet.
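
For context, the compiled query pattern looks roughly like this (a minimal sketch; MyDataContext and Post are illustrative names rather than our actual types):

using System;
using System.Data.Linq;
using System.Linq;

static class PostQueries
{
    // compiled once into a static delegate and reused for every request,
    // so LINQ-2-SQL does not re-generate SQL or a materializer each time
    public static readonly Func<MyDataContext, int, Post> ById =
        CompiledQuery.Compile((MyDataContext db, int id) =>
            db.Posts.Single(p => p.Id == id));
}

// usage: var post = PostQueries.ById(Current.DB, postId);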

My solution

Due to various limitations much of our SQL is hand coded in LINQ-2-SQL ExecuteQuery blocks. This means we have lots of inline SQL similar to the following:

var stuff = Current.DB.ExecuteQuery<Foo>("select a, b, c from Foo where a = {0}", x);

This is the code we would always fall back to when performance was critical or we needed better control. We are all comfortable writing SQL and this feels pretty natural.

The trouble is that LINQ-2-SQL is still a bit slow when it comes to ExecuteQuery. So we started converting stuff to compiled queries.

I decided to experiment and see if I could out-perform ExecuteQuery by introducing a smarter caching mechanism. Once we had a replacement we could easily convert our existing code base to use it.

So, I wrote a tiny proof of concept ORM (gist of rev 1). IL weaving does get tricky.

It can act as a fast replacement for both compiled queries and ExecuteQuery methods. In future, I hope to release a more complete mapper so people can reuse it.
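
To make the caching idea concrete, here is a minimal sketch of such a mapper. It is reflection based for brevity, whereas the real prototype bakes IL, and every name here is illustrative rather than the gist’s actual code:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data;
using System.Data.Common;
using System.Reflection;

public static class TinyMapper
{
    static readonly ConcurrentDictionary<string, PropertyInfo[]> Cache =
        new ConcurrentDictionary<string, PropertyInfo[]>();

    public static IEnumerable<T> Query<T>(this DbConnection cnn, string sql) where T : new()
    {
        using (var cmd = cnn.CreateCommand())
        {
            cmd.CommandText = sql;
            using (var reader = cmd.ExecuteReader())
            {
                // the column-to-property map is computed once per (type, sql) pair and reused,
                // instead of being rebuilt on every single query
                var key = typeof(T).FullName + ":" + sql;
                var props = Cache.GetOrAdd(key, _ => MapColumns<T>(reader));

                while (reader.Read())
                {
                    var item = new T();
                    for (int i = 0; i < props.Length; i++)
                    {
                        if (props[i] == null || reader.IsDBNull(i)) continue;
                        props[i].SetValue(item, reader.GetValue(i), null);
                    }
                    yield return item;
                }
            }
        }
    }

    static PropertyInfo[] MapColumns<T>(IDataReader reader)
    {
        var props = new PropertyInfo[reader.FieldCount];
        for (int i = 0; i < reader.FieldCount; i++)
            props[i] = typeof(T).GetProperty(reader.GetName(i));
        return props;
    }
}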

h2. Some simple benchmarks

Our workload at Stack Overflow is heavily skewed toward very, very cheap queries that do a simple clustered index lookup. The DB returns the data instantly.

Take this for example:

create table Posts
(
    Id int identity primary key, 
    [Text] varchar(max) not null, 
    CreationDate datetime not null, 
    LastChangeDate datetime not null,
    Counter1 int,
    Counter2 int,
    Counter3 int,
    Counter4 int,
    Counter5 int,
    Counter6 int,
    Counter7 int,
    Counter8 int,
    Counter9 int
)

On my machine the cost of pulling out 100 random posts and turning them into Post objects is:

LINQ-2-SQL: 335 ms
LINQ-2-SQL compiled: 207 ms
LINQ-2-SQL ExecuteQuery: 242 ms
Sam's ORM: 174 ms
Entity Framework 4.1: 550 ms
Hand coded: 164 ms

So, LINQ-2-SQL can take double the amount of time to pull out our poor posts, but that is not the whole story. The trouble is that the extra 160ms is CPU time on the web server. The web server could simply be idle waiting for SQL to generate data, but instead it is busy rebuilding the same methods over and over.

You can make stuff a fair bit faster with compiled queries, however they are still, as expected, slower than hand coding. In fact there is quite a gap between hand coding and compiled queries.

Hand coding, which is fastest, is full of pain, bugs and general sadness.


post.Id = reader.GetInt32(0);
post.Text = reader.GetString(1);
post.CreationDate = reader.GetDateTime(2);
post.LastChangeDate = reader.GetDateTime(3);
post.Counter1 = reader.IsDBNull(4) ? (int?)null : reader.GetInt32(4);
// this makes me want to cry

I would take this over it any day:

var post = connection.ExecuteQuery("select * from Posts where Id = @Id", new {Id = 1}).First();

Another interesting fact is that Entity Framework is the slowest and also the most awkward to use. For one, its context can not reuse an open connection.

Other benefits

You are probably thinking, oh-no, yet another ORM. I thought the world had more ORMs than blogging engines. Our use case is very narrow and I think it makes sense.

We are using our new ORM for a specific problem: mapping parameterized SQL to business objects. We are not using it as a full blown ORM. It does not do relationships and other bells and whistles. This allows us to continue using LINQ-2-SQL where performance does not matter and port all our inline SQL to use our mapper, since it is faster and more flexible.

This control means we can continue tweaking performance and adding features we have long waited for.

For example:

var awesome = connection.ExecuteQuery("select * from Posts where Id in (select * from @Ids)", new {Ids = new int[] {1,2,3}});

SQL Server 2008 has awesome new Table-Valued Parameters which we would love to use. Many ORMs shy away from supporting them because you need to define a table type in SQL Server. However, we are in control here, so we can make it happen.
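
For illustration, here is roughly how ids could be passed as a table-valued parameter with plain ADO.NET today; the dbo.IntList type name and the helper below are hypothetical, and the table type has to be created in SQL Server first:

using System.Data;
using System.Data.SqlClient;

static class TvpExample
{
    // assumes this has been run on the server: CREATE TYPE dbo.IntList AS TABLE (Id int)
    public static SqlCommand BuildPostsByIdsCommand(SqlConnection connection, int[] ids)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        foreach (var id in ids) table.Rows.Add(id);

        var cmd = connection.CreateCommand();
        cmd.CommandText = "select * from Posts where Id in (select Id from @Ids)";
        cmd.Parameters.Add(new SqlParameter("@Ids", SqlDbType.Structured)
        {
            TypeName = "dbo.IntList", // hypothetical table type name
            Value = table
        });
        return cmd;
    }
}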

Recap

The end result of this round of optimization, which involved a massive amount of rewriting, hardware tuning and such, was that the question show page now only takes 50ms of server time to render. At its worst it was up at 200ms of server time. This performance boost was totally worth it. Optimizing away LINQ-2-SQL overhead played a key part in achieving this.

However, I would not recommend people go crazy and dump LINQ-2-SQL just because of this. My recommendation would be: measure, find bottlenecks and fix the ones that matter.

A day in the life of a slow page at Stack Overflow


In this post I would like to walk through our internal process of tuning a particular page on Stack Overflow.

In this example I will be looking at our badge detail page. It is not the most important page on the site, however it happens to showcase quite a few issues.

Step 1, Do we even have a problem?

As a developer, when I browse through the sites I can see how long a page takes to render.

profiling time

Furthermore, I can break down this time and figure out what is happening on the page:

break down

To make our lives even easier, I can see a timeline view of all the queries that executed on the page (at the moment as an html comment):

SQL breakdown

This is made possible using our mini profiler and a private implementation of DbConnection, DbCommand and DbDataReader. Our implementations of the raw database access constructs provide us with both tracking and extra protection (if we figure out a way of packaging this we will be happy to open source it).

When I looked at this particular page I quickly noticed a few concerning facts.

  1. It looks like we are running lots of database queries: our old trusty friend, N+1.
  2. More than half of the time is spent on the web server, which is very concerning.
  3. We are running some expensive queries, in particular a count that takes 11ms and a select that takes about 50ms.

At this point, I need to decide if it makes sense to spend any more effort here. There may be bigger and more important things I need to work on.

Since we store all of the haproxy logs (and retain them for a month), I am able to see how long it takes to render any page on average, as well as how many times this process runs per day.

select COUNT(*), AVG(Tr) from 
dbo.Log_2011_05_01
where Uri like '/badges/%'
----------- -----------
26834       532

This particular family of pages is accessed 26k times a day; it takes us, on average, 532ms to render. Ouch … pretty painful. Nonetheless, I should probably look at other areas; this is not the most heavily accessed area in the site.

That said, Google takes into account how long it takes to access your pages; if your page render time is high, it will affect your Google ranking.

I wrote this mess. I know it can be done way faster.

Code review

The first thing I do is a code review:

var badges = from u2b in DB.Users2Badges
    join b in DB.Badges on u2b.BadgeId equals b.Id
    from post in DB.Posts.Where(p => p.Id == u2b.ReasonId && b.BadgeReasonType == BadgeReasonType.Post && p.DeletionDate == null).DefaultIfEmpty()
    from tag in DB.Tags.Where(t => t.Id == u2b.ReasonId && b.BadgeReasonType == BadgeReasonType.Tag).DefaultIfEmpty()
    where u2b.BadgeId == badge.Id
    orderby u2b.Date descending
    select new BadgesViewModel.BadgeInfo
    {
        Users2Badge = u2b,
        Post = post,
        Tag = tag,
        User = u2b.User,
        Badge = badge
    };
var paged = badges.ToPagedList(page.Value, 60);

A typical LINQ-2-SQL multi join. ToPagedList performs a Count, and offsets the results to the particular page we are interested in, using Skip and Take.

Nothing horribly wrong here; in fact this is probably considered a best practice.
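
For readers who have not seen it, ToPagedList does roughly the following (a sketch only: the real helper and its PagedList type are internal to our code base, so the constructor shown here is an assumption):

using System.Linq;

public static class PagingExtensions
{
    public static PagedList<T> ToPagedList<T>(this IQueryable<T> source, int page, int pageSize)
    {
        var total = source.Count();                      // issues a COUNT query
        var items = source.Skip((page - 1) * pageSize)   // issues the paged SELECT
                          .Take(pageSize)
                          .ToList();
        // PagedList<T> is our internal type; this constructor is assumed for illustration
        return new PagedList<T>(items, page, pageSize, total);
    }
}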

However, as soon as we start measuring, I quickly notice that even though the SQL for this only takes 12 or so milliseconds locally, the total time it takes to execute the above code is much higher: profiling the block shows a 90ms execution time.

The LINQ-2-SQL abstraction is leaking 78ms of performance, OUCH. This is because it needs to generate a SQL statement from our fancy inline LINQ. It also needs to generate deserializers for our 5 objects. 78ms of overhead for 60 rows seems unreasonable to me.

Not only did performance leak here, some ungodly SQL was generated … twice: ungodly SQL

Dapper to the rescue

The first thing I set out to do is rewrite the same query in raw SQL using Dapper. Multi mapping support means such a conversion is really easy.

var sql = @"select  ub.*, u.*, p.*, t.* from 
(
    select *, row_number() over (order by Date desc) Row
    from Users2Badges 
    where BadgeId = @BadgeId
) ub
join Badges b on b.Id = ub.BadgeId
join Users u on u.Id = ub.UserId
left join Posts p on p.Id = ub.ReasonId and b.BadgeReasonTypeId = 2
left join Tags t on t.Id = ub.ReasonId and b.BadgeReasonTypeId = 1
where Row > ((@Page - 1) * @PerPage) and Row <= @Page * @PerPage 
order by ub.Date desc
";

var countSql = @"select count(*) from Users2Badges ub join Users u on u.Id = ub.UserId where BadgeId = @Id");

var rows = Current.DB.Query<Users2Badge, User, Post, Tag, BadgesViewModel.BadgeInfo>(sql, (ub, u, p, t) => new BadgesViewModel.BadgeInfo
            {
                Users2Badge = ub,
                User = u,
                Post = p,
                Tag = t,
                Badge = badge
            }, new { Page = page, BadgeId = badge.Id, PerPage = 60 }).ToList();

var count = Current.DB.Query<int>(countSql, new { Id = id.Value }).First();

I measure 0.2ms of performance leak in Dapper … sometimes. Sometimes it is too small to measure. We are making progress. Personally, I also find this more readable and debuggable.

This conversion comes with a multitude of wonderful side effects. I now have SQL I can comfortably work with in SQL Management Studio. In SQL Management Studio, I am superman. I can mold the query into the shape I want and I can profile it. Also, the count query is way simpler than the select query.

We are not done quite yet

Removing the ORM overhead is one thing; however, we also had a few other problems. I noticed that in production this query was taking way longer. I like to keep my local DB in sync with production so I can experience similar pains locally, so this smells of a bad execution plan being cached. This usually happens because indexes are missing. I quickly run the query with an execution plan and see:

index req

And one big FAT line:

fat line

Looks like the entire clustered index on the Users2Badges table is being scanned just to pull out the first page. Not good. When you see stuff like that it also introduces erratic query performance: what will happen when the query needs to look at a different badge or a different page? This query may be “optimal” given the circumstances, for page 1 of this badge. However, it may be far from optimal for a badge that only has 10 rows in the table; an index scan + bookmark lookup may do the trick there.

In production, depending on who hit what page first for which badge, you are stuck with horrible plans that lead to erratic behavior. The fix is NOT to put in OPTION (RECOMPILE); it is to correct the indexing.

So, SQL Server tells me it wants an index on Users2Badges(BadgeId) that includes (Id,UserId,Date,Comment,ReasonId). I will try anything once, so I give it a shot, and notice there is absolutely no performance impact. This index does not help.

Recommendations from the execution plan window are often wrong

We are trying to pull out badge records based on badge id ordered on date. So the correct index here is:

CREATE NONCLUSTERED INDEX Users2Badges_BadgeId_Date_Includes 
ON [dbo].[Users2Badges] ([BadgeId], [Date])
INCLUDE ([Id],[UserId],[Comment],[ReasonId])

This new index cuts SQL time on local by a few milliseconds, but more importantly, it cuts down the time it takes to pull page 100 by a factor of 10. The same plan is now used regardless of Page or Badge from a cold start.

Correcting the N+1 issue.

Lastly, I notice 50-60ms spent on an N+1 query on my local copy; fixing this is easy. It looks like we are pulling parent posts for the posts associated with the badges, and the profiler shows me that this work is deferred to the view on a per-row basis. I quickly change the ViewModel to pull in the missing posts and add a left join to the query. Easy enough.

Conclusion

After deploying these fixes in production, load time on the page went down from 630ms to 40ms; for pages deep in the list it is now over 100x faster.

Tuning SQL is key; the simple act of tuning it reduced the load time for this particular page by almost 50% in production. However, having a page take 300ms just because your ORM is inefficient is not excusable. The page now renders at the totally awesome speed of 20-40ms. In production. ORM inefficiency cost us a 10x slowdown.

Profiling your website like a true Ninja


After a mammoth effort by Jarrod Dixon the team’s production profiler is now ready for an open source release.

http://code.google.com/p/mvc-mini-profiler/

Let me start with a bold claim. Our open-source profiler is perhaps the best and most comprehensive production web page profiler out there for any web platform.

There I said it, so let me back up that statement.

The stone-age of web site profiling

Ever since the inception of the web people have been interested in render times for web pages. Performance is a feature we all want to have.

To help keep sites fast many developers include the time it takes to render a page in the footer or an html comment. This is trivial on almost any platform and has been used for years. However, the granularity of this method sucks. You can tell a page is slow, but what you really want to know is: why?

To overcome this issue, frameworks often use logs, like the famous Rails development.log. The trouble is that log files are often very hard to understand, and this information is tucked away in a place you do not often look.

Some people have innovated and taken this to the next level: Rack Bug is a good example for Rails, and L2S Prof is a good example for .Net. Additionally, some products like NewRelic take a holistic view of performance and give you a dashboard in the cloud with the ability to investigate particular perf issues down to the SQL.

The trouble has always been that the trivial profilers don’t help much. The nice ones are often not designed to work in production and often involve external tools, installs and dependencies. One clear exception is NewRelic, an excellent product. However when dealing with a single web page I think our profiler has an edge.

A ubiquitous production profiler

Our “dream” for the profiler was to have a way to get live, instant feedback on the performance characteristics of the pages we are visiting – in production. We wanted this information to be visible only to developers and to have no noticeable impact on the site’s performance. A tricky wish to fulfil, seeing as our network serves millions of page views a day.

screen shot

The ubiquity of the profiler is key: developers become aware of slowness, and the reasons for it, in everyday usage of the site. Analyzing the source of performance problems is trivial since you are able to drill down on and share profile sessions. We have become much more aware of performance issues and mindful of slow ajax calls (since they are also displayed on the page as they happen).

Production and development performance can vary by a huge margin

Ever since we started using our profiler in production we noticed some “strange” things. Often in dev a page would be fast and snappy, but in production the same page had very uneven performance. Often we traced this back to internal locking in LINQ-2-SQL and ported queries to dapper. This does however bring up a very important fact.

Development performance may be wildly different to production performance.

Page profiling vs. the holistic view

Internally we use 2 levels of production profiling. We log every request with its total render time in a database (via the HAProxy log); this gives us a birds-eye view of performance. When we need to dig into actual issues we use our profiler.

Both approaches are complementary and, in my view, necessary for a high performance, high scale website. Efforts are much better spent optimizing a page that is hit 100k times a day than an equally slow page that is only accessed a handful of times.

This kind of functionality should be baked in to web frameworks

I find it strange that web frameworks often omit basic functionality. Some do not include basic loggers; most do not offer an elegant log viewer. None seem to provide a comprehensive approach to page profiling out of the box. It's a shame; if we had all been looking at this kind of information from day one we could have avoided many pitfalls.

Play with our profiler … today

Our profiler is open source, and so is Data Explorer. All logged-in users can experience the profiler first-hand by browsing the web site.

Ease of deployment

The profiler is neatly packaged up in a single DLL. No need to copy any CSS, JS or other files to get it all working. Internally we use the excellent Razor view engine to code our profiling page; this is compiled on the fly from the embedded resource using this handy trick. Our CSS is all in awesome LESS, which we translate to CSS on the fly in JavaScript. All the resources are embedded into the DLL.

Profiling SQL is achieved by introducing a bespoke DbConnection that intercepts all DB calls. This interception only happens when a profiling session is in progress.

Profiling blocks are ludicrously cheap since we use a fancy trick around extension methods. You may call extension methods on null objects.

public static IDisposable Step(this MiniProfiler profiler, string name, ProfileLevel level = ProfileLevel.Info)
{
  return profiler == null ? null : profiler.StepImpl(name, level);
}

If there is no MiniProfiler in play the cost is a simple null check.
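
In application code a profiling block then looks like this (DoExpensiveWork is just a placeholder for whatever you want timed); when MiniProfiler.Current is null the using statement simply wraps a null and nothing is recorded:

using (MiniProfiler.Current.Step("Doing some expensive work"))
{
    DoExpensiveWork(); // placeholder for your own code, not a MiniProfiler API
}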


Hope you enjoy the MiniProfiler; be sure to tweet a thank you to @marcgravell and @jarrod_dixon if it helps you.

Automatically instrumenting an ASP.NET MVC3 app


Update: There is now a nuget package that will take care of wiring this up automatically, see: http://nuget.org/List/Packages/MiniProfiler.MVC3


MiniProfiler seems to have upset quite a few people. One major complaint we got when Scott Hanselman blogged about it goes:

“Why do you expect me to sprinkle all these ugly using (MiniProfiler.Current.Step("stuff")) calls throughout my code? This is ugly and wrong; profilers should never require you to alter your code base. FAIL.”

Our usual retort is that you are able to properly control the granularity of your profiling this way. Sure you can introduce aspects using PostSharp or your favorite IOC container, however this method may also force you to refactor your code to accommodate profiling.

That said, I agree that sometimes it may seem tedious and unnecessary to chuck using statements where the framework can do that for us.

Here is a quick document on how you can get far more granular timings out-of-the-box without altering any of your controllers or actions.


Profiling Controllers

Automatically wrapping your controllers in MVC3 is actually quite trivial; it can be done with a global action filter.

For example:

public class ProfilingActionFilter : ActionFilterAttribute
{

    const string stackKey = "ProfilingActionFilterStack";

    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        var mp = MiniProfiler.Current;
        if (mp != null)
        {
            var stack = HttpContext.Current.Items[stackKey] as Stack<IDisposable>;
            if (stack == null)
            {
                stack = new Stack<IDisposable>();
                HttpContext.Current.Items[stackKey] = stack;
            }

            var prof = MiniProfiler.Current.Step("Controller: " + filterContext.Controller.ToString() + "." + filterContext.ActionDescriptor.ActionName);
            stack.Push(prof);

        }
        base.OnActionExecuting(filterContext);
    }

    public override void OnActionExecuted(ActionExecutedContext filterContext)
    {
        base.OnActionExecuted(filterContext);
        var stack = HttpContext.Current.Items[stackKey] as Stack<IDisposable>;
        if (stack != null && stack.Count > 0)
        {
            stack.Pop().Dispose();
        }
    }
}

You wire this up in Global.asax.cs

protected void Application_Start()
{
   // other stuff ...

   GlobalFilters.Filters.Add(new ProfilingActionFilter());
   RegisterGlobalFilters(GlobalFilters.Filters);

   /// more stuff 
}

Profiling Views

Views, on the other hand, are a bit more tricky; what you can do is register a special “profiling” view engine that takes care of the instrumentation:

public class ProfilingViewEngine : IViewEngine
{
   class WrappedView : IView
    {
        IView wrapped;
        string name;
        bool isPartial;

        public WrappedView(IView wrapped, string name, bool isPartial)
        {
            this.wrapped = wrapped;
            this.name = name;
            this.isPartial = isPartial;
        }

        public void Render(ViewContext viewContext, System.IO.TextWriter writer)
        {
            using (MiniProfiler.Current.Step("Render "  + (isPartial?"partial":"")  + ": " + name))
            {
                wrapped.Render(viewContext, writer);
            }
        }
    }

    IViewEngine wrapped;

    public ProfilingViewEngine(IViewEngine wrapped)
    {
        this.wrapped = wrapped;
    }

    public ViewEngineResult FindPartialView(ControllerContext controllerContext, string partialViewName, bool useCache)
    {
        var found = wrapped.FindPartialView(controllerContext, partialViewName, useCache);
        if (found != null && found.View != null)
        {
            found = new ViewEngineResult(new WrappedView(found.View, partialViewName, isPartial: true), this);
        }
        return found;
    }

    public ViewEngineResult FindView(ControllerContext controllerContext, string viewName, string masterName, bool useCache)
    {
        var found = wrapped.FindView(controllerContext, viewName, masterName, useCache);
        if (found != null && found.View != null)
        {
            found = new ViewEngineResult(new WrappedView(found.View, viewName, isPartial: false), this);
        }
        return found;
    }

    public void ReleaseView(ControllerContext controllerContext, IView view)
    {
        wrapped.ReleaseView(controllerContext, view);
    }
}

You wire this up in Global.asax.cs:

protected void Application_Start()
{
   // stuff 

   var copy = ViewEngines.Engines.ToList();
   ViewEngines.Engines.Clear();
   foreach (var item in copy)
   {
       ViewEngines.Engines.Add(new ProfilingViewEngine(item));
   }

   // more stuff
}

The trick here is that we return a special view engine that simply wraps the parent and intercepts calls to Render.

The results speak for themselves:

Before

before

After

after


There you go, with minimal changes to your app you are now able to get a fair amount of instrumentation. My controller in the above sample had a System.Threading.Thread.Sleep(50); call which is really easy to spot now. My slow view had a System.Threading.Thread.Sleep(20); which is also easy to spot.

The caveat of this approach is that we now have another layer of indirection, which has a minimal effect on performance. We will look at porting these helpers into MiniProfiler in the near future.

Happy profiling, Sam

Oh view where art thou: finding views in ASP.NET MVC3


WARNING

Pretty much all the results / conclusions here are off. Turns out ASP.NET MVC has some pretty aggressive caching that is disabled in certain conditions:

protected VirtualPathProviderViewEngine() {
    if (HttpContext.Current == null || HttpContext.Current.IsDebuggingEnabled) {
        ViewLocationCache = DefaultViewLocationCache.Null;
    }
    else {
        ViewLocationCache = new DefaultViewLocationCache();
    }
}

HttpContext.Current.IsDebuggingEnabled will be set to true if your web.config contains

<configuration>
   <system.web>
     <compilation debug="true"  ...

This is totally unrelated to whether you compile your project in release mode or not.

If there is one takeaway point here it is that you should have an admin stats page in your web app that draws a big red box somewhere if HttpContext.Current.IsDebuggingEnabled is set to true.

Unfortunately, when I ran my tests for this blog post in release mode, the default web project did not amend the compilation node, leading to these rather concerning and erroneous results.

I just deployed the big red box™ to all our admin pages in the SE network and confirmed we are not running with compilation debug … so I blame my entire analysis below on a poor test setup.

The results below are only relevant to DEBUG mode


A few months ago I was fighting some performance dragons at Stack Overflow. We had a page that renders a partial per answer and I noticed that performance was leaking around the code responsible for locating partials. The leak was tiny, only 0.3ms per call – but it quickly added up. I tweeted the guru for some advice and got this:

tweet

You see, Phil has seen this happen before. I am told the code that locates partials in the next version of ASP.NET MVC is way faster.

However, for now, there are some things you should know.

How long does it take to render a view?

Rendering views in ASP.NET MVC is a relatively straightforward procedure. First, all the ViewEngines are interrogated, in order, about the view in question using the FindView or FindPartialView methods they provide. A view engine may be interrogated twice: once for a cached result and another time for a non-cached result.

Next, if a view is found Render is called on the actual IView.

So, the process breaks down to view location and view rendering. The logical separation is quite important from a perf perspective. Finding can be fast and rendering slow, or vice versa.

As it turns out the performance of your view location heavily depends on how explicit you are with your view name.

This matters less if you only have a single view to find, however often you may have hundreds of views to locate during construction of a single page, leading to pretty heavy performance leaks.

An illustrated example

Take this simple view.

@for (int i = 0; i < 30; i++)
{
    @Html.Partial("_ProductInfo",  "Product" + i )
}

@for (int i = 0; i < 30; i++)
{
    @Html.Partial(@"~/Views/Shared/_ProductInfo.cshtml",  "Product" + i )
}

When we enable profiling for this page we get some pretty interesting results:

first

VS.

second

The first sample, which happens to be the cleaner and easier to maintain code, comes with an 18ms perf penalty.

On an interesting side-note, non-partials are affected as well:

// 0.6ms faster than return View();
return View(@"~/Views/Home/Index.cshtml");

Does this issue affect me?

I recently added FindView profiling to MVC Mini Profiler. This will allow you to quickly determine the impact of view location within your page. This is done by intercepting the ViewEngines.

For those who want a look at the code, this is the trick I use:

private ViewEngineResult Find(ControllerContext controllerContext, string name, Func<ViewEngineResult> finder, bool isPartial)
{
    var profiler = MiniProfiler.Current;
    IDisposable block = null;
    var key = "find-view-or-partial";

    if (profiler != null)
    {
        block = HttpContext.Current.Items[key] as IDisposable;
        if (block == null)
        {
            HttpContext.Current.Items[key] = block = profiler.Step("Find: " + name);
        }
    }

    var found = finder();
    if (found != null && found.View != null)
    {
        found = new ViewEngineResult(new WrappedView(found.View, name, isPartial: isPartial), this);

        if (found != null && block != null)
        {
            block.Dispose();
            HttpContext.Current.Items[key] = null;
        }
    }

    if (found == null && block != null && this == ViewEngines.Engines.Last())
    {
        block.Dispose();
        HttpContext.Current.Items[key] = null;
    }

    return found;
}

The new “extra” bit of profiling will be available in the next version of Mini Profiler.

Epilogue

One very important take away is that every view engine in the pipeline can add to your perf impact. If most of your views are Razor, reordering the pipeline so you process Razor views first will result in some nice performance improvements. If you are not using a View Engine be sure to remove it from the pipeline.

Here is a screenshot of the default view engines (notice that WebFormsViewEngine is first):

default view engines

View Engines are interrogated twice, once for cached views and a second time for uncached views.

For example to only include Razor you could add:

ViewEngines.Engines.Clear();
ViewEngines.Engines.Add(new RazorViewEngine());

Happy profiling.

Porting LINQ-2-SQL to Dapper for great justice


I got this email over the weekend:

oops

I created the page, which is extra embarrassing: Mister Performance created a big fat performance mess. It is not the most important page on the site, it is only hit a few hundred times a day at most; however, performance is a feature.

First thing I do when I see these issues is dig in with MiniProfiler.

perf

In seconds the reason for the big slowdown is isolated. The ugly looking query is LINQ-2-SQL doing its magic. The 700ms+ gap is LINQ-2-SQL generating SQL from an expression and starting to materialize objects. It takes it almost a second after that to pull all the objects out. This happens every damn hit.

Enough is enough. Time to port it to Dapper.

There is a snag though: the page is using some pretty fancy expression stuff.

var baseQuery = Current.DB.TagSynonyms as IQueryable<TagSynonym>;

if (filter == TagSynonymsViewModel.Filter.Active)
{
    baseQuery = baseQuery.Where(t => t.ApprovalDate != null);
}
else if (filter == TagSynonymsViewModel.Filter.Suggested)
{
    baseQuery = baseQuery.Where(t => t.ApprovalDate == null);
    if (!CurrentUser.IsAnonymous && !CurrentUser.IsModerator)
    {
        // exclude ones I can not vote on
        baseQuery = from ts in Current.DB.TagSynonyms
                    join tags in Current.DB.Tags on ts.TargetTagName equals tags.Name
                    join stats in Current.DB.UserTagTotals on
                        new { Id = tags.Id, UserId = CurrentUser.Id }
                        equals
                        new { Id = stats.TagId, UserId = stats.UserId }
                    where ts.ApprovalDate == null && stats.TotalAnswerScore > Current.Site.Settings.TagSynonyms.TagScoreRequiredToVote
                    select ts;
    }
}

switch (tab.Value)
{
    case TagSynonymsViewModel.Tab.Newest:
        baseQuery = baseQuery.OrderByDescending(s => s.CreationDate);
        break;
    case TagSynonymsViewModel.Tab.Master:
        baseQuery = baseQuery.OrderBy(t => t.TargetTagName).ThenBy(t => t.AutoRenameCount);
        break;
    case TagSynonymsViewModel.Tab.Synonym:
        baseQuery = baseQuery.OrderBy(t => t.SourceTagName).ThenBy(t => t.AutoRenameCount);
        break;
    case TagSynonymsViewModel.Tab.Votes:
        baseQuery = baseQuery.OrderByDescending(t => t.Score).ThenBy(t => t.TargetTagName).ThenBy(t => t.AutoRenameCount);
        break;
    case TagSynonymsViewModel.Tab.Creator:
        baseQuery = (
                        from s in baseQuery
                        join users1 in Current.DB.Users on s.OwnerUserId equals users1.Id into users1temp
                        from users in users1temp.DefaultIfEmpty()
                        orderby users == null ? "" : users.DisplayName
                        select s
                    );
        break;
    case TagSynonymsViewModel.Tab.Renames:
        baseQuery = baseQuery.OrderByDescending(t => t.AutoRenameCount).ThenBy(t => t.TargetTagName);
        break;

    default:
        break;
}

if (search != null)
{
    baseQuery = baseQuery.Where(t => t.SourceTagName.Contains(search) || t.TargetTagName.Contains(search));
}

var viewModel = new TagSynonymsViewModel();
viewModel.CurrentTab = tab.Value;
viewModel.CurrentFilter = filter.Value;

viewModel.TagSynonyms =
    (
    from synonym in baseQuery
    join sourceTagsTemp in Current.DB.Tags on synonym.SourceTagName equals sourceTagsTemp.Name into sourceTagsTemp1
    join targetTagsTemp in Current.DB.Tags on synonym.TargetTagName equals targetTagsTemp.Name into targetTagsTemp1
    from sourceTag in sourceTagsTemp1.DefaultIfEmpty()
    from targetTag in targetTagsTemp1.DefaultIfEmpty()
    where filter != TagSynonymsViewModel.Filter.Merge || (synonym.ApprovalDate != null && sourceTag != null && sourceTag.Count > 0)
    select new TagSynonymsViewModel.TagSynonymRow() { SourceTag = sourceTag, TargetTag = targetTag, TagSynonym = synonym }
    ).ToPagedList(page.Value, pageSize.Value);
return viewModel;

Depending on the input parameters the code could be running one of 30 or so variations of the same SQL. Translating this to SQL without an API that allows you to compose a query can easily result in spaghetti code that is very hard to maintain. You need to attach different parameters to the query, depending on the inputs. You need to be able to sort it in many ways. You need to be able to page through the results.

To overcome this mess I created a new SqlBuilder that is somewhat inspired by PetaPoco’s one. It lives in Dapper.Contrib. Even in its early prototype phase it is incredibly powerful:

var builder = new SqlBuilder();

int start = (page.Value - 1) * pageSize.Value + 1;
int finish = page.Value * pageSize.Value;

var selectTemplate = builder.AddTemplate(
@"select X.*, st1.*, tt1.*, u1.* from 
(
select ts.*, ROW_NUMBER() OVER (/**orderby**/) AS RowNumber from TagSynonyms ts 
left join Tags st on SourceTagName = st.Name 
left join Tags tt on TargetTagName = tt.Name
/**leftjoin**/
/**where**/
) as X 
left join Tags st1 on SourceTagName = st1.Name 
left join Tags tt1 on TargetTagName = tt1.Name
left join Users u1 on u1.Id = X.OwnerUserId
where RowNumber between @start and @finish", new { start, finish }
);

var countTemplate = builder.AddTemplate(@"select count(*) from TagSynonyms ts 
left join Tags st on SourceTagName = st.Name 
left join Tags tt on TargetTagName = tt.Name
/**leftjoin**/
/**where**/");


if (filter == TagSynonymsViewModel.Filter.Active)
{
    builder.Where("ts.ApprovalDate is not null");
}
else if (filter == TagSynonymsViewModel.Filter.Suggested)
{
    builder.Where("ts.ApprovalDate is null");

    if (!CurrentUser.IsAnonymous && !CurrentUser.IsModerator)
    {
        builder.Where(@"ts.TargetTagName in (
select Name from Tags where Id in 
(select Id from UserTagTotals where UserId = @CurrentUserId and TotalAnswerScore > @TagScoreRequiredToVote)
)", new { CurrentUserId = CurrentUser.Id, Current.Site.Settings.TagSynonyms.TagScoreRequiredToVote});
    }
}

switch (tab.Value)
{
    case TagSynonymsViewModel.Tab.Newest:
        builder.OrderBy("ts.CreationDate desc");
        break;
    case TagSynonymsViewModel.Tab.Master:
        builder.OrderBy("ts.TargetTagName asc, ts.AutoRenameCount desc");
        break;
    case TagSynonymsViewModel.Tab.Synonym:
        builder.OrderBy("ts.SourceTagName asc, ts.AutoRenameCount desc");
        break;
    case TagSynonymsViewModel.Tab.Votes:
        builder.OrderBy("ts.Score desc, TargetTagName asc, AutoRenameCount desc");
        break;
    case TagSynonymsViewModel.Tab.Creator:
        builder.LeftJoin("Users u on u.Id = ts.OwnerUserId");
        builder.OrderBy("u.DisplayName");
        break;
    case TagSynonymsViewModel.Tab.Renames:
        builder.OrderBy("ts.AutoRenameCount desc, ts.TargetTagName");
        break;

    default:
        break;
}

if (search != null)
{
    builder.Where("(SourceTagName like @search or TargetTagName like @search)", new { search = "%" + search + "%"});
}

if (filter.Value == TagSynonymsViewModel.Filter.Merge)
{
    builder.Where("ApprovalDate is not null and isnull(st.Count,0) > 0");
}

var viewModel = new TagSynonymsViewModel();
viewModel.CurrentTab = tab.Value;
viewModel.CurrentFilter = filter.Value;

int count = Current.DB.Query<int>(countTemplate.RawSql, countTemplate.Parameters).First();
var rows = Current.DB.Query<TagSynonym, Tag, Tag, User, TagSynonymsViewModel.TagSynonymRow>(
    selectTemplate.RawSql, 
    (ts,t1,t2,u) => 
    { 
        var row = new TagSynonymsViewModel.TagSynonymRow { TagSynonym = ts, SourceTag = t1, TargetTag = t2 };
        row.TagSynonym.User = u;
        return row;
    },
    selectTemplate.Parameters);

var list = new PagedList<TagSynonymsViewModel.TagSynonymRow>(rows, page.Value, pageSize.Value, 
    forceIndexInBounds: true, prePagedTotalCount: count);
viewModel.TagSynonyms = list;
return viewModel;

Not only is the resulting code both simpler and way faster, it is also much more flexible. I no longer need to juggle LINQ’s horrible LEFT JOIN semantics in my head. I can easily tell when I am attaching parameters and how.

This new implementation retained all of the complex conditional selection and counting logic and happens to be 2 or so orders of magnitude faster.


How the SqlBuilder works

The SqlBuilder allows you to generate N SQL templates from a composed query.

Here is a trivial example:

var builder = new SqlBuilder(); 
var count = builder.AddTemplate("select count(*) from table /**where**/");
var selector = builder.AddTemplate("select * from table /**where**/ /**orderby**/"); 
    
builder.Where("a = @a", new {a = 1});
// defaults to composing with AND
builder.Where("b = @b", new {b = 2})
builder.OrderBy("a");
// defaults to composing with , 
builder.OrderBy("b"); 
    
var total = cnn.Query<int>(count.RawSql, count.Parameters).Single();
var rows = cnn.Query(selector.RawSql, selector.Parameters);
    
// Same as: 
    
var count = cnn.Query("select count(*) from table where a = @a and b = @b", new {a=1,b=1});
var rows = cnn.Query("select * from table where a = @a and b = @b order by a, b", new {a=1,b=1});

This API is not finalized. However, this simple templating solution covers most of the cases where we needed composable expressions, while keeping the code clean and maintainable.


We dug ourselves into this performance mess; we are slowly paying back the technical debt of using LINQ-2-SQL. Are you?

Extending the ASP.NET error page (show me the SQL edition)


ASP.NET has the infamous yellow screen of death. Rumor is that it was written by “The Gu” himself. It is an incredibly useful page that allows you to quickly and easily determine why stuff went wrong, while debugging your code. You get a stack trace, including the line of code and context for your error.

There is a slight snag though. When diagnosing SQL errors this page is lacking. Even though you are provided with the exception SQL Server returned, you do not get the SQL that executed nor do you get the parameters. Often I found myself setting break points and attaching, just to get at the SQL. This is very wasteful.

So I thought … why not extend the error page. Turns out this is fairly tricky.

There are two very big problems. Firstly, DbException and derived classes do not have any idea what the CommandText is. Secondly, extending the yellow screen of death is totally undocumented.

Nonetheless, I care not for such limitations. I wanted this:

rich error

Step 1 – storing and formatting the SQL statement

The custom DbConnection we use with MiniProfiler makes tracking the failed SQL simple.

First, we extend the ProfiledDbConnection:

public sealed class RichErrorDbConnection : ProfiledDbConnection
{
#if DEBUG
    DbConnection connection;
    MiniProfiler profiler;
#endif

    public RichErrorDbConnection(DbConnection connection, MiniProfiler profiler)
        : base(connection, profiler)
    {
#if DEBUG
        this.connection = connection;
        this.profiler = profiler;
#endif
    }

#if DEBUG
    protected override DbCommand CreateDbCommand()
    {
        return new RichErrorCommand(connection.CreateCommand(), connection, profiler);
    }
#endif
}

Next we implement a class that intercepts the exceptions and logs the SQL.

public class RichErrorCommand : ProfiledDbCommand
{ 
    public RichErrorCommand(DbCommand cmd, DbConnection conn, IDbProfiler profiler) : base(cmd, conn, profiler)
    {
    }

    void LogCommandAsError(Exception e, ExecuteType type)
    {
        var formatter = new MvcMiniProfiler.SqlFormatters.SqlServerFormatter();
        SqlTiming timing = new SqlTiming(this, type, null);
        e.Data["SQL"] = formatter.FormatSql(timing);
    }

    public override int ExecuteNonQuery()
    {
        try
        {
            return base.ExecuteNonQuery();
        }
        catch (DbException e)
        {
            LogCommandAsError(e, ExecuteType.NonQuery);
            throw;
        }
    }

    protected override DbDataReader ExecuteDbDataReader(CommandBehavior behavior)
    {
        try
        {
            return base.ExecuteDbDataReader(behavior);
        }
        catch (DbException e)
        {
            LogCommandAsError(e, ExecuteType.Reader);
            throw;
        }
    }

    public override object ExecuteScalar()
    {
        try
        {
            return base.ExecuteScalar();
        }
        catch (DbException e)
        {
            LogCommandAsError(e, ExecuteType.Scalar);
            throw;
        }
    }
}

During debug we now make sure we use RichErrorDbConnection as our db connection.
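
For example, the connection factory might look something like this (a sketch only; ConnectionFactory, the "AppDb" connection string name and GetOpenConnection are illustrative, not our actual code):

static class ConnectionFactory
{
    // "AppDb" is a hypothetical connection string name from web.config
    static readonly string connectionString =
        System.Configuration.ConfigurationManager.ConnectionStrings["AppDb"].ConnectionString;

    public static DbConnection GetOpenConnection()
    {
        var cnn = new System.Data.SqlClient.SqlConnection(connectionString);
        cnn.Open();
#if DEBUG
        // debug builds get the rich SQL error pages on top of profiling
        return new RichErrorDbConnection(cnn, MiniProfiler.Current);
#else
        return new ProfiledDbConnection(cnn, MiniProfiler.Current);
#endif
    }
}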

Step 2 – extending the error page

This was fairly tricky to discover; it lives in Global.asax.cs:

protected void Application_Error(object sender, EventArgs e)
{
#if DEBUG 
    var lastError = Server.GetLastError();
    string sql = null;

    try
    {
        sql = lastError.Data["SQL"] as string;
    }
    catch
    { 
        // skip it
    }

    if (sql == null) return;

    var ex = new HttpUnhandledException("An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.", lastError);

    Server.ClearError();

    var html = ex.GetHtmlErrorMessage();
    var traceNode = "<b>Stack Trace:</b>";
    html = html.Replace(traceNode, @"<b>Sql:</b><br><br>
    <table width='100%' bgcolor='#ffffccc'>
    <tbody><tr><td><code><pre>" + sql + @"</pre></code></td></tr></tbody>
    </table><br>" + traceNode);

    HttpContext.Current.Response.Write(html);
    HttpContext.Current.Response.StatusCode = 500;
    HttpContext.Current.Response.Status = "Internal Server Error";
    HttpContext.Current.Response.End();

#endif
}

The trick here is that we use a string replace to yank in the new chunk of html.


I hope that future releases of the platform make both of these hacks easier to implement. It would be awesome if MVC4 shipped a template error page you could just edit, and if the base ADO.NET interfaces provided a means of interception without needing a full re-implementation.


EDIT Just updated the code sample following the comments by Keith Henry and Nigel, thanks!


Spam, bacon, sausage and blog spam: a JavaScript approach


Anyone who has a blog knows about the dirty little spammers, who toil hard to make the Internet a far worse place.

I knew about this issue when I first launched my blog, and quickly wired up akismet as my only line of defence. Over the years I got a steady stream of rejected spam comments with the occasional false-positive and false-negative.

Once a week I would go to the spam tab and comb through the mountains of spam to see if anything was incorrectly detected, approve it, then nuke the rest.

Such a waste of time.

Akismet should never be your only line of protection.

Akismet is a web service that prides itself on the huge amount of blog spam it traps:

spam

It uses all sorts of heuristics, machine learning algorithms, Bayesian inference and so on to detect spam.

Every day people around the world ship it way over 31 million bits of spam for it to promptly reject. My experience is that the vast majority of comments on my blog were spam. I think this number is so high because we, programmers, have dropped the ball.

Automated methods of spam prevention can solve a large amount of your spam pain.

Anatomy of a spammer

Currently, the state-of-the-art for the sleaze-ball spammers on the Internet is very similar to what it was 10 years ago.

The motivation is totally unclear: how could posting an indecipherable advertising message be helping anyone make money?

The technique however is crystal clear.

A bunch of Perl/Python/Ruby scripts are running amok, posting as many messages as possible on as many blogs as possible.

These scripts have been customised to work around the various protection mechanisms that WordPress and phpBB implemented. CAPTCHA solvers are wired in, known JavaScript traps worked around, and so on.

However, these primitive programs are yet to run full headless web browsers. This means they have no access to the DOM, and they can not run JavaScript.

The existence of a full web browser should be your first line of defence

I eliminated virtually all the spam on this blog by adding a trivial bit of protection:

$(function(){
    $(".simple_challenge").each(function(){
      this.value = this.value.split("").reverse().join("");   
    });    
 });

I expect the client to reverse a random string I give it. If it fails to do so, it gets a reCAPTCHA. This is devilishly hard for a bot to cheat without a full JavaScript interpreter and DOM.
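
On the server side the check is trivial. In C#, for example, it might look like this (a sketch; it assumes the issued random string was stashed in the user's session when the comment form was rendered):

using System.Linq;

public static class SimpleChallenge
{
    // issued: the random string embedded in the form; posted: what the browser sent back
    public static bool Passes(string issued, string posted)
    {
        if (string.IsNullOrEmpty(issued) || string.IsNullOrEmpty(posted)) return false;
        var reversed = new string(issued.Reverse().ToArray());
        return posted == reversed; // anything else falls through to reCAPTCHA
    }
}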

Of course if WordPress were to implement this, even as a plugin, it would be worked around using the monstrous evil-spammer-script and added to the list of 7000 hardcoded workarounds in the mega script of ugliness and doom.

My point here is not my trivial spam prevention that rivals FizzBuzz in its delicate complexity.

There are an infinite number of ways you can ensure your users are using a modern web browser. You can ask them to reverse, sort, transpose, truncate, duplicate a string and so on … and so on.

In fact you could generate JavaScript on the server side that runs a random transformation on a string and confirm that happens on the client.

Possibly this could be outsourced. You could force clients to make a JSONP call to a 3rd party that shuffles and changes its algorithms on an hourly basis, then make a call on the server to confirm.

reCAPTCHA should be your second line of defence

Notice how I said reCAPTCHA, not CAPTCHA. The beauty of the reCAPTCHA system is that it helps make the world a better place by digitising content that our existing OCR systems failed at. This improves the OCR software Google builds, it helps preserve old content, and provides general good. Another huge advantage is that it adapts to the latest advances in OCR and gets harder for the spammers to automatically crack.

and for humans

Though sometimes it can be a bit too hard for us humans.

CAPTCHA systems on the other hand are a total waste of human effort. Not only are many of the static CAPTCHA systems broken and already hooked up in the uber-spammer script, but your poor users are also doing no good by solving them.

A tiny fraction of users seem to be obsessed with running JavaScript-less web browsers, using addons such as NoScript to provide a much “safer” Internet experience. I totally understand the reasoning; however, these users can deal with some extra work. The general population has fully functioning web browsers and never needs to hit this line of defence.

Throttles, IP bans and so on should be your last line of defence

No matter what you do, at a big enough scale some bots will attack you and attempt to post the same comment over and over on every post. If the same IP address is going crazy all over your website, the best protection is to ban it.

I am not sure where Akismet fits in

For my tiny blog, it seems, Akismet is not really helping out anymore. I still send it all the comments for validation, mainly because that is the way it has always been. It has a secondary, optional status.

My advice would be: get your other lines of defence up first, then think about possibly wiring up Akismet.

What happens when the filthy spammers catch up?

Someday, perhaps, the spammers will catch up, get a bunch of sophisticated developers and hack up chromium for the purpose of spamming. I don’t know. When and if this happens we still have another line of defence that is implementable today.

Headless web browsers can be thwarted

I guess some day a bunch of “headless” web browsers will be busy ruining the Internet. A huge advantage the new canvas APIs have is that we can now confirm pixels are rendered to the screen with the getImageData API. Render a few colors to the screen, read them back and make sure they rendered properly.

Sure, this will trigger a reCAPTCHA for the less modern browsers, but we are probably talking a few years before the attack of the headless web browsers.

And what do we do when this fails?

Enter “proof of work” algorithms

We could require a second of computation from people who post comments on a blog. It is called a “proof of work” algorithm. Bitcoin uses such an algorithm. The concept is quite simple.

There are plenty of JavaScript implementations of hash functions.

  1. You hand the client a random string to hash, eg: ABC123
  2. It appends a nonce to the string and hashes it, eg: ABC123!1
  3. If the hash starts with 000, or satisfies some other predefined rule, the client stops.
  4. Otherwise, increase the nonce and repeat from step 2, eg: ABC123!2

This means you are forcing the client to do a certain amount of computation prior to allowing it to post a comment, which can heavily impact any automated processes busy destroying the Internet. It means they need to run more computers on their quest of doom, which costs them more money.
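For flavour, here is a hedged C# sketch of both sides of such a scheme, assuming SHA-1 and a “hex digest starts with 000” rule. The solving loop would normally run in JavaScript in the browser; the server only pays for a single hash to verify.

using System;
using System.Security.Cryptography;
using System.Text;

public static class ProofOfWork
{
    // Server-side check: the client claims that hashing "<challenge>!<nonce>"
    // produces a digest whose hex form starts with the required prefix.
    public static bool Verify(string challenge, int nonce, string requiredPrefix = "000")
    {
        using (var sha1 = SHA1.Create())
        {
            var bytes = Encoding.UTF8.GetBytes(challenge + "!" + nonce);
            var hex = BitConverter.ToString(sha1.ComputeHash(bytes)).Replace("-", "");
            return hex.StartsWith(requiredPrefix, StringComparison.OrdinalIgnoreCase);
        }
    }

    // What the client does (in JavaScript), expressed in C# for clarity:
    // keep incrementing the nonce until the rule is satisfied.
    public static int Solve(string challenge, string requiredPrefix = "000")
    {
        var nonce = 0;
        while (!Verify(challenge, nonce, requiredPrefix)) nonce++;
        return nonce;
    }
}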

There is no substitute for you

Sure, a bunch of people can always run sophisticated attacks that force you to disable comments on your blog. It’s the sad reality. If you abandon your blog for long enough it will fill up with spam.

That said, if we required everyone leaving a comment to have a fully working web browser, we would drastically reduce the amount of blog spam.

Optimizing ASP.NET MVC3 Routing


A few weeks ago we started noticing a fairly worrying trend at Stack Overflow. Question:Show, the most active page on Stack Overflow, started slowly trending up in render time. This did not happen all at once; the performance leak was not huge.

We noticed this by looking at our HAProxy logs, which are stored in a dedicated SQL Server instance thanks to Jarrod. This database contains render times, so we can quickly aggregate them and catch worrying trends.

During the time we were browsing the site, occasionally, we would catch strange stalls before any of the code in our controllers fired.

gap

Getting big gaps like the one above was rare, however we consistently saw 5ms stalls before any code fired off on our side.

We thought of adding instrumentation to figure out more about these leaks, but could not figure out what needed instrumenting.

CPU Analyzer to the rescue

I blogged about CPU Analyzer in the past but failed to properly communicate its awesomeness.

It is a sampling profiler designed to work in production. In general, traditional profiling methods fall over when digging into production issues or high CPU issues. CPU Analyzer works around this.

It attaches itself to a running process using the ICorDebug interface and gathers stack traces at a predefined rate. Additionally, it gathers CPU time for each thread while doing so.

At the end of this process it sorts all of the information it has into a somewhat readable text representation, showing the most “expensive” stack traces at the top.

This allows you to get a crude view of the current state of affairs and find likely sources of various performance gaps.

I ran CPU Analyzer a few times on the production instance of w3wp, gathering 40 snapshots at a time, one every 110ms. Warning: if you gather snapshots too rapidly you risk crashing your process. ICorDebug was not designed for profiling, ICorProfiler is (the ability to attach with it, though, is recent); I may rewrite it one day.

cpu-analyzer-net4.exe 12345 /s 40 /i 110

I found some interesting snapshots:

System.Web.Routing.RouteCollection.GetRouteData
System.Web.Routing.UrlRoutingModule.PostResolveRequestCache
System.Web.HttpApplication+SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute
[...]
System.Web.Hosting.PipelineRuntime.ProcessRequestNotification
===> Cost (11232072)

And:

System.Text.RegularExpressions.Regex.IsMatch
System.Web.Routing.Route.ProcessConstraint
System.Web.Routing.Route.ProcessConstraints
System.Web.Routing.Route.GetRouteData
[...]
System.Web.Hosting.PipelineRuntime.ProcessRequestNotification
===> Cost (7332047)

And

System.IO.File.FillAttributeInfo
System.IO.File.InternalExists
System.IO.File.Exists
System.Web.Hosting.MapPathBasedVirtualPathProvider.CacheLookupOrInsert
System.Web.Routing.RouteCollection.GetRouteData
[...]
System.Web.Hosting.PipelineRuntime.ProcessRequestNotification
===> Cost (1560010)

From the three snapshots we can see that GetRouteData is busy running Regex matches. In fact, these regular expressions are taking up 65% of the time in that method. Another 13% is consumed by File.Exists calls.

Now, these times are just samples, they may not represent reality especially for short operations. However, I gathered enough samples for a consistent picture to emerge.

I pinged Marcin and Phil, and got a very prompt and detailed reply from Marcin.

A crash course on ASP.NET routing

In an MVC3 scenario the UrlRoutingModule is responsible for looking at the path you type in your address bar, parsing it, and figuring out what information to pass on to MVC so it can dispatch the request.

Routes are denoted by fairly trivial URL patterns. For example the path hello/{world} contains one static segment called hello and one wildcard placeholder named world, which can contain any text except for /.

The default process works somewhat like this:

  1. RouteCollection checks with its MapPathBasedVirtualPathProvider to see if the existing path is an actual location on disk. If it is, routing is bypassed and the next module picks up. This check can be disabled with RouteTable.Routes.RouteExistingFiles = true;

  2. MapPathBasedVirtualPathProvider has an internal cache (that sits on HttpRuntime.Cache) which ensures it only performs a File.Exists or Directory.Exists call once a minute per unique string. This is configurable using the urlMetadataSlidingExpiration setting on hostingEnvironment in web.config.

  3. Every route is tested for a match, in order, in a simple loop. Top to bottom. This test is performed in the ParsedRoute class. For every route tested a dictionary will be constructed and the virtual path split up into a list. If you have 600 routes and only the bottom one matches, 600 dictionaries and lists will be created and some other processing will happen per-route.

  4. If a match is found, constraints are checked. There are 2 types of constraints: string constraints, which are treated as a Regex and tested for a match, and IRouteConstraint implementations, which are tested using their custom Match implementation. Both types of constraints live in a collection that is passed in as a RouteValueDictionary, which is an IDictionary<string,object>. If any constraint fails to match, the route is skipped.

  5. The router checks string Regex constraints using a static Regex.IsMatch call with RegexOptions.CultureInvariant | RegexOptions.Compiled | RegexOptions.IgnoreCase; more about why that is a problem soon.

  6. Once the match is found it is passed on to MvcRouteHandler which creates an MvcHandler and is privy to the information the Route parsed out. Such as Controller and Action names.

We are very fussy about routes at Stack Overflow so we have tons

When you look at a question on Stack Overflow, the url: questions/1/my-awesome-question you do not see the url: Question/Show/1.

We are fussy. All our paths need to look just so. We pick nice friendly urls for all our various actions, which results in a pretty huge list of routes. We have upwards of 650 routes registered. The fact that we use attribute based routing makes creating lots of routes trivial.

Not only that, we are very fussy about the paths we route and make use of many RegEx constraints.

For example, for Question:Show we register the route questions/{id}/{title?} and add the constraint ["id"] = @"\d{1,9}". This ensures that questions/1 hits our question route and questions/bob does not. Without that constraint, the router would pass such requests on to MVC, which in turn would throw an exception (since id is an int and is not optional).

The Regex cache and why it was impacting us

The Regex class has an internal cache for Regexes that are created using static method calls on the Regex class. You can specify its size using the CacheSize property:

By default, the cache holds 15 compiled regular expressions. Your application typically will not need to modify the size of the cache. Use the CacheSize property in the rare situation when you need to turn off caching or you have an unusually large cache.

This documentation on MSDN is confusing and misleading; the Regex cache is not reserved for compiled regular expressions. As this snippet demonstrates, a compiled regex will take up a spot and a non-compiled one will as well:

Action<string,Action> timeIt = (msg,action) =>
{
    var sw = Stopwatch.StartNew();
    action();
    sw.Stop();
    Console.WriteLine(msg + " took: {0:0.####}ms", sw.Elapsed.TotalMilliseconds);
};

Regex.CacheSize = 1;
// warm up the regex engine with a different pattern
Regex.Match("12345", @"^(\d{0,9}...)$", RegexOptions.Compiled);


timeIt("compiled first run", () => 
{
    Regex.Match("12345", @"^(\d{0,9})$", RegexOptions.Compiled);    
});

// Outputs: 0.9321ms

timeIt("compiled second run", () =>
{
    Regex.Match("12345", @"^(\d{0,9})$", RegexOptions.Compiled);
});

// Outputs: 0.077ms

timeIt("compiled third run, constructor checks cache", () =>
{
    new Regex( @"^(\d{0,9})$", RegexOptions.Compiled).Match("12345");
});

// Outputs: 0.0822ms

timeIt("not compiled first run", () =>
{
    Regex.Match("12345", @"^(\d{0,9})$");
});

// Outputs: 0.0856ms

timeIt("compiled run after the non compiled regex evicted it", () =>
{
    Regex.Match("12345", @"^(\d{0,9})$", RegexOptions.Compiled);
});

// Outputs: 0.914ms - clearly dropped off cache

Console.ReadKey();

Static method calls on the Regex class are allowed to insert stuff into this cache. All regular expression construction (which happens once per static method call) checks the cache. The router makes a call to a static method on the Regex class when processing constraints.

Constructing a compiled regex can be up to 3 orders of magnitude slower than using a plain old interpreted Regex. However the cache can easily mask this, since it allows for reuse.

There is a big caveat.

We make heavy use of Regex.IsMatch and Regex.Replace calls in our code base, for non performance critical stuff. Performance critical regular expressions live in compiled static fields on various classes.

If you make heavy use of Regex static method calls, like we do, you are risking compiled regular expressions dropping off the cache at any time. The cost of recompiling a regular expression can be huge.

I strongly recommend you never pass the RegexOptions.Compiled to any of the static helpers, it is very risky. You have no visibility into the cache and do not know how various tool vendors will use it.
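The pattern we prefer for performance critical expressions is a compiled Regex stored in a static field and reused, so the static-helper cache and its eviction policy never enter the picture. A minimal sketch:

using System.Text.RegularExpressions;

static class Validators
{
    // compiled once, lives for the life of the app domain, never evicted
    static readonly Regex QuestionId = new Regex(@"^\d{1,9}$",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    public static bool IsQuestionId(string input)
    {
        return QuestionId.IsMatch(input);
    }
}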

With all of this in mind, here is a list of optimizations we took. It eliminated the vast majority of the routing cost.


Firstly, we stopped using string based route constraints

The biggest optimization responsible for the majority of the performance gains was eliminating the use of string regular expression constraints.

var constraints = new Dictionary<string, IRouteConstraint>();

// instead of
// constraints.Add("Id", @"\d{1,9}"); 

// we now do: 
constraints.Add("Id", new RegexConstraint(@"^(\d{1,9})$"));

With a trivially defined IRouteConstraint:

public class RegexConstraint : IRouteConstraint, IEquatable<RegexConstraint>
{
   Regex regex;

   public RegexConstraint(string pattern, RegexOptions options = RegexOptions.CultureInvariant | RegexOptions.Compiled | RegexOptions.IgnoreCase)
   {
      regex = new Regex(pattern, options);
   }

   public bool Match(System.Web.HttpContextBase httpContext, Route route, string parameterName, RouteValueDictionary values, RouteDirection routeDirection)
   {
      object val;
      values.TryGetValue(parameterName, out val);
      string input = Convert.ToString(val, CultureInfo.InvariantCulture);
      return regex.IsMatch(input);
   }

   // two constraints are equal if they wrap the same pattern and options
   public bool Equals(RegexConstraint other)
   {
      return other != null && other.regex.ToString() == regex.ToString() && other.regex.Options == regex.Options;
   }
}

This change alone eliminated all the “jittering”. After this, finding routes took a consistent amount of time.

Next, we reordered our routes so the most used routes are on top

This is the second most important optimization. Routes that are only accessed 10 times a day should not be at the top of routes that are accessed millions of times a day. Every route check comes with a constant amount of overhead.

On startup, we query HAProxy and order our routing table so the most used routes are on the top. We have special rules that ensure the ordering is “safe” in cases where routes overlap, and ordering needs to be maintained.
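A rough sketch of the idea, assuming a hypothetical hitCounts dictionary built from the HAProxy logs and glossing over named routes and the overlap rules mentioned above:

using System.Collections.Generic;
using System.Linq;
using System.Web.Routing;

static class RouteReorder
{
    // Reorder the route table so frequently hit routes are matched first.
    // hitCounts: route url pattern -> requests per day, derived from the HAProxy logs.
    public static void ByUsage(RouteCollection routes, IDictionary<string, long> hitCounts)
    {
        var ordered = routes
            .OfType<Route>()                    // assumes every entry is a plain Route
            .OrderByDescending(r =>
            {
                long hits;
                return hitCounts.TryGetValue(r.Url, out hits) ? hits : 0;
            })
            .ToList();                          // OrderByDescending is stable, ties keep their order

        using (routes.GetWriteLock())
        {
            routes.Clear();
            foreach (var route in ordered) routes.Add(route);
        }
    }
}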

Then, we turbo charged the GetRouteData call

Our routes, in general, follow a simple pattern. We have a static string followed by a dynamic catch all. So, for example we have: posts/{postid}/comments and q/{id}/{userid}. This pattern allows us to super charge GetRouteData.

class LeftMatchingRoute : Route
{
    private readonly string neededOnTheLeft;
    public LeftMatchingRoute(string url, IRouteHandler handler)
        : base(url, handler)
    {
        int idx = url.IndexOf('{');
        neededOnTheLeft = "~/" + (idx >= 0 ? url.Substring(0, idx) : url).TrimEnd('/');
    }
    public override RouteData GetRouteData(System.Web.HttpContextBase httpContext)
    {
        if (!httpContext.Request.AppRelativeCurrentExecutionFilePath.StartsWith(neededOnTheLeft, true, CultureInfo.InvariantCulture)) return null;
        return base.GetRouteData(httpContext);
    }
} 

This means that we can bypass some of the internal parsing. For perspective, we are talking about a reduction from 5ms to match the last route in the table down to 0.4ms, for a routing table with 650 routes.

Lastly, we disabled the file checks for static content

You can easily disable the file checks the router performs:

// in Global.asax.cs - Application_Start 
RouteTable.Routes.RouteExistingFiles = true;

Trouble is, all static content will now have to do a full route sweep and most likely result in a 404.

To work around this issue we perform the matching earlier.

In web.config we register our special router:

<httpModules>
  <remove name="UrlRoutingModule-4.0"/>
  <add name="UrlRoutingModule-4.0" type="StackOverflow.Helpers.ProfiledUrlRoutingModule"/>
</httpModules>

and

<system.webServer>
<modules runAllManagedModulesForAllRequests="true">
      <remove name="UrlRoutingModule-4.0"/>
      <add name="UrlRoutingModule-4.0" type="StackOverflow.Helpers.ProfiledUrlRoutingModule"/>
</modules> 
</system.webServer>

Then we define:

public class ProfiledUrlRoutingModule : UrlRoutingModule
{
    public override void PostResolveRequestCache(HttpContextBase context)
    {
        if (context.Request.Path.StartsWith("/content/", StringComparison.OrdinalIgnoreCase)) return;

        using (MiniProfiler.Current.Step("Resolve route"))
        {
            base.PostResolveRequestCache(context);
        }
    }
}

This means that any path starting with /content/ bypasses the module. It also allows us to hook up MiniProfiler so we can track how fast our routing is.

This also increases security. We no longer need to worry about any leftover files that are not in the content directory. For content to be served statically, it must live in the content directory. We serve robots.txt from a dynamic route.

Final Notes

These optimizations reduced routing overhead for our hottest path, showing questions, from 5ms with 30ms spikes to 0.1 ms. A huge win.

That said, it is our use of the framework that caused many of these issues to arise. In general the router is not a bottleneck. If you think it may be – start by profiling your routing using MiniProfiler. See what is slow, then implement any or none of these optimizations.

Big thank you to Marc Gravell for coming up with the ProfiledUrlRoutingModule, the LeftMatchingRoute and the clever hack to reorder routes by real world usage. Big thank you to Marcin for being helpful and pointing us in the right direction.

Note I am assured by Marcin that the Regex constraint issue is fixed in System.Web in .NET 4.5

In managed code we trust, our recent battles with the .NET Garbage Collector


Recently Marc blogged about some performance optimisations we implemented at Stack Overflow to work around some Garbage Collection (GC) issues.

This post evoked many emotions and opinions both on his Blog and on Hacker News. In this post I would like to cover some of the history involved with this particular problem, explain how we found it and eventually fixed it.

Jeff has a stopwatch in his brain.

When Stack Exchange developers look at our sites we get a very different view. In particular, on every page we visit there is a little red box that shows us how long it took the page to render, and other goodies. We run MiniProfiler in production, performance is a feature.

One day Jeff started noticing something weird:

Once in a while I visit a page and it takes a whole second to render, but MiniProfiler is telling me it only took 30ms. What is going on?

If the problem was not measurable locally, it must be a network or HAProxy problem, we deduced. We were wrong.

I responded: “try editing your hosts file, perhaps there is a DNS issue”. This fixed nothing.

In hindsight we realised this only happened 1 out of 50 or so times he visited the page, which correlates with my stopwatch theory.

But, I am jumping a bit ahead of myself here, let me tell you a bit of history first.

The tag engine of doom

I love SQL. I am pretty good at it. I know how to think in sets and pride myself in some of the creative solutions I come up with.

One day, I discovered that we had a problem that simply can not be solved efficiently in SQL. Square peg, round hole.

We allow people to cut through our list of 2.2 million questions in 6 different sort orders. That is a problem that is pretty trivial to solve in SQL. We also present users with an aggregation of related tags on each of our sorts, something that is also reasonably simple to denormalize and cache in SQL.

However, stuff gets tricky when you start allowing rich filtering. For example if you are interested in SQL performance questions you can easily filter down the list and maintain all your sort orders.

Historically we found that the most efficient way for SQL Server to deal with these queries was using the full text search (FTS). We tried many different techniques, but nothing came close to the full text engine.

We would run queries such as this to get lists of tags to build a related tag list:

SELECT Tags
    FROM Posts p
    WHERE p.PostTypeId = 1 
    AND p.DeletionDate IS NULL 
    AND p.ClosedDate IS NULL 
    AND p.LockedDate IS NULL  
    AND CONTAINS(Tags, '"sql" AND "performance"')   

A similar query was used to grab question ids.

There were a series of problems with this approach:

  1. It did not scale out, it placed a significant load on the DB.
  2. It caused us to ferry a large amount of information from the DB to the app servers on a very regular basis.
  3. SQL Server FTS does not play nice with existing indexes. The FTS portion of a query always needs to grab the whole result set it needs to work on, prior to filtering or ordering based on other non FTS indexes. So, a query asking for the top 10 c# .net questions was forced to work through 50k results just to grab 10.

To solve these issues we created the so called tag engine. The concept is pretty simple. Each web server maintains a shell of every question in the site, with information about the tags it has, creation date and other fields required for sorting. This list is updated every minute. To avoid needing to scan through the whole list we maintain various pre-sorted indexes in memory. We also pre-calculate lists of questions per tag.

We layer on top of this a cache and a bunch of smart algorithms to avoid huge scans in memory and ended up with a very fast method for grabbing question lists.

And by fast, I mean fast. The tag engine typically serves out queries at about 1ms per query. Our questions list page runs at an average of about 35ms per page render, over hundreds of thousands of runs a day.

Prior to the introduction of the tag engine the question listing pages on Stack Overflow were the single biggest bottleneck on the database. This was getting worse over the months and started impacting general unrelated queries.

There was another twist, the tag engine was always very close to “lock free”. When updating the various structures we created copies of the engine in-memory and swapped to the updated copy, discarding the out-of-date copy.

The worst possible abuse of the garbage collector

The .NET garbage collector, like Java’s, is a generational garbage collector. It is incredibly efficient at dealing with objects that have a short life span (provided they are smaller than 85,000 bytes). During an object’s life it is promoted up the managed heaps from generation 0 to 1 and finally to generation 2.

The .NET garbage collector runs generation 0 sweeps most frequently, generation 1 sweeps less frequently and generation 2 sweeps least frequently. When the Server GC in .NET 4.0 runs a generation 2 GC (which is a full GC) it has to suspend all managed threads. I strongly recommend reading Rico’s excellent Garbage Collector Basics and Performance Hints. .NET 4.5 will improve this situation, however, even with the background GC in .NET 4.5 in certain cases the GC may suspend threads, a fact that is confirmed by Maoni, the chief GC architect, on her Channel 9 talk.

If you happen to have a data structure full of internal references, one that allocates a huge number of objects that only just make it into generation 2 before you abandon them, you are in a pickle.

Your blocking GC is now forced to do lots of work finding the roots of the objects. Then it needs to deallocate them, and possibly defragment the heap.

Our initial tag engine design was a GC nightmare. The worst possible type of memory allocation: lots of objects with lots of references that lived just long enough to get into generation 2 (which was collected every couple of minutes on our web servers).

Interestingly, the worst possible scenario for the GC is quite common in our environment. We quite often cache information for 1 to 5 minutes.

Diagnosing the problem

This issue was first detected by Kyle in July. We performed a few optimisations assuming it was CPU related and heavily reduced CPU usage on the tag engine; as a side effect we also reduced memory churn, which reduced the impact a bit. At the time we did not assume we had a GC issue; instead we assumed there was a rogue thread slowing everything down.

Fast forward about 3 weeks: I enabled some basic profiling on our web servers that I had baked in a while ago:

basic

This report is based on a simple Stopwatch that starts at the beginning of a request and finishes at the end. It measures the web server’s request duration and puts it into neat time buckets partitioned by controller action.

This report showed me something rather odd. The time the web server was reporting was wildly different to the time we were measuring on our load balancer. It was about 60ms off on average.

I knew this thanks to Jarrod’s awesome work. We store all our HAProxy web logs in a SQL Server instance. This has been a godsend when it comes to data analysis. I ran a simple average on the time HAProxy observed and noticed the discrepancy right away.

When I graphed the response time for any route on our site, using the free and open source Data Explorer, we noticed spikes.

spikes

The above graph demonstrates how a limited number of requests would “spike” in response time, many of them taking longer than 1 second.

The frequency of the spikes correlated to the Gen 2 Collections performance counter.

Equipped with this feedback we set out on a mission to decrease the spikes which today look like this (for the exact same number of requests).

after

These spikes had a very interesting characteristic. We now store the time measured inside ASP.NET in our logs too, so we can graph that as well.

with asp.net

ASP.NET did not even notice that the spikes happened. This may indicate that ASP.NET is aware a GC is about to happen and is careful to serve out all its existing requests prior to running a Gen 2 GC. Regardless, the fact that MiniProfiler could not catch the spikes correlated with our empirical evidence.

Note the blue “spikes” in the graph above are caused by a process that rebuilds related question lists. In general this is done in a background thread, however in rare conditions it may happen on first render. The GC spikes are the areas in the graph where there is a pile of yellow dots on top of a blue line. Clearly, it is not that problematic in this graph, it was created on data collected after we heavily improved the situation.

.NET Memory Profiler

I have always been a huge fan of .NET Memory Profiler. It is an awesome tool. Over the years Andreas has kept up with all the latest features. For example, the latest version allows you to attach to a .NET 4 application using the new ICorProfiler attach API.

During the process of optimisation, after every round of changes, I would pull a server off the load balancer and analyze the managed heaps:

memory

For example: the profile above was taken after we refactored the tag engine. I discovered another spot with 500k objects that were alive for only a couple of minutes.

.NET Memory Profiler allows you to gather and compare snapshots of memory, so you can properly measure churn in your managed heaps. Churn in your generation 2 managed heap or large object heap is a major source of GC “freezes”.

Techniques for mitigating GC cost

When I fully understood this issue I quickly realised the solution to this problem.

Why should the tag engine cause requests that do not even consume its services to slow down? That is crazy talk. The tag engine should be isolated in its own process; then we can use a communication channel to query it and even distribute it amongst several machines, issuing simultaneous queries from our web servers to mitigate GC freezes. The first tag engine to respond will be served to the user. Problem solved.

When I pitched this to the team there was violent objection. Now we are going to need to maintain an API. Deploys are now going to be more complex. The list goes on. Services, SLAs and the rest of Yegge’s recommended cures for the API battle were not going to fly here.

Arguments aside, I did realise that by isolating the tag engine you are only limiting the impact of the issue. You are not solving the problem. Also, there is a cost for all the new complexity this would introduce.

We also considered Microsoft’s recommendation:

If you do have a lot of latency due to heavy pause times during full garbage collections, there is a feature that was introduced in 3.5 SP1 that allows you to be notified when a full GC is about to occur. You can then redirect to another server in a cluster for example while the GC occurs.

I was not particularly keen on this suggestion:

  • We did not have a channel built that would allow us to coordinate this activity.
  • There was a non trivial risk that multiple machines would be forced to run full GC collections at the same time, leading to complex coordination efforts. (we were seeing a full GC every couple of minutes)
  • It does not solve the underlying problem

So Marc and I set off on a 3 week adventure to resolve the memory pressure.

The first step was rewriting the tag engine.

Index arrays as opposed to object arrays

In the tag engine we have to keep a global index of all our questions in one of six different sort orders. Naively, we used to use:

Question[] sortedByCreationDate; 

The issue with this general pattern is that it is GC unfriendly. Having this array around means that each question now has one extra reference the GC is going to have to check.

Our tag engine is essentially immutable after construction or refresh. So we could simply replace the 2.2 million question list with:

int[] sortedByCreationDate;  

An array containing indexes into the main immutable question list.

This is a pattern many applications use internally, including Lucene.

For large object sets that are immutable, this is a great solution. Nothing particularly controversial here.
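A minimal sketch of the pattern, with a deliberately simplified Question shape (the real one obviously carries more fields):

using System;
using System.Linq;

struct Question                         // simplified shape, for the sketch only
{
    public int Id;
    public DateTime CreationDate;
}

class TagEngineIndexes
{
    readonly Question[] questions;          // the single immutable question store
    readonly int[] sortedByCreationDate;    // indexes into questions, not object references

    public TagEngineIndexes(Question[] questions)
    {
        this.questions = questions;

        // built once per refresh; to the GC this is just one int[] with nothing to trace
        sortedByCreationDate = Enumerable.Range(0, questions.Length)
            .OrderByDescending(i => questions[i].CreationDate)
            .ToArray();
    }

    public Question NthNewest(int n)
    {
        return questions[sortedByCreationDate[n]];
    }
}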

Classes to Structs where needed

The above optimisation got us a fair distance; however, we were left with a 2.2 million question list full of class pointers. The GC still needs to work really hard crawling each Question object to deallocate it when we swap in the new question list. This was partially mitigated by reusing large chunks of our data during the refresh. However, it was not solved.

The cheapest solution from the GC’s perspective was to have a simple struct array that it could allocate and deallocate in a single operation.

Structs however come with big warnings from Microsoft:

Do not define a structure unless the type has all of the following characteristics:

  • It logically represents a single value, similar to primitive types (integer, double, and so on).
  • It has an instance size smaller than 16 bytes.
  • It is immutable.
  • It will not have to be boxed frequently.

Migrating these millions of objects to Structs was going to violate all these rules. The classes were bigger than 16 bytes, mutable during refresh and represented multiple values.

The unspoken reality is that Microsoft violates these rules in data structures we use daily, like Dictionary.

Working efficiently with structs means you need a deep understanding of C# semantics; for example, understanding this code should be second nature if you are to use structs.

// assumed definitions so the snippet compiles:
// struct MyStruct { public int A; }
// static void DoSomething(MyStruct s) { s.A = 2; }           // operates on a copy
// static void DoSomethingElse(ref MyStruct s) { s.A = 2; }   // operates on the original

var stuff = new MyStruct[10];
var thing = stuff[0];
thing.A = 1; 
Console.WriteLine(stuff[0].A); // returns 0 (thing is a copy of stuff[0])

stuff[0].A = 1;  
Console.WriteLine(stuff[0].A); // returns 1 (array elements are accessed in place)

DoSomething(stuff[0]); 
Console.WriteLine(stuff[0].A); // returns 1 (DoSomething could not do anything, it got a copy) 

DoSomethingElse(ref stuff[0]);
Console.WriteLine(stuff[0].A); // returns 2 (passed by ref, so the original was modified)

You have to be real careful not to operate on copies of data as opposed to the real data. Assignments copy. With this in mind, when dealing with huge collections of entities you can use structs.

As a general rule: I would only consider porting to structs for cases where there is a large number (half a million or more) of medium lived objects. By medium lived, I mean, objects that are released shortly after they reach generation 2. If they live longer, you may be able to stretch it a bit. However, at some point, the sheer number of objects can be a problem, even if they are not deallocated. The days of 64 bit computing are upon us, many enterprise .NET apps consume 10s of gigs in the managed heaps. Regular 2 second freezes can wreak havoc on many applications.

Even with all the upcoming exciting changes with the GC, the reality is that doing a ton less work is always going to be cheaper than doing a lot of work. Even if the work is done in the background and does not block as much.

Huge medium to long term object stores in a generational GC require careful planning, so the same principles apply to Java (even if its GC is less blocking).

.NET and other generational garbage collected frameworks are incredibly efficient at dealing with short lived objects. I would not even consider converting any of our short lived classes to structs. It would not make one iota of a difference.

Avoiding large amounts of int arrays

On our Question object we have a tags array.

struct Question 
{
   int[] Tags {get; set;}
}

Even after we got rid of these 2.2 million refs we were still stuck with a large number of pointers to arrays that live in the struct.

To eliminate this we used a custom struct that pre-allocates 10 spots for tags and provides similar array semantics. Our implementation uses fixed int data[MaxLength]; as the internal backing store. However, you could implement something safe that is identical by having N int fields on the internal embedded struct.

The fixed size structure is a trade-off: we potentially take up more space, but avoid a lot of small memory allocations and deallocations.
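A safe sketch of such an embedded structure (the real one uses an unsafe fixed int buffer and 10 slots; this one uses four plain int fields to stay short):

using System;

// A fixed-capacity "array" of tag ids embedded directly in the owning struct,
// so there is no separate int[] object per question for the GC to track.
struct TagList
{
    int count;
    int t0, t1, t2, t3;

    public int Count { get { return count; } }

    public void Add(int tagId)
    {
        if (count == 4) throw new InvalidOperationException("TagList is full");
        switch (count)
        {
            case 0: t0 = tagId; break;
            case 1: t1 = tagId; break;
            case 2: t2 = tagId; break;
            case 3: t3 = tagId; break;
        }
        count++;
    }

    public int this[int index]
    {
        get
        {
            if (index < 0 || index >= count) throw new ArgumentOutOfRangeException("index");
            switch (index)
            {
                case 0: return t0;
                case 1: return t1;
                case 2: return t2;
                default: return t3;
            }
        }
    }
}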

I also experimented with Marshal.AllocHGlobal but managing lifetime for an embedded structure was way too complicated.

The above techniques are the primitives we use to implement our system, all of this is wrapped up in cleaner object oriented APIs.

In this spirit, if I were to implement a managed version of, say, memcached in C# I would use huge byte[] buffers to store data for multiple entries. Even though the performance of a managed memcached will never match a C++ implementation, you could design a managed store in C# that does not stall.

Conclusion

Some of our techniques may seem radical on first read. Others a lot less so. I cannot stress enough how important it is to have queryable web logs. Without proper visibility and the ability to measure progress, progress cannot be made.

If you face similar problems in a high scale .NET application you may choose the large block allocation approach, rewrite components in C++, or isolate the problem services.

Here is a graph plotting our progress over the months, it shows the percentage of times Question/Show takes longer than 100ms, 500ms or 1000ms.

report

The remaining delays these days are mostly GC related, however they are way shorter than they used to be. We practically erased all the 1 second delays. Our load these days is much higher than it was months ago.


I take full credit for any mistakes in this article and approach, but praise should go to Marc for implementing many of the fixes and to Kyle and Jeff for communicating the urgency and scale of the problem and to Jarrod for making our web logs queryable.

I am glad Jeff has a stopwatch in his brain.

That annoying INSERT problem, getting data into the DB


One of the top questions in the Dapper tag on Stack Overflow asks how to insert records into a DB using Dapper.

I provided a rather incomplete answer and would like to expand on it here.

Many ways to skin an INSERT

Taking a step back let’s have a look at how other ORMs and micro ORMs handle inserts. Note, I am using the term ORM quite loosely.

Assuming we have a simple table:

create table Products (
   Id int identity primary key,
   Name varchar(40), 
   Description varchar(max), 
   Other varchar(max) default('bla'))

This example may seem trivial, it is however quite tricky. The ORM should “know” that Id is special: it should not try to insert values there AND it should hand the generated Id back to the consumer. Possibly, the ORM should also be aware that Other has a default and should yank it out after the insert.

Let’s compare how a few toolkits fare:

Massive

Since Massive is dynamic, there is very little wiring you need to do up front:

public class Products : DynamicModel
{
    public Products(string cnnName) : 
        base(cnnName, primaryKeyField: "Id")
    {
    }
}

Then to insert you do:

var products = new Products(cnnName);
var row = products.Insert(
      new {Name = "Toy", Description = "Awesome Toy"}); 

The returned row is populated with the data you passed in; additionally, row.Id is populated with the value of @@identity (which really should be scope_identity()).

This approach has quite a few advantages. Firstly, it does not muck around with your input by setting fields as a side effect. It also copes with the “default column” problem by simply leaving that column out of the dynamic row; I think getting an error when you ask for it is preferable to getting incorrect info.

PetaPoco

The first of the statically typed options I will look at is PetaPoco.

The class definition is pretty simple:

[TableName("Products")]
[PrimaryKey("Id", autoIncrement = true)]
class Product
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
    public string Other { get; set; }
}

The insert API is pretty simple as well:

 var p = new Product { Name = "Sam", Description = "Developer" };
 object identity = db.Insert(p);

PetaPoco “knows” not to insert the value 0 into the Id column due to the decoration; it also pulls the .Id property back in after the insert.

It populates the Other column that had a database default with null, which is probably not what we wanted.

Various DBs deal with sequences in different ways: SQL Server likes to return decimals, others ints. An unfortunate side effect of this is that the API returns an object.

Entity Framework

Entity Framework is giant; it has a feature for every user, so this task should be quite simple. Our POCO:

public class Product
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.Identity)]
    public int Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
    public string Other { get; set; }
}

Our context:

class MyContext : DbContext 
{
     public DbSet<Product> Products {get; set;}
}

The insert:

var p = new Product { Name = "Sam", Description = "Developer" };
var ctx = new MyContext();
var p2 = ctx.Products.Add(p);
ctx.SaveChanges();

This API is rather simple, however it is a bit odd. The .Add call simply returns the input it was given. When you finally call SaveChanges both p and p2 will be “changed” to include the Id, which is yanked in from the underlying table using SCOPE_IDENTITY. There is an annotation called DatabaseGeneratedOption.Computed that can explain that our column is computed on the DB side, however there is no annotation that can explain it is “optionally” computed on the DB side.

The trivial Dapper implementation

The simple way to do this with Dapper is:

var p = new Product { Name = "Sam", Description = "Developer" };
p.Id = cnn.Query<int>(@"insert Products(Name,Description) 
values (@Name,@Description) 
select cast(scope_identity() as int)", p).First();

Nothing is hidden from us, no side effects “rewrite” our product object. It is however pretty verbose. We are stuck writing SQL we could probably generate.

If we want to be “tricky” and solve the default column problem, we can:

var p = new Product { Name = "Sam", Description = "Developer" };
var columns = cnn.Query(@"insert Products(Name,Description) 
values(@Name,@Description)
select Other,Id 
from Products where Id = scope_identity()            
", new {p.Name, p.Description }).First();

p.Id = columns.Id;
p.Other = columns.Other;

The ORM cracks are starting to show

ORMs are very opinionated:

  • Should records be queued up for insert, or inserted on the fly?
  • Should the “context” object refer to a table (like Massive) or a database (like PetaPoco and EF)?
  • Should the insert helper change stuff on the object you pass in?
  • Should you define table metadata using attributes or constructors? If you are using attributes, do they belong on the class or properties?
  • Should you deal with every little edge case?
  • Should you define a connection string in the application config? Should you be able to pass an open connection to the context constructor? Should the context object have lots of constructors? And so on.
  • Should you handle the ability to save graphs of objects in an ordered fashion?

In general the big ORMs try to please everyone, so you need a manual … a BIG manual … to deal with all this minutiae. Even tiny micro ORMs make a ton of decisions that may, or may not, be the ones you would make.

In a recent Hanselminutes Scott asked my colleague Demis: “Why is it that users shy away from frameworks Microsoft build?”. My EF specific answer is this:

It is not just that Dapper is at least twice as fast as EF when it comes to insert performance. It is that the one-size-fits-all approach does not offer a snug fit. I may, or may not care for the EF unit-of-work pattern. There may be other features in EF I do not like. I may have given up on LINQ as an interface to my DB, for example.

When ORMs compete, they often use a big table to describe every feature they have that the competition lacks. Unfortunately, this is a big driver for bloat and complexity, and it leaves us developers confused. EF 4.3 now adds migrations, not only is EF offering two prescriptions for data access (code first vs “traditional” designer driven), it is also offering a prescription for data migration. I dislike this monolithic approach. I should be able to pick and choose the pieces I want. Microsoft should be able to rev the migration system without pushing a new version of EF. The monolithic approach also means releases are less frequent and harder to test.

When I choose <insert gigantic ORM name here>, I am choosing not to think. The ORM does all the thinking for me, and I assume they did it right. I outsourced my thinking. The unfortunate side-effect of this Non Thinking is that I now have to do lots of manual reading and learning of what the right way of thinking is.

The Dapper solution

In the past I “sold” Dapper as a minimal ultra fast data mapper. In retrospect the message I sent was lacking. Dapper is a piece of Lego you can build other ORMs with. It is likely PetaPoco could be built on Dapper and not suffer any performance hit. Same goes for EF (except that it would be much faster). Dapper is like Rack for databases. It covers all the features ORMs need, including multi mapping, stored proc support, DbCommand reuse, multiple result sets and so on. It does it fast.

I built lots of helpers for Dapper, I blogged about the SqlBuilder.

Recently, I ported Stack Exchange Data Explorer to Dapper. And no, I was not going to hand craft every little INSERT statement, that is crazy talk. Instead, I wanted an API I like:

public class Product
{
   public int Id { get; set; }
   public string Name { get; set; }
   public string Description { get; set; }
   public string Other { get; set; }
}

public class MyDatabase : Database<MyDatabase>
{
   public Table<Product> Products{get;set;}
}

var db = MyDatabase.Init(cnnStr, commandTimeout: 100);

var product = new Product{Name="Toy",Description="AwesomeToy"};
product.Id = db.Products.Insert(new {product.Name, product.Description});

My interface is very opinionated, but the opinions are the best kind of opinions, my opinions. It works, it is simple. With just a few lines of code I could mimic any of the APIs I presented earlier. I chose not to solve the “default column” problem simply because it was not a problem I had.

The whole implementation of this micro ORM is a single file. It is ultra fast (very close to raw Dapper performance) and it uses IL generation in one spot. I do not intend for this little implementation to be “mainstream”. Instead, I hope that people borrow concepts from it when it comes to CRUD using Dapper or other little ORMs. If you want to have a play with it, I chucked it on nuget.
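To give a feel for the concept without the caching and IL generation, the core of such an insert helper can be sketched in a few lines of reflection on top of Dapper (this is illustrative, not the actual Database<T> code):

using System.Data;
using System.Linq;
using Dapper;

public static class InsertHelper
{
    // Derive column names from the properties of the object you pass in,
    // let Dapper bind them as parameters, and return scope_identity().
    public static int Insert(this IDbConnection cnn, string table, object data)
    {
        var columns = data.GetType().GetProperties().Select(p => p.Name).ToList();
        var sql = string.Format(
            "insert {0} ({1}) values ({2}) select cast(scope_identity() as int)",
            table,
            string.Join(",", columns),
            string.Join(",", columns.Select(c => "@" + c)));

        return cnn.Query<int>(sql, data).First();
    }
}

Usage would then look like cnn.Insert("Products", new { product.Name, product.Description }).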

There are now 2 other APIs you can choose from as well for CRUD: Dapper.Contrib and Dapper Extensions.

I do not think that one-size-fits-all. Depending on your problem and preferences there may be an API that works best for you. I tried to present some of the options. There is no blessed “best way” to solve every problem in the world.

For example, if you need an ultra fast way to insert lots of stuff into a SQL DB, nothing is going to beat SqlBulkCopy and you are going to need a custom API for that.

The UPDATE problem is a bit more tricky, I hope to write about it soon.

Stop paying your jQuery tax


Reminder, script tags block rendering

It is common advice to move all your external JavaScript includes to the footer.

Recently, the movement for fancy JavaScript loaders like ControlJS, script.js and so on has also picked up steam.

The reason for this advice is sound and pretty simple. Nothing, nada, zilch below a script tag that is not async or deferred gets rendered until the script is downloaded, parsed (perhaps compiled) and executed. The one tiny exception is Delayed Script Execution in Opera. There are quite a few technical reasons why scripts block: you need a “stable” DOM while executing a script, and browsers need to make sure document.write works.

Often people avoid the async attribute on script tags because async scripts execute in the order they arrive rather than the order they appear, which means you need to be fancier about figuring out when a script is ready. The various JavaScript loaders out there give you a clean API for this.

Why is that jQuery synchronous script include stuck in the html ‘HEAD’ section?

jQuery solves the age old problem of figuring out when your document is ready. The problem is super nasty to solve in a cross browser way. It involves a ton of hacks that took years to formulate and polish. It includes weird and wonderful hacks like calling doScroll for legacy IE. To us consumers this all feels so easy. We can capture functions at any spot in our page, and run them later when the page is ready:

$(wowThisIsSoEasy); // aka. $(document).ready(wowThisIsSoEasy);

It allows you to arbitrarily add bits of rich functionality to your page with incredible ease. You can include little scripts that spruce up your pages from deep inside a nested partial. In general, people break the golden rule and include jQuery in the header. That is because jQuery(document).ready is so awesome.

The jQuery tax

There are 3 types of tax you pay when you have a script in your header: the initial tax, the refresh tax and the constant tax.

The initial tax

The most expensive tax is the initial hit. Turns out more than 20% of the page views we get at Stack Overflow involve a non-primed cache and actual fetching of JavaScript files from our CDN. These are similar numbers to the ones reported by Yahoo years ago. The initial hit can be quite expensive. First, the DNS record for the CDN needs to be resolved. Next a TCP/IP connection needs to be established. Then the script needs to be downloaded and executed.

jquery

Browsers these days have a pack of fancy features that improve performance, including HTTP pipelining, speculative parsing and persistent connections. These may often alleviate some of the initial jQuery tax you pay. You still need your CSS prior to render; modern browsers will concurrently download the CSS and scripts due to some of the above optimisations. However, slow connections may be bandwidth constrained: if you are forced to download multiple scripts and CSS prior to render, they all need to arrive and run before rendering starts.

In the image below (taken from Stack Overflow’s front page) you can see how the screen rendering could have started 100 or so milliseconds earlier if jQuery had been deferred – the green line:

render

An important note is that it is common to serve jQuery from a CDN, be it Google’s or Microsoft’s. Often the CDN you use for your content has different latency and performance to the CDN serving jQuery. If you are lucky Microsoft and Google are faster, if you are unlucky they are slower. If you are really unlucky there may be a glitch with the Google CDN that causes all your customers to wait seconds to see any content.

The refresh tax

People love clicking the refresh button; when that happens requests need to be made to the server to check that local resources are up to date. In the screenshot below you can see how it took 150ms just to confirm we have the right version of jQuery.

refresh

This tax is often alleviated by browser optimisations: when you check on jQuery you also check on your CSS. Asynchronous CSS is risky and may result in a FOUC (flash of unstyled content). If you decide to inline your CSS you may avoid this, but in turn have a much harder problem on your hands.

The constant tax

Since we are all good web citizens (or at least the CDNs are), we have Expires headers, which means we can serve jQuery from the local cache on repeat requests. When the browser sees jQuery in the header it can quickly grab it from the local cache and run it.

Trouble is, even parsing and running jQuery on a modern computer with IE8 can take upwards of 100 milliseconds. From my local timings on this humble page, using a fairly modern i7 960 CPU, the time to parse and run jQuery varies heavily between browsers.

Chrome seems to be able to do it under 10ms, IE9 and Opera at around 20ms and Firefox at 80ms (though I probably have a plugin that is causing that pain). IE7 at over 100ms.

On mobile the pain is extreme, for example: on my iPhone 4S this can take about 80ms.

Many people run slower computers and slower phones. This tax is constant and holds up rendering every time.

Pushing jQuery to the footer

Turns out that pushing jQuery to the footer is quite easy for the common case. If all we want is a nice $.ready function that we have accessible everywhere we can explicitly define it without jQuery. Then we can pass the functions we capture to jQuery later on after it loads.

In our header we can include something like:

window.q=[];
window.$=function(f){
  q.push(f);
};

Just after we load jQuery we can pass all the functions we captured to the real ready function.

$.each(q,function(index,f){
  $(f)
});

This gives us access to a “stub” ready function anywhere in our page.

Or more concisely:

<script type='text/javascript'>window.q=[];window.$=function(f){q.push(f)}</script>

and

<script type="text/javascript">$.each(q,function(i,f){$(f)})</script>

Why have we not done this yet at Stack Overflow?

Something that may seem trivial on a tiny and humble blog may take a large amount of effort in a big app. For one, we need to implement a clean pattern for registering scripts at the page footer, something that is far from trivial. Furthermore, we need to coordinate with third parties that may depend on more than a $ function or even, god forbid, document.write for ads.

The big lesson learned is that we could have avoided this whole problem if we had started off with my proposed helper above.

We spend an inordinate amount of time shaving 10% off our backend time but often forget the golden rule: in general the largest bottleneck is the front end. We should not feel powerless to attack front end performance issues; we can do quite a lot to improve JavaScript bottlenecks and perceived user performance.

Edit: Also discussed on Hacker News, thanks for the great feedback.

MiniProfiler 2.0, almost out of the gate


Nine months ago I blogged about our MiniProfiler open source project. I used some hyperbole in the announcement blog. I goaded others to port MiniProfiler to their favourite web stack and accused us of living in the “stone age” of web profiling.

MiniProfiler attracted many lovers and some haters, who either hated my marketing hyperbole or said that “real” profilers do not make you litter your code with superfluous profiling code. I responded with some helpers that allow you to automatically instrument your web apps. MiniProfiler does not prescribe profiling blocks; you can get very rich instrumentation with minimal changes to your code.
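For context, the explicit flavour of instrumentation looks roughly like this (the controller and LoadQuestion below are made up for the example); the helpers mentioned above can generate equivalent steps without you writing any of it:

using System.Web.Mvc;
using StackExchange.Profiling;

public class QuestionController : Controller
{
    public ActionResult Show(int id)
    {
        var profiler = MiniProfiler.Current;    // may be null when profiling is off;
                                                // the Step extension method copes with that

        object question;
        using (profiler.Step("Load question"))
        {
            question = LoadQuestion(id);        // hypothetical data access call
        }

        using (profiler.Step("Render"))
        {
            return View(question);
        }
    }

    object LoadQuestion(int id) { return new { Id = id }; }   // stand-in for real data access
}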

Exaggeration aside, awesome is infectious. PHP now has a port of MiniProfiler, Python has a port of MiniProfiler.

What problem is MiniProfiler solving?

The main goal has always been to increase developer awareness of performance bottlenecks. Instead of simply “feeling” that a page “may” be slow, MiniProfiler tells you why it is slow. It is in your face, telling you to remove the suck from your code.

With this in mind we made a few great evolutionary improvements to Mini Profiler.

Front end performance analysis

In 1968 Robert Miller wrote:

[Regarding] response to request for next page. […] Delays of longer than 1 second will seem intrusive on the continuity of thought

This was still true in the canonical Jakob Nielsen article 30 years later, and it is still true today. We strive for 100ms response times… everywhere.

However, the pesky speed of light thing keeps on getting in the way as does the stateless nature of the web.

As developers we often think that we only have control over performance problems in the server room, or equipment we own. The harsh reality is that the largest amount of pain is introduced in the front end; often the largest bang for buck is improving the front end.

MiniProfiler now allows you to measure your front end performance. This is accomplished using the Navigation Timing API. It is supported in Chrome, Firefox and IE9 (notably, Opera and Safari are missing support for these APIs).

Additionally, you can add your own client side probes to measure the impact of your JavaScript includes and CSS. This allows you to easily share this information with your team, without needing fancy screen shots.

To add client side probes in an ASP.NET MVC app you would amend your master layout like this:

@using StackExchange.Profiling; 
....
<head>
@this.InitClientTimings();
@this.TimeScript("jQuery", 
    @<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"></script>)

That renders a simple JavaScript probe framework into your page and wraps your includes with it:

<script type="text/javascript">mPt=function(){var t=[];return{t:t,probe:function(n){t.push({d:new Date(),n:n})}}}()</script>
<script type="text/javascript">mPt.probe('jQuery')</script>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"></script>
<script type="text/javascript">mPt.probe('jQuery')</script>

Take this trace, for example, here is an initial request to the meta home page:

initial request to meta home page

Here is a repeat request:

trace with client timing API

We may have a false sense of security thinking that our page only took 13 or so ms server side. The actual perceived performance was much worse: I waited a long time before I saw anything. For the first request I waited 1.5 seconds before the CSS was evaluated. For a repeat view I waited 423ms until the CSS finished evaluating and the page even started rendering.

Tools like firebug, chrome dev tools or IE dev tools do not clearly report these times:

chrome

Looking at the screenshot from Chrome we may think, incorrectly, that putting this script in the header had no impact. Note that Speed Tracer does show these timings and more; however, it is not ubiquitous.

So, what can you do with this kind of information?

  • You can invest time in pushing scripts to the footer and perhaps loading them async.
  • You can start building a case for getting a CDN or a better CDN.
  • You can look at minimizing scripts, CSS includes and trimming them down.
  • You can easily share this information with your team, without needing to take awkward screen shots – you simply share a link.
  • You can start thinking about crazy things, like replicating your database globally and serving pages from the closest location.
  • And so on.

UI less profiling

In the past when MiniProfiler was on, the UI was always there. This is perfect for developers but makes it impossible to gather information about requests that are not coming from developers.

You can now enable profiling on any page in a UI-less mode. To do so, simply render your MiniProfiler include and start a session.

To start a session use the same API you did in the past:

In your Global.asax.cs file:

// Tell MiniProfiler if the current request can see the UI. 
void Application_Start(object sender, EventArgs e) 
{
   MiniProfiler.Settings.Results_Authorize = httpRequest => IsUserAllowedToSeeMiniProfilerUI(httpRequest);
}

protected void Application_BeginRequest() 
{
   if (wouldLikeToGatherProfilingInfoAboutThisRequest)
   {
      MiniProfiler.Start();
   }
}

protected void Application_EndRequest()
{
   MiniProfiler.Stop();
}

The last thing in your master layout:

@if (wouldLikeToGatherProfilingInfoAboutThisRequest) {
   @MiniProfiler.RenderIncludes()
}

Experimental list everything UI

We added a simple UI to list the 100 most recent sessions. This allows you to view sessions you did not participate in and have no links to. The UI is basic; we are accepting patches to spruce it up.

You can view it by going to http://your-sites/mini-profiler-resources/results-index

list UI

This UI is not enabled by default. You will need to “allow” users to see it:

MiniProfiler.Settings.Results_List_Authorize = (request) =>
{
    // you may implement this if you need to restrict visibility of profiling lists on a per request basis 
    return true; // all requests are kosher
};

Script re-organisation

MiniProfiler always used jquery.tmpl for client side rendering. This plugin was “bolted on” to the current jQuery object. The trouble was that sometimes people decided to bolt on their own templating plugins and we stepped on each other’s feet.

The new version of MiniProfiler ships a standalone version of jQuery and jquery.tmpl that are used in .noConflict() mode. This means you can have whatever version of jQuery you wish and it will not conflict with MiniProfiler.

Defer everything

In the past we expected MiniProfiler to be included in the header of the page; the new design encourages you to place MiniProfiler.RenderIncludes() in your footer (if you don’t, it will still attempt to load async). This means that we can accurately measure client side timings without interfering with your page render.

No longer MVCMiniProfiler

From launch we always allowed profiling of Web Forms and ASP.NET MVC apps. Mark Young submitted patches for improving EF support and added WCF support. Yet all projects were stuck importing an awkward namespace called MVCMiniProfiler. For the 2.0 release we shuffled around all the assembly names to use the StackExchange.Profiling namespace. This sends a better message to the community about the purpose of the MiniProfiler project.

When upgrading you need to replace:

using MVCMiniProfiler

With

using StackExchange.Profiling

Other changes

  • Results_Authorize used to take in a MiniProfiler object, now it does not.
  • MiniProfiler.EF now has a special method for initializing profiling for EF 4.2: Initialize_EF42.
  • StackExchange.Profiling.Storage.IStorage has a new List method, used to query data in list mode.

Where do I get it?

We uploaded MiniProfiler, MiniProfiler.EF and MiniProfiler.MVC to NuGet. While we are finalizing the 2.0 release you will need to install it from the Package Manager Console:

Edit: MiniProfiler 2.0 has now been released

PM> Install-Package MiniProfiler

A call to action

MiniProfiler has no website, MiniProfiler could have better documentation, MiniProfiler has not been ported to Ruby-on-Rails, Django, Lift or whatever other awesome web framework you may be using.

Contact me if:

  1. You would like to help build a site for MiniProfiler or get it a domain.
  2. You would like pointers on porting it to your web framework.
  3. You would like to take ownership of the MiniProfiler.WCF package (which is missing from NuGet).

If there are any bugs you discover please report them here: http://code.google.com/p/mvc-mini-profiler/issues/list

If you would like to help improve our documentation edit our little tag wiki and make it awesome.

Patches are always welcome; personally I prefer getting them on GitHub.

Big thanks to Jarrod who designed much of the MiniProfiler internals and the community at large for the patches and support that got us to a 2.0 release.

Why upgrading your Linux Kernel will make your customers much happier


Sometimes we hear that crazy developer talk about some magical thing you can do that will increase performance everywhere by 30% (feel free to replace that percentage with whatever sits right for you).

In the past week or so I have been playing the role of “that guy”. The ranting lunatic. Sometimes this crazy guy throws all sorts of other terms around that make him sound even crazier. Like say: TCP, or Slow Start, or latency … and so on.

So we ignore that guy. He is clearly crazy.

Why is the web slow?

Turns out that when it comes to a fast web the odds were always stacked against us. The root of the problem is that TCP – and in particular the congestion control algorithm we all use, Slow Start – happens to be very problematic in the context of the web and HTTP.

Whenever I download a web page from a site there is a series of underlying events that need to happen.

  1. A connection needs to be established to the web server (1 round trip)
  2. A request needs to be transmitted to the server.
  3. The server needs to send us the data.

Simple.

I am able to download stuff at about 1 Meg a second. It follows that if I need to download a 30k web page I only need two round trips: the first to establish a connection, and the second to ask for the data and get it. Since my connection is SO fast I can grab the data at lightning speed, even if my latency is bad.

My round trip to New York (from Sydney, Australia) takes about 310ms (give or take a few ms):

Pinging stackoverflow.com [64.34.119.12] with 32 bytes of data:
Reply from 64.34.119.12: bytes=32 time=316ms TTL=43

It may get a bit faster as routers are upgraded and new fibre is laid, however it is governed by the speed of light. Sydney to New York is 15,988km. The speed of light is approx 299,792km per second. So the fastest I could possibly reach New York and back would be about 106ms. At least until superluminal communication becomes reality.

Back to reality, two round trips to grab a 30k page is not that bad. However, once you start measuring … the results do not agree with the unsound theory.

[image: reality]

The reality is that downloading 34k of data often takes upwards of a second. What is going on? Am I on dial-up? Is my Internet broken? Is Australia broken?

Nope.

The reality is that to reach my maximal transfer speed TCP needs to ramp up the number of segments that are allowed to be in transit, a.k.a. the congestion window. RFC 5681 says that once a connection starts up you are allowed a maximum of 4 segments initially in transit and unacknowledged. Once they are acknowledged the window grows exponentially. In general the initial congestion window (IW) on Linux and Windows is set to 2 or 3 depending on various factors. Also, the algorithm used to amend the congestion window may differ (Vegas vs CUBIC etc.) but it usually follows the pattern of exponential growth, compensating for certain factors.

Say you have an initial congestion window set to 2 and you can fit 1452 bytes of data in a segment. Assuming you have an established connection, infinite bandwidth and 0% packet loss, it takes:

  • 1 round trip to get 2904 bytes, Initial Window (IW) = 2
  • 2 round trips to get 8712 bytes, Congestion Window (CW)=4
  • 3 round trips to get 20328 bytes, CW = 8
  • 4 round trips to get 43560 bytes, CW = 16

In reality we do get packet loss, and we sometimes only send acks on pairs, so the real numbers may be worse.

Transferring 34k of data from NY to Sydney takes 4 round trips with an initial window of 2, which explains the image above. It makes sense that I would be waiting over a second for 34k.
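
To make the arithmetic concrete, here is a minimal sketch of the idealised model used above (no loss, window doubling every round trip). The helper name and the exact loop are mine, not part of any real TCP stack:

// Minimal sketch: estimate round trips needed under idealised slow start
// (no packet loss, congestion window doubles every round trip).
// The 1452 byte segment size matches the example above.
static int RoundTripsNeeded(int payloadBytes, int initialWindow, int segmentBytes = 1452)
{
    int roundTrips = 0, delivered = 0, window = initialWindow;
    while (delivered < payloadBytes)
    {
        delivered += window * segmentBytes; // bytes delivered by this round trip
        window *= 2;                        // exponential growth phase
        roundTrips++;
    }
    return roundTrips;
}

// RoundTripsNeeded(34 * 1024, initialWindow: 2)  -> 4 round trips (plus 1 to establish the connection)
// RoundTripsNeeded(34 * 1024, initialWindow: 10) -> 2 round trips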

You may think that HTTP keep-alive helps a lot, but it does not. The congestion window is reset to the initial value quite aggressively.

TCP Slow Start is there to protect us from a flooded Internet. However, all the parameters were defined decades ago in a totally different context, way before broadband and HTTP were pervasive.

Recently, Google have been pushing a change that would allow us to increase this number to 10. This change is going to be ratified. How do I know? There are 3 reasons.

  1. Google and Microsoft already implemented it on their servers.
  2. More importantly, the Linux Kernel has adopted it.
  3. Google needs this change ratified if SPDY is to be successful.

This change drastically cuts down the number of round trips you need to transfer data:

  • 1 round trip to get 14520 bytes, IW = 10
  • 2 round trips to get 43560 bytes, CW = 20

In concrete terms, the same page that took 1.3 seconds to download could take 650ms to download. Furthermore, we will have a much larger amount of useful data after the first round trip.

That is not the only issue causing the web to be slow; SPDY tries to solve some of the others, such as poor connection utilization, the inability to perform multiple requests over a single connection concurrently (like HTTP pipelining without FIFO ordering), and so on.

Unfortunately, even if SPDY is adopted we are still going to be stuck with 2 round trips for a single page. In some theoretical magical world we could get page transfer over SCTP, which would allow us to cut down on a connection round trip (and probably introduce another 99 problems).

Show me some pretty pictures

Enough with theory, I went ahead and set up a small demonstration of this phenomenon.

I host my blog on a VM; I updated this VM to the 3.2.0 Linux kernel using Debian backports. I happen to have a second VM running on the same metal, which runs a Windows Server release.

I created a simple web page that allows me to simulate the effect of round trips:

<!DOCTYPE html>
<html>
  <head>
      <title>35k Page</title>
      <style type="text/css">
        div {display: block; width: 7px; height: 12px; background-color: #aaa; float: left; border-bottom: 14px solid #ddd;}
        div.cp {background-color:#777;clear:both;}
      </style>
  </head>
  <body><div class='cp'></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div> ... etc

The cp class is repeated approximately every 1452 bytes to help approximate segments.
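
If you want to reproduce a similar test page, a quick generator along these lines will do. This is a rough sketch – the exact markup, byte counts and the BuildTestPage helper are mine, not the script used for the test above:

using System.Text;

// Rough sketch: emit plain <div>s and drop in a <div class='cp'> marker
// roughly every 1452 bytes, so each "row" of the rendered page
// approximates one TCP segment worth of HTML.
static string BuildTestPage(int targetBytes = 35 * 1024, int segmentBytes = 1452)
{
    var sb = new StringBuilder();
    sb.Append("<!DOCTYPE html>\n<html>\n  <head>\n      <title>35k Page</title>\n");
    sb.Append("      <style type=\"text/css\">\n");
    sb.Append("        div {display: block; width: 7px; height: 12px; background-color: #aaa; float: left; border-bottom: 14px solid #ddd;}\n");
    sb.Append("        div.cp {background-color:#777;clear:both;}\n");
    sb.Append("      </style>\n  </head>\n  <body>");

    int lastMarker = 0;
    while (sb.Length < targetBytes)
    {
        if (sb.Length - lastMarker >= segmentBytes)
        {
            sb.Append("<div class='cp'></div>"); // darker row marks an approximate segment boundary
            lastMarker = sb.Length;
        }
        sb.Append("<div></div>");
    }
    return sb.Append("</body>\n</html>").ToString();
}

// File.WriteAllText("35k.html", BuildTestPage()); // requires System.IO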

Then I used the awesome webpagetest.org to test downloading the page. The results speak louder than anything else I wrote here (you can view them here):

[image: starting]

Both of the requests start off the same way: 1 round trip to set up the TCP connection and a second before any data appears. This places us at T+700ms. Then stuff diverges. The faster Linux kernel (top row) is able to get significantly more data through in the first pass. The Windows box delivers me 2 rows (which is approx 2 segments), the Linux one about 6.

[image: continue]

At 1.1 seconds the Windows box catches up temporarily, but then, 1.3 seconds in, the Linux box delivers the second chunk of packets.

[image: done]

At 1.5 seconds the Linux box is done and the Windows box is only half way through.

[image: done2]

At 1.9 seconds the Windows box is done.

Translated into figures, the Linux box with an IW of 10 is 21 percent faster when you look at total time (1.5 seconds vs 1.9 seconds). If you discount the connection round trip it is about 25 percent faster. All of this without a single change to the application.

What does this mean to me?

Enterprise Linux distributions are slow to adopt the latest stable kernel. Enterprisey software likes to play it safe, for very good reasons. SUSE enterprise is possibly the first enterprise distro to ship a 3.0 kernel. Debian, CentOS, Red Hat and so on are all still on 2.6 kernels. This leaves you with a few options:

  1. Play the waiting game, wait till your enterprise distro backports the changes into the 2.6 line or upgrades to a 3.0 Kernel.
  2. Install a backported kernel.
  3. Install a separate machine running say nginx and a 3.0 kernel and have it proxy your web traffic.

What about Windows?

Provided you are on Windows 2008 R2 and have http://support.microsoft.com/kb/2472264 installed, you can update your initial congestion window using the command:

c:\netsh interface tcp set supplemental template=custom icw=10 
c:\netsh interface tcp set supplemental template=custom

See Andy’s blog for detailed instructions.

Summary

At Stack Overflow, we see very global trends in traffic, with huge numbers of visits coming from India, China and Australia – places that are geographically very far from New York. We need to cut down on round trips if we want to perform well.

Sure, CDNs help, but our core dynamic content is yet to be CDN accelerated. This simple change can give us a 20-30% performance edge.

Every 100ms of latency costs you 1% of your sales. Here is a free way to get a very significant speed boost.


Sam's ultimate web performance tool and resource list


While preparing for my talk at Codemania I started filling my slides with links, clearly not something that scales. So, instead, here is a big list of interesting tools and resources that can help you journey through the murky waters of web performance.

Online page testers

  • Recommended: Web Page Test – The best free multi browser test platform.
  • Other similar tools are: Page Analyzer, Pingdom full page test
  • There are also monitoring tools with page test integration: ShowSlow, GTmetrix
  • REDbot is a handy tool for cache validation. Are you missing an Expires or Cache-Control header?
  • HTTP Archive keeps historical performance snapshots of many sites online. See how your site has done over the last few years.
  • Zoompf (pay) offers a full site optimisation solution; the blog is well worth reading.

In browser waterfall UI

A key tool for analysis of any web performance issues is the waterfall UI. See also this Steve Souders blog post on waterfall UI conventions.

  • Recommended: Google Chrome – clean UI, option to disable cache, best waterfall UI. (CTRL-SHIFT-I) – note: Safari ships with the same tools – slightly dated (enable the develop menu)
  • Opera and IE ship with similar tools. Firefox has Firebug. Additionally there is the non-free HttpWatch(pay) for Firefox and IE.

Browser plugins and performance auditors

  • Recommended: Google Page Speed – trickiest to install, however provides very comprehensive results and helps you perform optimisations
  • Recommended: YSlow – the original performance audit tool, now open source
  • Recommended: Web Developer Toolbar – very important for Firefox testing, allows you to easily disable cache or js.
  • Chrome Speed Tracer – A bit tricky to get running and provides an information overload, perhaps the richest performance profiler out there. Learn how much each selector is costing you.

PNG / JPEG Compressors

  • Recommended PNGOptimizer – Fast, excellent compression, nice GUI and optional command line
  • Notable mention: PNGQuant – Best compression, but loses quality (reduces palette to 256)
  • Other options: OptiPNG, PNGOUT, PNGCRUSH, PNGGauntlet (UI wrapper) and Trimage (UI wrapper for Linux, recommended by Johann)
  • Recommended Paint.NET – Most simple photo editing and re-compression tasks can be done here, nice jpeg compression preview.
  • JPEGmini – handy online tool for re-compressing jpeg files (recommended by Andy)

Sprite utilities and Data Uri tools

Web accelerators

  • Recommended Google’s mod_pagespeed – Comprehensive, after-the-fact optimiser. Automatically compresses CSS/JS, creates sprites, inlines small scripts, optimises browser rendering and more. Easy to set up as a proxy using Apache mod_proxy.
  • Recommended Request Reduce – IIS only. Bundling, minification and spriting.
  • For web accelerator as a service look at blaze.io(pay) (now owned by Akamai) and strangeloop networks(pay).
  • Aptimize (pay) offers an IIS plugin similar to Request Reduce.
  • CloudFlare CDN + site accelerator service, requires you point your DNS records at CloudFlare. Yottaa(pay) and Torbit(pay) offer a similar service.

I need help choosing a CDN

  • Recommended CDNPlanet – lists information about most CDN offerings out there. They also offer a service that allows you to measure your CDN performance for free.
  • Recommended Cloud Climate – see various CDNs performance in your browser

Debugging proxies and latency simulators

  • Recommended Fiddler – must-have debugging tool with many advanced features. Also see: StressStimulus, a load testing extension.
  • Recommended DUMMYNET – the only way to properly simulate a high latency connection; all the rest do not operate at the TCP level so do not properly simulate TCP slow start. ipfw is already installed on all Mac OS X machines by default. On FreeBSD you will need to recompile the kernel for DUMMYNET support (well worth it though). Also works on Windows x32 as an NDIS driver (no x64 support).
  • CharlesProxy (pay) – multi-platform web debugging proxy

JavaScript loaders

Often loading JavaScript files synchronously can be the slowest part of your page.

  • LabJS appears to be the de-facto standard async js loader
  • There is a great matrix comparing all the loaders on the defer.js page
  • Check out this presentation by @aaronpeters

JavaScript tools and libraries

  • jsPerf: community driven JavaScript performance testing playground
  • boomerang.js: open source library for measuring client side performance

Production web profilers

  • Recommended MiniProfiler (.NET, official Perl/Ruby port in progress) – clearly, being a co-author I would recommend this; try it out – you will love it. There is also a PHP fork.
  • Google App Engine Mini Profiler (Python, Free) – a similar tool to MiniProfiler by Ben Kamens.
  • Recommended New Relic (Lite version is free) – birds eye profiling of your production app, been around for quite a few years, extensive reporting.
  • Recommended Google Analytics contains a feature that allows you to track client side performance, but be warned, you are going to need to filter out some data.

Profilers and web profilers

  • dynaTrace AJAX Edition (pay): part performance auditor, part JavaScript profiler; it attempts to automatically catch what is making your page slow. Windows only, I could only get it to work in IE.

JS / CSS minifiers and compressors

Network capture tools

  • Recommended Wireshark – the de-facto standard, cross-platform GUI for capturing and analyzing network traces
  • Microsoft Network Monitor – a very easy to install and run, Windows-specific network capture tool.
  • TCPDump – probably the only tool you would run in production, ships with Linux/BSD/Unix systems
  • Visual Roundtrip Analyzer – this is a very unique tool, a performance auditor with packet capture built in. Well worth a play, though I would be careful acting on some of the analysis.

Web load testing tools

  • WCAT Lightweight tool by Microsoft for load testing sites
  • JMeter Open source Java based web load tester

The future SPDY and beyond

  • SPDY white paper – very important read, SPDY solves many of the performance problems inherent in HTTP
  • RFC 3390 and the later proposal for increasing TCP initcwnd to 10. The objection from Jim Gettys.
  • pjax – pushState + Ajax – this is a relatively new technique for having 1 page apps that look and feel like multi page apps. This allows you to share layout and header/footer and js parsing between pages. Used at 37signals.
  • My article about IW 10 (now enabled at Stack Overflow) – How to enable IW-10 on Windows.

Backend performance

It is true, usually the Golden Rule applies: optimising the backend is often the least bang for buck you can get. However, backend performance cannot be ignored. Some sites are paying the largest amount of performance tax due to backend issues.

  • Recommended .NET Memory Profiler – My favourite memory profiler for .NET, it also works in production. If you have too many objects in your managed heap your whole app will start stalling. This will help you find it.
  • Recommended CPU Analyzer – a tool I wrote, allows you to track down WHY your CPU is pegged at 100% in production. (.NET only) This tool has saved us many many times.
  • Recommended SQL Server Profiler – I love the flexibility and power SQL Profiler offers, if you are running Microsoft SQL Server you really should learn to use it.

Important blogs and web resources

Miscellaneous tools and links

  • Stack Exchange Data Explorer – SEDE allows you to run db queries in your web browser and easily share results, we use it internally to perform analysis on our web logs (that are stored in a specific SQL Server instance). You can see a live demo here.
  • Web Performance Cheat Sheet – a bunch of stats correlating performance to your site’s success.

Note: I will be updating this list, if there is anything I missed, please let me know in a comment.

Testing 3 million hyperlinks, lessons learned


There are over 3 million distinct links in the Stack Exchange network. Over time many of these links rot and stop working.

Recently, I spent some time writing tools to determine which links are broken and assist the community in fixing them.

How do we do it?

First things first, we try to be respectful of other people’s websites.

Being a good web citizen

  • Throttle requests per domain

We use this automatically expiring set to ensure we do not hit a domain more than once every ten seconds; we make a handful of exceptions where we feel we need to test links a bit more aggressively:

public class AutoExpireSet<T>
{

    Dictionary<T, DateTime> items = new Dictionary<T, DateTime>();
    Dictionary<T, TimeSpan> expireOverride = 
         new Dictionary<T, TimeSpan>();

    int defaultDurationSeconds; 

    public AutoExpireSet(int defaultDurationSeconds)
    {
        this.defaultDurationSeconds = 
           defaultDurationSeconds;
    }


    public bool TryReserve(T t)
    {
        bool reserved = false;
        lock (this)
        {
            DateTime dt;
            if (!items.TryGetValue(t, out dt))
            {
                dt = DateTime.MinValue;
            }

            if (dt < DateTime.UtcNow)
            {
                TimeSpan span;
                if (!expireOverride.TryGetValue(t, out span))
                {
                    span = 
                     TimeSpan.FromSeconds(defaultDurationSeconds);
                }
                items[t] = DateTime.UtcNow.Add(span);
                reserved = true;
            }

        }
        return reserved;
    }


    public void ExpireOverride(T t, TimeSpan span)
    {
        lock (this)
        {
            expireOverride[t] = span;
        }
    }
} 
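
A minimal usage sketch (the domain names and the two-second override below are made up for illustration):

// Reserve a slot for a domain before issuing a request; TryReserve
// refuses further reservations until the expiry window has passed.
var perDomain = new AutoExpireSet<string>(defaultDurationSeconds: 10);
perDomain.ExpireOverride("example.com", TimeSpan.FromSeconds(2)); // a domain we test more aggressively

if (perDomain.TryReserve("example.com"))
{
    // safe to issue a request to example.com now
}
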
  • A robust validation function:

Our validation function captures many concepts I feel are very important.

public ValidateResult Validate(
      bool useHeadMethod = true, 
      bool enableKeepAlive = false, 
      int timeoutSeconds = 30 )
{
    ValidateResult result = new ValidateResult();

    HttpWebRequest request = WebRequest.Create(Uri) 
                                  as HttpWebRequest;
    if (useHeadMethod)
    {
        request.Method = "HEAD";
    }
    else
    {
        request.Method = "GET";
    }

    // always compress, if you get back a 404 from a HEAD
    //     it can be quite big.
    request.AutomaticDecompression = DecompressionMethods.GZip;
    request.AllowAutoRedirect = false;
    request.UserAgent = UserAgentString;
    request.Timeout = timeoutSeconds * 1000;
    request.KeepAlive = enableKeepAlive;

    HttpWebResponse response = null;
    try
    {
        response = request.GetResponse() as HttpWebResponse;

        result.StatusCode = response.StatusCode;
        if (response.StatusCode == 
                   HttpStatusCode.Redirect ||
            response.StatusCode == 
                   HttpStatusCode.MovedPermanently ||
            response.StatusCode == 
                   HttpStatusCode.SeeOther || 
            response.StatusCode == 
                   HttpStatusCode.TemporaryRedirect)
        {
            try
            {
                Uri targetUri = 
                  new Uri(Uri, response.Headers["Location"]);
                var scheme = targetUri.Scheme.ToLower();
                if (scheme == "http" || scheme == "https")
                {
                    result.RedirectResult = 
                        new ExternalUrl(targetUri);
                }
                else
                {
                    // this little gem was born out of 
                    //   http://tinyurl.com/18r 
                    //   redirecting to about:blank
                    result.StatusCode = 
                           HttpStatusCode.SwitchingProtocols;
                    result.WebExceptionStatus = null;
                }
            }
            catch (UriFormatException)
            {
                // another gem ... people sometimes redirect to
                //    http://nonsense:port/yay 
                result.StatusCode = 
                    HttpStatusCode.SwitchingProtocols;
                result.WebExceptionStatus =
                    WebExceptionStatus.NameResolutionFailure;
            }

        }
    }
    catch (WebException ex)
    {
        result.WebExceptionStatus = ex.Status;
        response = ex.Response as HttpWebResponse;
        if (response != null)
        {
            result.StatusCode = response.StatusCode;
        }
    }
    finally
    {
        try
        {
           request.Abort();
        }
        catch 
        { /* ignore in case already 
           aborted or failure to abort */ 
        }

        if (response != null)
        {
            response.Close();
        }
    }

    return result;
}
  • From day 0 set yourself up with a proper User Agent String.

If somehow anything goes wrong you want people to be able to contact you and inform you. Our link crawler has the user agent string of Mozilla/5.0 (compatible; stackexchangebot/1.0; +http://meta.stackoverflow.com/q/130398).

  • Handle 302s, 303s and 307s

Even though the 302 and 303 redirect codes are fairly common, there is a less common 307 redirect. It was introduced as a hack to work around misbehaving browsers, as explained here.

A prime example of a 307 would be http://www.haskell.org. I strongly disagree with a redirect on a home page; URL rewriting and many other tools can deal with this use case without the extra round trip. Nonetheless, it exists.

When you get a redirect, you need to continue testing. Our link tester will only check up to 5 levels deep. You MUST have some depth limit set, otherwise you can easily find yourself in an infinite loop.

Redirects are odd beasts; web sites can redirect you to about:config or to an invalid URL. It is important to validate the information you got from the redirect.
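
Putting this together, the outer test loop looks something like the sketch below. It builds on the Validate method above; the TestLink wrapper and its exact shape are mine:

// Sketch: follow redirects up to 5 levels deep so a redirect chain
// (or loop) can never trap the link tester.
ValidateResult TestLink(ExternalUrl url)
{
    var result = url.Validate();
    int depth = 0;
    while (result.RedirectResult != null && ++depth < 5)
    {
        result = result.RedirectResult.Validate();
    }
    return result;
}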

  • Always abort your request once you have the information you need.

In the TCP protocol, when packets are acknowledged, special status flags can be set. If the client sends the server a packet with the FIN flag set the connection is terminated early. By calling request.Abort you can avoid downloading a large payload from the server in the case of a 404.

When testing links, you often want to avoid HTTP keepalive as well. There is no reason to burden the servers with additional connection maintenance when our tests are far apart.

A functioning abort also diminishes the importance of compression; however, I still recommend enabling compression anyway.

  • Always try HEAD requests first then fall-back to GET requests

Some web servers disallow the HEAD verb. For example, Amazon totally bans it, returning a 405 on HEAD requests. In ASP.NET MVC, people often explicitly set the verbs the router passes through, and developers frequently overlook adding HttpVerbs.Head when restricting a route to HttpVerbs.Get. The result is that if you fail (don’t get a redirect or a 200) you need to retry your test with the GET verb.
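
The fall-back itself is straightforward. Below is a sketch built on the Validate signature above; the name ValidateWithFallback and the simplified notion of a "successful" result are mine:

// Sketch: try a cheap HEAD request first, fall back to GET when the
// server rejects HEAD (e.g. a 405) or gives us nothing useful.
ValidateResult ValidateWithFallback(ExternalUrl url)
{
    var result = url.Validate(useHeadMethod: true);
    bool ok = result.StatusCode == HttpStatusCode.OK || result.RedirectResult != null;
    if (!ok)
    {
        result = url.Validate(useHeadMethod: false);
    }
    return result;
}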

  • Ignore robots.txt

Initially I planned on being a good Netizen and parsing all robots.txt files, respecting exclusions and crawl rates. The reality is that many sites such as GitHub, Delicious and Facebook all have a white-list approach to crawling. All crawlers are banned except for the ones they explicitly allow (usually Google, Yahoo and Bing). Since a link checker is not spidering a web site and it is impractical to respect robots.txt, I recommend ignoring it, with the caveat of the crawl rate – you should respect that. This was also discussed on Meta Stack Overflow.

  • Have a sane timeout

When testing links we allow sites 30 seconds to respond, some sites may take longer … much longer. You do not want to heavily block your link tester due to a malfunctioning site. I would consider a 30 second response time a malfunction.

  • Use lots and lots of threads to test links

I run the link validator from my dev machine in Sydney; clearly, serializing 3 million web requests that take an undetermined amount of time is not going to progress at any sane rate. When I run my link validator I use 30 threads.

Concurrency also raises a fair technical challenge considering the above constraints. You do not want to block a thread because you are waiting for a slot on a domain to free up.

I use my Async class to manage the queue. I prefer it over the Microsoft Task Parallel Library for this use case, because the semantics for restricting the number of threads in a pool are trivial and the API is very simple and lean.
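
For illustration, here is one way to bound the concurrency. This sketch uses a semaphore rather than the Async class the crawler actually uses; UrlsToTest and TestLink are placeholders:

// Sketch: keep at most 30 link tests in flight at once.
// Requires System.Threading.
var throttle = new SemaphoreSlim(30);

foreach (var url in UrlsToTest())          // placeholder enumeration of links
{
    var current = url;                     // copy for the closure
    throttle.Wait();                       // blocks once 30 tests are in flight
    ThreadPool.QueueUserWorkItem(_ =>
    {
        try
        {
            TestLink(current);             // placeholder: Validate + redirect handling
        }
        finally
        {
            throttle.Release();
        }
    });
}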

  • Broken once does not mean broken forever

I am still adjusting the algorithm that determines if a link is broken or not. One failure can always be a fluke. A couple of failures in a week could be a bad server crash or an unlucky coincidence.

At the moment two failures a day apart do seem to be correct most of the time – so instead of finding the perfect algorithm we will allow users to tell us when we made a mistake and assume a small margin of error.

In a similar vein, we still need to determine how often we should test links after a successful test. I think once every 3 months should suffice.
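
As a rough sketch, the current heuristic boils down to something like this (recording failure timestamps per link is an assumption about the bookkeeping, not the actual implementation):

// Sketch: a link "looks broken" after two failures at least a day
// apart, with no successful test in between. Requires System.Linq.
bool LooksBroken(List<DateTime> failuresSinceLastSuccess)
{
    if (failuresSinceLastSuccess.Count < 2) return false;
    var ordered = failuresSinceLastSuccess.OrderBy(d => d).ToList();
    return ordered.Last() - ordered.First() >= TimeSpan.FromDays(1);
}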

Some interesting observations from my link testing

Kernel.org was hacked

On the 1st of September 2011 Kernel.org was hacked. What does this have to do with testing links, you may ask?

Turns out that they broke a whole bunch of documentation links, and these links remain broken today. For example: http://www.kernel.org/pub/software/scm/git/docs/git-svn.html appeared in 150 or so posts on Stack Overflow, yet now takes you to an unkind 404 page instead of its new home at http://git-scm.com/docs/git-svn. Of all the broken links I came across, the broken git documentation is the worst failure. Overall it affected over 6000 posts on Stack Overflow. Fixing it with an Apache rewrite rule would be trivial.

Some sites like giving you no information in the URL

The link http://www.microsoft.com/downloads/details.aspx?familyid=e59c3964-672d-4511-bb3e-2d5e1db91038&displaylang=en is broken in 60 or so posts. Imagine if the link was http://www.microsoft.com/downloads/ie-developer-toolbar-beta-3. Even when Microsoft decided to nuke this link from the Internet we could still make a sane guess as to where it was supposed to take us.

Make your 404 page special and useful – lessons from GitHub

Of all the 404 pages I came across, the one on GitHub enraged me most.

Why, you ask?

[image: GitHub 404]

It looks AWESOME, there is even an AMAZING parallax effect. Haters gonna hate.

Well, actually.

https://github.com/dbalatero/typhoeus is linked from 50 or so posts; it has moved to https://github.com/typhoeus. GitHub put no redirect in place and simply takes you to a naked 404.

It would be trivial to do some rudimentary parsing on the URL string to determine where you really wanted to go:

I am sorry, we could not find the page you have linked to. Often users rename their accounts causing links to break. The “typhoeus” repository also exists at:

https://github.com/typhoeus

There you go: no smug message telling me I made a mistake, no Jedi mind tricks. GitHub should take ownership of their 404 pages and make them useful. What bothers me most about the GitHub 404 is the disproportionate effort invested. Instead of giving me pretty graphics, can I have some useful information please?

You could also take this one step further and properly redirect repositories to their new homes. I understand that account renaming is a tricky business; however, it seems to be an incredibly common reason for 404 errors on GitHub.
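
The rudimentary parsing suggested above could be as simple as the sketch below. This is hypothetical – the helper and the idea of suggesting same-named repositories are mine, not anything GitHub actually does:

// Hypothetical sketch: pull the repository name out of a dead GitHub URL
// so the 404 page could suggest other repositories with the same name.
string RepoNameFromPath(Uri url)
{
    // e.g. https://github.com/dbalatero/typhoeus -> "typhoeus"
    var segments = url.AbsolutePath.Trim('/').Split('/');
    return segments.Length >= 2 ? segments[1] : null;
}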

At Stack Overflow we spent a fair amount of time optimising for cases like this. For example take “What is your favourite programmer joke?”. The community decided this question does not belong. We do our best to explain it was removed, why it was removed and where you could possibly find it.

Oracle’s dagger

Oracle’s acquisition of Sun dealt a permanent and serious blow to the Java ecosystem. Oracle’s strict mission to re-brand and restructure the Java ecosystem was mismanaged. A huge amount of documentation was not redirected initially. Even today, the projects under dev.java.net still do not have proper redirects in place. Hudson, the Java continuous integration server, used to live at https://hudson.dev.java.net/; it is linked from over 150 Stack Overflow posts.

Personal lessons

The importance of the href title

In the age of URL shorteners and the rickroll it seems that having URIs convey any sane information about where they will take you is less than encouraged. The reality though is that over 3 years probably 5% of the links you have are going to simply stop working. I am sure my blog is plagued with a ton of broken links as well. Fixing broken links is a difficult task. Fixing them without context is much harder.

That is one big reason I am now going to think a little bit more about writing sane titles for my hyperlinks. Not only does this encourage usability, improve search engine results and help the visually impaired, it helps me fix these links when they eventually break.

The fragility of the hyperlink

When we use Google we never get a 404. It shields us quite effectively from an ever crumbling Internet. Testing a large number of links teaches you that the reality is far from peachy. Does this mean I should avoid linking? Heck no, but being aware of this fact can help me think about my writing. I would like to avoid writing articles that lose meaning when a link dies. On Stack Overflow we often see answers to the effect of:

See this blog post over here.

These kinds of answers fall apart when the external resource dies and neglect to acknowledge the nature of the Internet.

Leaving Stack Exchange


For the last 2 years I lived and breathed my job at Stack Exchange; now I am moving on. It has been an awesome ride, and I wanted to recap some of the things I did and perhaps explain why I am moving on.

Data Explorer

One day I checked my mail and got an email from Jeff Atwood saying I am in the “short list of moon people” he would like to hire, “Obviously, this would be working remotely in a distributed fashion”. A few days later I was working directly with Jeff on an open source project called “Data Explorer”. The goal was to provide a place for Stack Exchange users and the Internet at-large to run SQL queries against the various Stack Exchange public data dumps. It lives at http://data.stackexchange.com. I am incredibly proud of my creation (and the various community contributions)

I was hired as a contractor, my work on Data Explorer was a trial. Jeff and I were on a “date”, we wanted to see if this whole thing could work out.

It worked out quite well. Working remotely was perfect for both of us.

A few months later I was anointed Stack Overflow valued associate #00008.

Early work

We have a tradition at Stack Exchange; when you start out at the core team, you can pick a random bug that bothers you and fix it. My first bug was sorting out browser history on the user page; it used to drive me mad that clicking the back button after paging through a few answers would reset the paging.

After that, I worked on revamping our badge system to associate reasons with badges, built the tag synonym system, created the tag wiki system, created the answer and question draft system and designed a system for internal analytics. Much of this was initiated by me, much of this was initiated by Jeff.

I also created our database migration system because I tired of running SQL directly against the DBs in true cowboy fashion.

This all happened in the first few months. I was mega productive with very little legacy to maintain.

A few months in

After a few months I really found my groove; I was busy doing lots of performance work, fixing bugs and building cool things. I introduced a few diagnostic tools to monitor server render time and SQL Server performance. I also started working on some top secret algorithms to help us deal with the slew of low quality questions at Stack Overflow, and we made some drastic improvements. I also redesigned the Stack Overflow home page with Jeff.

After that I started working on the feature which got us to what Jeff believes is the v1.0 of our engine, suggested edits. This feature allows anonymous and new users to submit edits to any content on our sites. Ever since we deployed this feature we have received over 250 thousand edits we would not have seen otherwise.

Throughout this time I learned much about the game mechanics that make Stack Overflow work and found my groove as a performance “expert”.

Dapper

I blogged about the reason we created Dapper, our own micro ORM. I created it with Marc to combat some very hairy performance issues we were experiencing. These days we use Dapper for the vast majority of data access on Stack Overflow, with the caveat that some code still uses LINQ to SQL for write-based work. It was and is an enormous success for our use case.

MiniProfiler

MiniProfiler started around April 2011, while the entire dev team except for Marc and Jeff were at a dev conference. There was an urgent performance issue: Jeff was browsing through the site and noticed a delay. This delay really bothered him. Marc quickly ported a system he built for chat to the main site. This system shoves a comment at the bottom of the page explaining where the time went, in production. I helped improve this design by providing some richer database query information; Jarrod took this from concept to a polished product and did a smashing job. I have taken a very active part in this project recently, adding client timings, promoting it and reviewing/accepting patches. I blogged about MiniProfiler when it was released. I am incredibly proud of my contribution to the project and am delighted to hear it helps people. I am incredibly proud our humble project is on the second page at nuget.org.

A more mature product

In my second year I had a fair number of features to maintain. Additionally, we accrued a big pile of “performance debt” we had to pay back. I spent a lot of time optimising server performance; a prime example is our tag engine, which I talked about here. Furthermore, I spent a lot of time learning about client performance and deploying many best practices on the site. Improving something that is already extremely optimised due to the work Ben put in is very tricky, and the rewards are not huge. Overall, I feel our performance story on Stack Overflow is pretty spectacular.

During this time I also refined the review section on the site, worked on improving the moderation story, improved the bounty system and built the post notice system. Recently, I tested every external link on our sites.

Working remote

About six months ago Jeff, who lives in California, left. Additionally, Kevin moved to NY. This left only one team member, Geoff, on the west coast. Rebecca, the only other core team member slightly closer to my time zone, is on CDT. In winter, daylight saving time kicks in (and out) and time zones move a couple of hours apart. Unfortunately, due to this, I was left with very little business hour overlap with the rest of my team. I love working remotely, but being 14 hours apart from the majority of the company presents some unique challenges. California is 3 hours closer, offering some business time overlap – even in winter.

Working with incredibly talented people

I feel extremely lucky to have worked with the Stack Exchange core team; everyone is incredibly committed and incredibly talented, and I will sorely miss working with them. It is a rare treat to work in such a team.

Not a week passed without me learning something new and interesting from my team members, I thank them dearly for that.

What is next?

I want to be a founder or a mercenary.

I never took the job at Stack Exchange for the money. In fact, I had a small pay cut. I took the job because I was in love with the mission, I knew I could change things and make the Internet a better place. I think I did in my own little way. I love the fact that I am leaving this job with a big pile of public artefacts. This is worth much more than cold cash.

However, I am not getting any younger, I would like to have a large stake in the next project I take. The only way to have a large stake is to be a founder.

Alternatively, in the near future, I am comfortable being a mercenary: looking for a few short term contracts that pay really well and using that cash to build something awesome.

I have plenty to keep me busy for the next few months; Community Tracker and Media Browser both need love. I would like to finish and publish MiniProfiler for Ruby. On a personal note, I miss working with Ruby.


I would like to thank all the awesome people at Stack Exchange for the amazing ride, I am sure you will continue to make the Internet a better place, keep rocking.

MiniProfiler Ruby Edition


About a year ago we released MiniProfiler for .NET. Ever since, I have been wishing for a Ruby port.

It seems I was not alone as this Stack Overflow question attests.

I am very happy to now announce our first release of MiniProfiler for Ruby. This has been made possible thanks to a large amount of effort by Aleks Totic, Robin Ward and myself. It reuses the UI created by Jarrod Dixon.

Why do you care, what does it do?

MiniProfiler is a ubiquitous production and development profiler; it allows you to quickly isolate performance bottlenecks, both on the server and the client.

[image: example of production profiling on this blog]

In the example above it is clear there is an N+1 query executing for my comments on this blog. This could also be gleaned using bullet, New Relic or the traditional Rails log; however, having this clearly visible on the page you are navigating makes a very big difference.

MiniProfiler will display timings for POST requests, failed requests, json requests and so on. It keeps track of a list of “sessions” you have not seen and displays them all to you next time you have access to a UI. It allows you to easily tell how much time is spent in SQL and how much time outside of SQL.

Additionally, MiniProfiler makes it VERY easy for you to share profiling sessions with other members of your team, just send them the share link.

As an extra added bonus, MiniProfiler allows you to add probes to your html so you can tell when the client started rendering the BODY and so on (out there, scripts are usually a big cause of slow pages)

Installation and help

We will be keeping the documentation at GitHub up to date; there is also a support site at http://community.miniprofiler.com

Installation on Rails is fairly trivial, all you need to do is add the following to your Gemfile:

gem 'rack-mini-profiler'

We have a robust railtie that takes care of hooking stuff up.

In production, you are going to need to whitelist requests you would like to enable profiling on; for example, this blog uses:

before_filter :admin_check

protected 

def admin_check
   # required only in production
   if is_admin? 
      Rack::MiniProfiler.authorize_request
   end
end

Current impact of production profiling

MiniProfiler is designed to be minimally invasive in production. However, in Ruby, method calls can get rather expensive and a few are needed to determine if profiling is allowed or not. For Rails, MiniProfiler inserts itself first in the Rack chain, even before Rack::Lock; the middleware then determines if profiling is required or not. In production it uses a cookie to determine if it is going to even try profiling; if the application whitelists the request the results are displayed, if not, the cookie is reset.

Additionally, we need to intercept the db calls with a method chain and a few checks are performed there as well.

I have tested this and am running a few sites with production profiling on, I am not able to detect any noticeable impact for unauthorized users. We are open to patches that make this area have even less impact.

If you are worried you can always only load the gem in development.

Enough chatter, show me an example

Recently, I upgraded my sites to Rails 3. I had an admin user page that displays a few fields of all 6,000 users in the system on one page. I was a bit surprised to see that rendering this page now takes about 1 second in production. Digging into this problem was quite interesting.

[image: time in SQL]

Looking at the initial trace I could see that about 280ms or so was spent in SQL. The first surprise I had was that Active Record was reporting a far lower number, only 51ms.

Started GET "/admin/users" for 10.0.0.6 at 2012-07-12 19:52:12 +1000
Processing by Admin::UsersController#index as HTML
  Rendered admin/users/index.haml within layouts/default (228.9ms)
["MB Store", "http://www.mediabrowser.tv/store", false]
  Rendered default/_navbar.haml (4.4ms)
Completed 200 OK in 955ms (Views: 242.2ms | ActiveRecord: 51.1ms)

Which makes one wonder, who is telling the truth.

Well, we can use MiniProfiler to figure this out, I attempted to query the data raw:

def index

  step("Raw") do
    a = []
    users = nil

    step("All") do 
      users = ActiveRecord::Base.connection.raw_connection
        .query('select email from users')
    end
    step("Iteration") do 
      users.each do |u|
        a << u
      end
    end
  end
  @users = User.all
end

protected 
def step(name, &blk)
  Rack::MiniProfiler.step(name, &blk)
end

This produced:

[image: raw results]

Aha, so even though the query is quite cheap, streaming it from the DB is NOT being measured in the Rails logs. Browsing through my site I have noticed that, quite consistently, the time to iterate through the results often exceeds the initial execution time. This is, in fact, by design.

MiniProfiler is smart enough to intercept both pg and mysql2 for these cases. If you are using another DB it has fall-backs for ActiveRecord and Sequel interceptions.

However, we are not done.

I was still a bit confused here – where can the rest of the time be hiding? – so I added the ?pp=sample option to the page to gather some call stacks. This option attempts to gather call stacks from the thread running the page (one every millisecond). The reality is that the Ruby VM will not schedule it that aggressively and you only get 120 or so stacks for a 1 second operation.

Anyway, I could see that a whole bunch of stacks looked like this:

 block in execute
 block in log
ActiveSupport::Notifications::Instrumenter instrument
ActiveRecord::ConnectionAdapters::AbstractAdapter log
Rack::MiniProfiler::ActiveRecordInstrumentation log_with_miniprofiler
ActiveRecord::ConnectionAdapters::AbstractMysqlAdapter execute
ActiveRecord::ConnectionAdapters::Mysql2Adapter execute
ActiveRecord::ConnectionAdapters::Mysql2Adapter exec_query
ActiveRecord::ConnectionAdapters::Mysql2Adapter select
ActiveRecord::ConnectionAdapters::DatabaseStatements select_all
 block in select_all
ActiveRecord::ConnectionAdapters::QueryCache cache_sql
ActiveRecord::ConnectionAdapters::QueryCache select_all
 block in find_by_sql
ActiveRecord::Explain logging_query_plan
ActiveRecord::Querying find_by_sql
ActiveRecord::Relation exec_queries
 block in to_a
ActiveRecord::Explain logging_query_plan
ActiveRecord::Relation to_a
ActiveRecord::FinderMethods all
ActiveRecord::Querying all

So I added this line to config/initializers/mini_profiler.rb (you will need to create it):

Rack::MiniProfiler.profile_method(ActiveRecord::Querying, "all")

Then I ran the page again:

[image: after]

Interesting: time in ActiveRecord and SQL is actually responsible for at least 700ms here; the logs are WAY off.

You may claim that this example is ridiculous.

  • You should only be selecting the fields you need (which is true).
  • You should be paging for that much data (sometimes true).
  • If you are displaying that much data and use link_to on each row, that will take up more time than your db time (also true).
  • You should be caching (sure).
  • Server time is usually negligible compared to client time (MiniProfiler also displays client time, so you can answer that yourself).

My point here is that, more often than not, the amount of time spent in ActiveRecord is way higher than the number displayed in the logs. MiniProfiler can help you dig into this and give you nice stack traces to explain what is happening and where it is happening.

Furthermore, this is low impact, ready to go in dev and easily added to production.


Start profiling your apps, in production – you will be amazed at the improvements you can make. Use MiniProfiler, use Google PageSpeed, use WebPageTest. Most of the performance tools out there are complementary.

PS. The website http://miniprofiler.com is very .NET-centric; we would love a new design that integrates the Ruby port. If you feel like making that site better – fork it and send a pull request. Any contributions to MiniProfiler are more than welcome.

Mobile mapping is not a cartographic problem, it is a search problem.


Much has been said about the absolute failure of iOS 6 in the mapping department. Apple took one of its killer features in iOS and destroyed it. Completely.

Tim Cook is very clear about the reason the “old” mapping app was abandoned.

We launched Maps initially with the first version of iOS. As time progressed, we wanted to provide our customers with even better Maps including features such as turn-by-turn directions, voice integration, Flyover and vector-based maps. In order to do this, we had to create a new version of Maps from the ground up.

Speculation is running rampant: did they have a year left before they were required to strip out the Google branded maps? Is this a result of the souring relationship between Apple and Google? Anyone can speculate.

What really bothers me is the naive sentiment expressed by Tim Cook.

iOS users with the new Maps have already searched for nearly half a billion locations. The more our customers use our Maps the better it will get and we greatly appreciate all of the feedback we have received from you..

No Results Found

I was using the maps app today and tried a few searches:

“Closest train station” – No results found
“bondi trat” – No results found
“20 craown st” – Totally crazy suburb somewhere far away
“nearest gym” – No results found

The same searches in Google Maps:

“Closest train station” – highlight all the train stations near me
“bondi trat” – takes me to bondi trattoria which happens to have this nickname
“20 craown st” – a few options including the “20 crown street” closest to me
“nearest gym” – the gym across the road from my house

It is true, Google Maps has some pretty spectacular cartography; this is a company that is giving us underwater Street View. However, Google’s amazing power, when it comes to maps, is search. They realised very early on that an uncompromising and spectacular search feature is what people need. They refine it all the time and have a corpus of data Apple is unlikely to have for a very long time, regardless of the petabytes Apple claim to have. Google uses this data to refine location search. Google is able to “learn” about new restaurants by crawling the Internet. It is able to make some incredibly accurate guesses based on incredibly sophisticated and refined algorithms.

Google was able to create the best local search out there without partnering with Yelp. Instead they took the harder and ultimately better approach of simply crawling the Internet and crunching the data.

The big issue with the Apple branded maps is not that a street or a town is missing here or there. That is easily fixed. The issue is much more fundamental. When I am using my maps app on my phone I am searching. This search needs to be spectacular. If it is not, I will use another product that does have a spectacular search. Sure, turn-by-turn navigation can be handy to have. It is, however, totally useless if I cannot find the place I would like to navigate to.

Fixing street names is relatively easy, implementing a fantastic local search is incredibly hard. It also happens to be a business Apple has never been in.


Google knew this farce was going to take place back in June when iOS 6 was in its first beta. I wonder, why did they not release a native maps app for the iPhone, just like they did for YouTube? Are they trying to convince me to switch to Android?

Conspiracies abound.
