Sunday 1 March 2009

The Amazing Speed of the .NET Garbage Collector

Coming from a C++ background, I always balk at designs that involve lots of small heap allocations. It’s something I’m gradually training myself to stop doing, because in managed code, it’s a completely irrational concern.

Given the constant usefulness of lambdas for things like this, it is worth reassuring myself (and others) that the overhead of such techniques is negligible. I already knew it probably was, but I had no idea just how negligible. It really is ridiculously unimportant.

Here’s my simple test program – most of the operations it does are meaningless busywork (like a typical school day). The important thing is the difference between Foo1 and Foo2:

using System;
using System.Collections.Generic;
using System.Diagnostics;

public class Program
{
    private static readonly Queue<string> _q = new Queue<string>();

    public static int _c;

    public static void Push() { _q.Enqueue(_c.ToString()); }
    public static void Pop() { _q.Dequeue(); }
    public static void Count() { _c++; }

    public static int Foo1()
    {
        int x = 0;
        Push();
        Count();
        x++;
        Pop();
        return x;
    }

    public static void Gateway(Action a)
    {
        Push();
        a();
        Pop();
    }

    public static int Foo2()
    {
        int x = 0;
        Gateway(() =>
        {
            Count();
            x++;
        });
        return x;
    }

    public static double Time(Action a)
    {
        Stopwatch s = new Stopwatch();
        s.Start();
        a();
        s.Stop();
        return s.ElapsedMilliseconds;
    }

    public static void Main(string[] args)
    {
        _q.Enqueue("x");
        _q.Enqueue("y");
        _q.Enqueue("z");

        int reps = 100000000;

        double direct = Time(() =>
        {
            for (int n = 0; n < reps; n++)
                Foo1();
        });
        
        Console.WriteLine("Direct, " + reps + " calls: " + direct / 1000 + " s");

        double lambda = Time(() =>
        {
            for (int n = 0; n < reps; n++)
                Foo2();
        });

        Console.WriteLine("Lambda, " + reps + " calls: " + lambda / 1000 + " s");

        double overhead = (lambda - direct);
        double percall = overhead / reps;

        Console.WriteLine("Overhead per call: " + percall * 1000000 + " ns");
    }
}

Foo1 calls Push, Count and Pop and also fools around with a local variable, x. Note that Push and Pop manipulate a queue of strings. The queue stays a constant length during the test, with newly allocated strings being inserted and old strings being removed, so the GC has some work to do. The reason I used a queue was so that the lifetimes of GC objects would not simply coincide with the stack of function calls, forcing the CLR (I vaguely assume) to truly allocate objects in Generation 0 instead of “cheating” and using the stack to store everything.

Foo2 does the same, but it does so via Gateway. Also note that it captures the local variable x in a closure, so it can be read and modified in Foo2 and in the nested lambda.

This means that every call to Foo2 has to allocate two GC objects. Firstly, the closure is implemented by a compiler-generated class, which looks like this:

[CompilerGenerated]
private sealed class <>c__DisplayClass1
{
    public int x;

    public void <Foo2>b__0()
    {
        Program.Count();
        x++;
    }
}

The local variable x has become a field of that class. So every time Foo2 is called, an instance of <>c__DisplayClass1 is allocated to store the variable x. Then there’s the lambda, which has become the method <Foo2>b__0 of the class. In order to pass it to Gateway, an instance of the Action delegate must be created that has an Invoke method that forwards to <Foo2>b__0.
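As an aside, it is the capture of x that forces the per-call allocations. A lambda that captures nothing can be cached by the compiler in a hidden static field and reused on every call (an implementation detail of the current C# compiler, not something the language guarantees). A hypothetical Foo3, for comparison:

```csharp
public static int Foo3()
{
    // Captures no locals, so the compiler is free to cache a
    // single Action instance and reuse it on every call:
    // no closure object, no new delegate per call.
    Gateway(() => Count());
    return 0;
}
```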

To confirm this, look at the IL for Foo2:

.method public hidebysig static int32 Foo2() cil managed
{
    .maxstack 3
    .locals init (
        [0] class Program/<>c__DisplayClass1 CS$<>8__locals2)
    L_0000: newobj instance void Program/<>c__DisplayClass1::.ctor()
    L_0005: stloc.0 
    L_0006: ldloc.0 
    L_0007: ldc.i4.0 
    L_0008: stfld int32 Program/<>c__DisplayClass1::x
    L_000d: ldloc.0 
    L_000e: ldftn instance void Program/<>c__DisplayClass1::<Foo2>b__0()
    L_0014: newobj instance void [System.Core]System.Action::.ctor(object, native int)
    L_0019: call void Program::Gateway(class [System.Core]System.Action)
    L_001e: ldloc.0 
    L_001f: ldfld int32 Program/<>c__DisplayClass1::x
    L_0024: ret 
}

Sure enough, there are two calls to newobj – the first one allocates the storage for the closure, the second allocates the Action delegate. Compare this with the much cleaner-looking code for Foo1:

.method public hidebysig static int32 Foo1() cil managed
{
    .maxstack 2
    .locals init (
        [0] int32 x)
    L_0000: ldc.i4.0 
    L_0001: stloc.0 
    L_0002: call void Program::Push()
    L_0007: call void Program::Count()
    L_000c: ldloc.0 
    L_000d: ldc.i4.1 
    L_000e: add 
    L_000f: stloc.0 
    L_0010: call void Program::Pop()
    L_0015: ldloc.0 
    L_0016: ret 
}

No allocations, just some method calls. So the overhead of Foo2 must be pretty big, right? Especially when those heap-allocated objects will need to be garbage-collected all the time. What if a lot of them build up? Won’t that put “pressure” on the GC and make everything grind to a halt?

Well, try running the test program. It does a hundred million calls to Foo1, and then the same number of calls to Foo2. It then figures out the difference in the timings for these runs. On my VM of Windows XP running on a MacBook Pro, the output is:

Direct, 100000000 calls: 27.209 s
Lambda, 100000000 calls: 33.515 s
Overhead per call: 63.06 ns

Now, it’s customary with this kind of test to observe that 33 seconds is longer than 27 seconds and try to conclude something from that, like “lambdas make your code about 25% slower”. However, that would be completely stupid. The code in the test is doing very little apart from keeping the GC a little busy, which makes the overhead appear far more significant than it would be in a real program. What matters is that it only took about 6 seconds longer even though we are doing a hundred million calls, i.e. the overhead per call is about 60 nanoseconds. Which is very, very small. Code that actually did something useful would make such an overhead utterly invisible.
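The per-call figure falls straight out of the arithmetic; as a sanity check on the numbers above:

```csharp
double direct = 27.209;   // seconds for 10^8 direct calls
double lambda = 33.515;   // seconds for 10^8 lambda calls

// (33.515 - 27.209) = 6.306 s spread over 10^8 calls
double perCallNs = (lambda - direct) / 100000000 * 1e9;  // 63.06 ns
```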

Note that if the GC were not collecting at regular intervals during the test, we would have a build-up of 400 million objects (100 million strings in Foo1, another 100 million in Foo2, which also needs 100 million closures and 100 million delegates). The per-object overhead is 8 bytes (two pointers used by the CLR), so that’s approximately 3 GB of memory even before considering the actual data stored in the objects. So the GC clearly is cleaning up old objects during the test, or else it wouldn’t finish successfully.
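If you want direct evidence rather than this back-of-envelope argument, the GC class can report how many generation-0 collections happened during the run. A minimal sketch (GC.CollectionCount has been available since .NET 2.0; the count you see will vary by machine):

```csharp
int gen0Before = GC.CollectionCount(0);

for (int n = 0; n < 100000000; n++)
    Foo2();

// Each collection reclaimed a batch of dead strings, closures
// and delegates without any noticeable pause in the program.
Console.WriteLine("Gen 0 collections: "
    + (GC.CollectionCount(0) - gen0Before));
```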

The GC heap is so fast that in a real program, even in tight loops, you can use closures and delegates without giving it a second’s thought (or even a few nanoseconds’ thought). As always, work on a clean, safe design, then profile to find out where the overhead really is.
