I have moved!

I've moved my blog
CLICK HERE

Wednesday, 10 December 2008

Optimizing Aggregate for String Concatenation

Linq lets you think in a very generally applicable way and solve a very wide variety of problems with a few key concepts. That’s a great thing. But it’s irritating when the elegant solution doesn’t perform as well as an ugly special case.

Using a combination of well-known Linq features I’m going to demonstrate that we already have the power to get the best of both worlds: speed and elegance.

One example that has always irked me (and which is simple enough to demonstrate the idea with) is this:

Enumerable.Range(0, size) 
          .Select(n => n.ToString())
          .Aggregate((a, b) => a + ", " + b);

It’s got all the attributes of a beautiful Linq-style solution – a single expression that produces the thing we want, using very general operations that are parameterized by self-contained functions.

But if size gets large, it’s dreadfully slow. The reason is that Aggregate takes the first two items, uses the function to combine them, then you can kind of imagine it putting the result back on the start of the list to replace the original two items. So each time it does that, the list shrinks, until eventually there’s only one item left, which it returns. All very logical and beautiful – the items on the list are from a set and we have defined a closed binary operation on them, so I guess it’s a magma.

But that first item is an ever-growing string, and each time around it has to be copied into a new string. This is a recipe for disastrous performance. To get faster performance, we need to use a different function, which I’ll call Oggregate:

public static string Oggregate(this IEnumerable<string> source, string delimiter)
{
    StringBuilder builder = new StringBuilder();
 
    builder.Append(source.First());
    foreach (string s in source.Skip(1))
    {
        builder.Append(delimiter);
        builder.Append(s);
    }
 
    return builder.ToString();
}

But this is irksome, because you have to know when to use this optimized version, and then you have to know how to modify your original code. You can’t just flick the “go faster” switch.

A good programming language defines general concepts that compose (go together) well. A good compiler then takes programs written in that language and does terrifying things to them, without telling you, so that they still do what you asked, but they do it fast. The compiler does this by spotting patterns and saying “Ah, I see what you’re trying to do. I know a faster way to do that.” You don’t have to explicitly tell it to do it the fast way. It just notices the pattern and makes the smart decision. Such capabilities are called compiler optimizations. They’re “turnkey” solutions, in the sense that all you have to do is turn the key and they start up. You don’t typically have to think too hard about it. Somebody already did that for you.

So my ideal solution for the above problem would be for the compiler (or a library) to notice the pattern I’m using and use the StringBuilder approach instead. If I’m not using that pattern, it should fall back to doing what it usually does.

I can’t change the compiler, so can I write a library? The problems facing us are three fold:

  • We want to replace the standard library’s version of an algorithm that can operate on any sequence.
  • We only want to do that for sequences of strings.
  • We only want to do it if the function parameter has a very narrowly-defined shape.

Skipping the first one for now, the solution to the second problem is to write a version of Aggregate for the special case of string sequences. The solution to the third is clearly going to involve Linq expressions, as they give us a way to examine the structure of simple expression-like functions, and also to apply such a function to some parameters if necessary:

public static string Aggregate(this IEnumerable<string> source, 
                 Expression<Func<string, string, string>> func)
{
    ... look at the func to decide what to do

Having defined such a function in some namespace, what happens if we add a using directive at the top of the source file where we want to use it? That’s no good, because the compiler has two choices for which function to use. In C++, the thing we’ve written is a kind of “template specialization”, and the rules for template resolution in C++ usually mean that the most specialized choice is the one that the compiler picks. But this isn’t the case in C#. The compiler just gives up and says that we’re being ambiguous.

But if we put our using directive inside a namespace block, then the C# compiler is happy to assume that it should select our version of Aggregate:

using System;
using System.Linq;
using System.Diagnostics;
 
namespace ConsoleApplication5
{
    using Optimizations.AggregateStringBuilder;
 
    class Program
    {
        static void Main(string[] args)
        {

The location of the using directive affects the overload resolution priority assigned to the functions in that namespace. So we can switch on our optimisation at the level of a namespace block. I’m satisfied with that – it constitutes an on/off switch, for my purposes.

Here’s what the Aggregate function looks like:

public static string Aggregate(this IEnumerable<string> source, 
                        Expression<Func<string, string, string>> func)
{
    BinaryExpression root = func.Body as BinaryExpression;
    if (root != null)
    {
        if (root.NodeType == ExpressionType.Add)
        {
            BinaryExpression left = root.Left as BinaryExpression;
            if (left != null)
            {
                if (left.NodeType == ExpressionType.Add)
                {
                    ParameterExpression leftLeft = 
                        left.Left as ParameterExpression;
 
                    if (leftLeft != null)
                    {
                        ConstantExpression leftRight = 
                            left.Right as ConstantExpression;
 
                        if (leftRight != null)
                        {
                            ParameterExpression right = 
                                root.Right as ParameterExpression;
 
                            if (right != null)
                            {
                                if ((leftLeft.Name == func.Parameters[0].Name) 
                                  && (right.Name == func.Parameters[1].Name))
                                    return source.Oggregate(
                                        (string)leftRight.Value);
                            }
                        }
                    }
                }
            }
        }
    }
 
    return source.Aggregate(func.Compile());
}

In other words, it looks pretty ugly. But that’s optimizations for you. It really just looks at the lambda to see if it fits a very rigid pattern: (a + c) + b, where a and b are the parameters to the lambda and c is a constant. If so, it calls Oggregate. Otherwise, it falls back to letting the compiler run the lambda as it usually would. A compiled delegate doesn’t match our Expression<T> argument, so the normal version of Aggregate is called.

This means that, where it’s an appropriate optimization, all it does is examine a few enum properties, perform a few casts, compare a couple of strings and then call Oggregate. So all the extra work (which is very minor) happens outside the loop.

The remaining question is, how badly does it hurt performance when it isn’t an appropriate optimization? As usual, it depends greatly on how you use it. If you’re using Aggregate to concatenate a small number of strings, the compilation step is wasteful to be sure. But again, it happens outside the loop. And in any case, if you find that your program runs slower, it’s just as easy to switch off the optimization as it is to switch it on.

So in conclusion, although this example is pretty simple and so not exactly earth-shattering in itself, it serves to demonstrate how C# gives us the tools to introduce our own turnkey optimizations for special cases of the very general algorithms available in Linq, which was a pleasant surprise for me.

I posted this as a quiz question on Stack Overflow. The reaction to the idea of someone posting a question as a quiz was quite mixed – the votes for the question went negative a few times before averaging out at zero. I think this is a good use of the site, because the end result is a question with an answer and some associated discussion to give more detailed background, alternative approaches, etc. I think it annoyed people who didn’t know the answer, because (as I would freely admit) part of the reward of using Stack Overflow is the ego-boost of handing out knowledge to those in need. If you meet someone who already knows the answer to their question, and then – even worse – you don’t know the answer, then it kind of spoils the fun for you (one guy in particular seemed quite upset). But the fact remains that such an exercise produces the same kind of valuable addition to the site.

5 comments:

Jimmy said...

Oh come on :P the correct implementation should check if the entire expression is an arbitrary tree of only string concatenation (and possibly Replace()/ Substring()) and generate a functionally equivalent expression using StringBuilders ;)

Daniel said...

I am seriously considering that! Or at least allowing the parameter expressions to be any expression in terms of that parameter, so Substring/Trim etc would be useable.

Anonymous said...

What's wrong with:

Enumerable.Range(0, size ).Aggregate(new StringBuilder(),
(a, b) => a.Append(", " + b.ToString(),
(a) => x.Remove(0,2).ToString());

This works with any type and allows you to call any function for that type that returns its value as a string.

davesem

Daniel said...

The key point is the ability to switch an optimization on/off by merely adding/removing a using statement. It's that general idea I was trying convey, not just as a solution for string concatenation. Obviously there are a ton of ways to use a StringBuilder to solve this particular problem.

As for what's wrong with it (you did ask!) should that 'x' be an 'a'? Also it has a slight maintenance gotcha, as I might change the length of the delimiter string and forget to also change the 2 - so make that a named variable. And personally I think it's kind of ugly anyway, with the way it removes something it just appended.

Anonymous said...

Yes, the 'x' should be an 'a' and I left out the closing ')' to the Append method but that notwithstanding I'm not sure I see your point, the title is "Optimizing Aggregate for String Concatenation" not "Turning Optimization on/off By Merely Adding/Removing a Using Statement". Hidden optimization code can be unpredictable and a maintenance nightmare. I've ran into the occasional compiler optimization that breaks code (that why you can shut it off ;)).

Don't misunderstand your code is fairly cool. But as you said "it looks pretty ugly", if there were some type of bug in your optimization (I'm sure there isn't but I occasionally have some bugs in mine) someone trying to debug it could easily overlook the 'using' statement and wonder why code doesn't run as expected.

davesem