Linq lets you think in a very generally applicable way and solve a very wide variety of problems with a few key concepts. That’s a great thing. But it’s irritating when the elegant solution doesn’t perform as well as an ugly special case.
Using a combination of well-known Linq features I’m going to demonstrate that we already have the power to get the best of both worlds: speed and elegance.
One example that has always irked me (and which is simple enough to demonstrate the idea with) is this:
It’s got all the attributes of a beautiful Linq-style solution – a single expression that produces the thing we want, using very general operations that are parameterized by self-contained functions.
But if size gets large, it’s dreadfully slow. The reason is that Aggregate takes the first two items, uses the function to combine them, then you can kind of imagine it putting the result back on the start of the list to replace the original two items. So each time it does that, the list shrinks, until eventually there’s only one item left, which it returns. All very logical and beautiful – the items on the list are from a set and we have defined a closed binary operation on them, so I guess it’s a magma.
But that first item is an ever-growing string, and each time around it has to be copied into a new string. This is a recipe for disastrous performance. To get faster performance, we need to use a different function, which I’ll call Oggregate:
But this is irksome, because you have to know when to use this optimized version, and then you have to know how to modify your original code. You can’t just flick the “go faster” switch.
A good programming language defines general concepts that compose (go together) well. A good compiler then takes programs written in that language and does terrifying things to them, without telling you, so that they still do what you asked, but they do it fast. The compiler does this by spotting patterns and saying “Ah, I see what you’re trying to do. I know a faster way to do that.” You don’t have to explicitly tell it to do it the fast way. It just notices the pattern and makes the smart decision. Such capabilities are called compiler optimizations. They’re “turnkey” solutions, in the sense that all you have to do is turn the key and they start up. You don’t typically have to think too hard about it. Somebody already did that for you.
So my ideal solution for the above problem would be for the compiler (or a library) to notice the pattern I’m using and use the StringBuilder approach instead. If I’m not using that pattern, it should fall back to doing what it usually does.
I can’t change the compiler, so can I write a library? The problems facing us are three fold:
- We want to replace the standard library’s version of an algorithm that can operate on any sequence.
- We only want to do that for sequences of strings.
- We only want to do it if the function parameter has a very narrowly-defined shape.
Skipping the first one for now, the solution to the second problem is to write a version of Aggregate for the special case of string sequences. The solution to the third is clearly going to involve Linq expressions, as they give us a way to examine the structure of simple expression-like functions, and also to apply such a function to some parameters if necessary:
Having defined such a function in some namespace, what happens if we add a using directive at the top of the source file where we want to use it? That’s no good, because the compiler has two choices for which function to use. In C++, the thing we’ve written is a kind of “template specialization”, and the rules for template resolution in C++ usually mean that the most specialized choice is the one that the compiler picks. But this isn’t the case in C#. The compiler just gives up and says that we’re being ambiguous.
But if we put our using directive inside a namespace block, then the C# compiler is happy to assume that it should select our version of Aggregate:
The location of the using directive affects the overload resolution priority assigned to the functions in that namespace. So we can switch on our optimisation at the level of a namespace block. I’m satisfied with that – it constitutes an on/off switch, for my purposes.
Here’s what the Aggregate function looks like:
In other words, it looks pretty ugly. But that’s optimizations for you. It really just looks at the lambda to see if it fits a very rigid pattern: (a + c) + b, where a and b are the parameters to the lambda and c is a constant. If so, it calls Oggregate. Otherwise, it falls back to letting the compiler run the lambda as it usually would. A compiled delegate doesn’t match our Expression<T> argument, so the normal version of Aggregate is called.
This means that, where it’s an appropriate optimization, all it does is examine a few enum properties, perform a few casts, compare a couple of strings and then call Oggregate. So all the extra work (which is very minor) happens outside the loop.
The remaining question is, how badly does it hurt performance when it isn’t an appropriate optimization? As usual, it depends greatly on how you use it. If you’re using Aggregate to concatenate a small number of strings, the compilation step is wasteful to be sure. But again, it happens outside the loop. And in any case, if you find that your program runs slower, it’s just as easy to switch off the optimization as it is to switch it on.
So in conclusion, although this example is pretty simple and so not exactly earth-shattering in itself, it serves to demonstrate how C# gives us the tools to introduce our own turnkey optimizations for special cases of the very general algorithms available in Linq, which was a pleasant surprise for me.
I posted this as a quiz question on Stack Overflow. The reaction to the idea of someone posting a question as a quiz was quite mixed – the votes for the question went negative a few times before averaging out at zero. I think this is a good use of the site, because the end result is a question with an answer and some associated discussion to give more detailed background, alternative approaches, etc. I think it annoyed people who didn’t know the answer, because (as I would freely admit) part of the reward of using Stack Overflow is the ego-boost of handing out knowledge to those in need. If you meet someone who already knows the answer to their question, and then – even worse – you don’t know the answer, then it kind of spoils the fun for you (one guy in particular seemed quite upset). But the fact remains that such an exercise produces the same kind of valuable addition to the site.