Optimizing immutable stacks & queues in C#
Microsoft released an immutable collections library on NuGet. It contains immutable stacks, queues, vectors, sorted maps & sets, and hash maps & sets. I reimplemented immutable stacks and queues to see if we can go faster.
Immutable Stacks
An immutable stack supports the following operations:
In addition there should be a way to get an empty stack. The benchmark that I will use is the following:
- We start by putting an int on the stack, for example 25.
- We repeatedly pop the top of the stack, k, and then we push 1..k-1.
- We terminate when the stack is empty.
For example when we start with 4, it goes like this:
4
3 2 1
2 1 2 1
1 1 1 2 1
1 1 2 1
1 2 1
2 1
1 1 1
1 1
1
end
You can imagine that this creates quite a long process of pushing and popping when you start with a big number:
25
24 23 22 21 ... 3 2 1
etc
Linked list stack
The most obvious implementation is as a linked list. I wrap all data types in a struct in the style of ML functors:
We represent an empty stack with new End()
, Push allocates a new node that points to the rest of the stack, and Pop just follows the rest
pointer. We can also represent the end of the stack with null
. Because we can’t invoke methods on null
, we have to wrap it in a struct, so that we can do Empty.Push(x)
and Empty.IsEmpty
. Here’s the full implementation:
Buffered stack
It seems hard to do better than this: pushing is allocating 1 object, and popping is dereferencing 1 object. How can we possibly push with less than 1 allocation or how can we pop with less than 1 pointer dereference? We can exploit C#’s value types. The idea is to represent the stack with a struct and optionally store the top of the stack inside that struct:
The stack starts out empty, with count = 0
, and rest = null
. When we push an element on the stack, we store it in item
in the struct. So we don’t need to do any allocation, because it’s a struct! Then when we push another element on the stack, we already have the one stored in item
, and a new element x
. Now we create a linked list node with TWO items inside it: item
and x
. We can continue like this: every time the stack has an odd number of elements, the top of the stack is stored in the outer struct. We only need to allocate a new list node for half of the pushes! Similarly, we only need to dereference rest
for half of the pops. Here’s an example:
operation struct linked list
------------------------------------------
Empty - -
Push(1) 1 -
Push(2) - (2,1)
Push(3) 3 (2,1)
Push(4) - (4,3)-(2,1)
Push(5) 5 (4,3)-(2,1)
Push(6) - (6,5)-(4,3)-(2,1)
Pop()->6 5 (4,3)-(2,1)
Pop()->5 - (4,3)-(2,1)
Pop()->4 3 (2,1)
Pop()->3 - (2,1)
Pop()->2 1 -
Pop()->1 - -
Benchmarks
Here are the results:
Implementation | Time | Memory |
---|---|---|
Collections.ImmutableStack | 3028 ms | 32000000 bytes |
LinkedListStack | 2943 ms | 32000000 bytes |
NullLinkedListStack | 2851 ms | 31999936 bytes |
BufferedStack | 8468 ms | 15999936 bytes |
BufferedStack uses less memory (as expected), but unfortunately it actually runs much slower than a plain linked list. So much for that theory. The best immutable stack is NullLinkedListStack, which is a little bit faster than Microsoft’s Collections.Immutable.
Immutable Queues
An immutable queue supports the following operations:
This is exactly the same as a stack, but the behavior is different: a stack pushes and pops from the same end, a queue enqueues and dequeues from opposite ends. This is a bit more difficult to achieve with an immutable data structure. We can’t simply use a linked list, because although we can efficiently put an element on the front of an immutable linked list, we can’t efficiently remove an element from the back of a linked list. The standard solution to this is to represent the queue as a pair of stacks front
and back
, representing the front of the queue and the back of the queue. When we enqueue an item, we push it onto back
. When we dequeue an item, we pop it from front
. The trick is that when front
is empty, and we want to dequeue an item, we pop all items from back
and push them on front
first. Here’s an example:
operation front back
------------------------------------------
Empty - -
Enqueue(1) - 1
Enqueue(2) - 2,1
Enqueue(3) - 3,2,1
Enqueue(4) - 4,3,2,1
now when we dequeue, we pop all elements from back and push them on the front,
effectively reversing the stack:
1,2,3,4 -
Dequeue()->1 2,3,4 -
Dequeue()->2 3,4 -
Dequeue()->3 4 -
Dequeue()->4 - -
The time complexity is amortized O(1) because each element we put on the queue is only part of a reversal once. Unfortunately with immutable queues amortized complexity doesn’t work so well, since after you do an operation the old version of the data structure is still there. So it could be the case that somebody dequeues repeatedly on a queue with an empty front, causing the expensive reversal operation to happen multiple times. There are ways around that, described in Chris Okasaki’s thesis. Getting O(1) complexity even if older versions can be used multiple times makes the data structure more complicated and costs us in terms of raw speed, so I will not do that.
Here’s the full code:
Can we optimize this? Yes we can: at the moment when we reverse the back
stack, there is no reason to put them into front
as a linked list. Since we are traversing the entire back
stack anyway, we may as well store front
efficiently in a contiguous array. Then when we Dequeue an item, we simply decrement a counter instead of copying the entire array. This keeps the GC from reclaiming the memory of elements that have already been dequeued until front
is completely empty, however. Here is the full code:
Benchmarks
For queues I use the same benchmark as for stacks, except instead of pushing and popping at the front, we enqueue and dequeue at opposite ends.
Here are the queue results:
Implementation | Time | Memory |
---|---|---|
Collections.ImmutableQueue | 2570 ms | 63999944 bytes |
DoubleStackQueue | 1266 ms | 63999904 bytes |
StackArrayQueue | 917 ms | 36194368 bytes |
The StackArrayQueue is clearly superior to the others: it is far faster and uses far less memory.
You can find the full code including benchmarks in the ImmutableCollections Github repository.