The best collections library design, part 2
In the previous post we’ve looked at collection libraries of various languages, and concluded that we need a sequence abstraction. Mcneja and pipocaQuemada provided an excellent example on /r/haskell:
Suppose we want to implement a combined filter & map operation on Sets, without a roundtrip via lists:
pipocaQuemada suggested the following “So, we could define a pretty terrible version, pretty simply:”
Lets look at what this is doing:
- It maps
fover the set, which returns
Maybe a. That means that it will construct an ordered set represented by some tree structure that is ordered by the
Maybe. All the duplicate Nothings will be filtered out, leaving only one.
filter (/= Nothing)goes over all elements, and throws the one
Nothingout, and builds an entirely new ordered set represented as a tree.
- It removes all the
Justwrappers, and constructs the final ordered set.
This is an excellent example why
filter should not work on the collections, but on a sequence abstraction. Not only would that make
filterMap generic over all collections, it would also be much more efficient.
In this post we will evaluate the pros and cons of various options. A sequence abstraction will need to support operations such as
takeWhile, and more. As we will see, not all sequence abstractions support all those operations, and that’s where there’s trade-offs to be made.
One candidate are lazy lists. I’m going to use the strict language F# for this, to make it explicit where the laziness is happening. To define lazy lists we first need lazy values.
We define a type
Lazy<'t> which has two operations:
delay creates a lazy value from a function. The function is not executed until
force is called on that lazy value.
force also caches the result, so that if
force is called on the same lazy value multiple times, it only performs the computation once.
Using this we can define the type of lazy lists:
Lazy is also around the outer layer. This is important to make the list fully lazy. In particular, this would be incorrect:
Because for this type the outermost layer is a
Cons/Nil, which is not wrapped in
Lazy. We define
Pros and cons of lazy lists
A nice property of lazy lists are that they are a pure value, they support many sequence operations, and they can represent infinite sequences. That said, they also have several disadvantages:
- They can easily cause memory leaks. An innocuous function like
average xs = sum xs / length xscauses a memory leak: the whole sequence
xsis built up in memory during the
sum, and then that sequence is traversed again in
length. This means that there is a time at which the entire sequence has to be present in memory, and it can’t be garbage collected until
- Lazy lists create a lot of garbage even when there is no memory leak. This is mitigated by compiler optimizations such as stream fusion, but relying on compiler magic for performance is a not ideal. Predictable performance is important. Ideally the sequence abstraction should not create garbage by construction, so that if you write your algorithms in terms of the sequence abstraction, you’re guaranteed not to put pressure on the GC.
- They do not support parallel execution. A lazy list is an inherently sequential structure. In order to get to the next element you traverse a pointer. Consider mapping the function
x -> x+1over a big array. In principle this can be done in parallel by dividing the array into chunks. But if we transform the array to a lazy list, then
mapand then transform it back to an array, we’ve lost the parallelism.
- They do not support resource cleanup. If we want to process a file we want to close that file at the end of that process. Lazy lists do not support any deterministic way to clean up resources when the operations are done.
- They cannot represent push collections, such as event streams.
- Effects on lazy list operations, like mapping over a list with an effectful function f, are questionable because of the unpredictable order of execution.
- It does not support
dropWhilein an efficient way: you have to traverse the entire front of the list just to skip over elements.
Many imperative languages have the concept of iterator. I’ll use Python as an example. In Python an iterator is stateful object an object with a
next() method, which returns the value of the current element and advances the state to the next element. When the iterator is at the end,
next() throws a
StopIteration() exception. This is just a way to encode returning an optional element:
next : () -> Maybe a. Iterators in Python support generator comprehensions, which are like list comprehensions except they work over iterators. Iterators can also be consumed by for loops:
Python has language support for creating iterators using the
yield statement. This lets you easily define iterators with just a function:
map function can be defined easily:
Iterators have the benefit that to do iteration they do not need to allocate anything. Like lazy lists, they can support infinite sequences. Like lazy lists, they don’t support deterministic resource cleanup, effectful iterators are questionable, they don’t support push collections, and they do not support parallel execution. Iterators do not cause memory leaks like lazy lists do, but they have other problems. Suppose we wrote
This will not cause a memory leak, but it will not get the right answer. Iterators are stateful, and the
sum(xs) has already fully consumed the iterator
xs. This means that the
length(xs) won’t work. Python’s
itertools module does have a function that will clone an iterator, called
Tee sets up a dequeue. Every time one of its output iterators are advanced, it will add/remove from the dequeue as neccessary. If its output iterators are iterated over in tandem, then this operates in constant space. If one of its output iterators is first fully used, and after that the second iterator is fully used, then tee has to buffer the entire sequence in the dequeue. That’s what’s happening in the above example: it has the same memory leak as the lazy list version:
tee() will fully build the dequeue. At least the memory leak is explicit now…
C++ has an elaborate iterator model:
- Input iterators: single pass iterators, much like Python iterators and C# IEnumerators
- Forward iterators: iterators that can be copied
- Bidirectional iterators: an iterator that can move backward as well as forward
- Random access iterators: iterator that can jump to any index
- Output iterators: iterator that can only be written to
C++ iterators can be mutable or immutable. Mutable iterators can be used to modify the underlying collection at the index that the iterator is currently pointing to.
C# has an interface IEnumerator, which represents an iterator in a slightly different way (using Current and MoveNext() methods), but this difference is immaterial. The real difference with iterators is that an
IEnumerator has a
Dispose method, which you should call when you are done with it. This allows the
IEnumerator to clean up any resources like file handles.
The C# designers had a brilliant idea: you shouldn’t work directly with IEnumerators, you should work with IEnumerator factories. That’s what IEnumerable is:
The collection operations work on IEnumerable, not on IEnumerator! Suppose you have a pipeline like this:
(in C#, map is called Select, and filter is called Where)
This constructs an IEnumerable pipeline, but it does not actually run the pipeline until
Sum() is called.
Sum will call
GetEnumerator on the
.Select(x => x*2), which will in turn call
.Where(x => x < 100), which will call
.Select(x => x+1), which will call
xs. That process sets up the
IEnumerator (iterator) pipeline. Once
Sum is done summing, it will
Dispose the iterator that it’s iterating over, which calls
Dispose on the previous iterator in the pipeline, and so forth. If
xs represents a file, then the file can be opened when
GetEnumerator is called, and closed when
Dispose is called, so we have the ability to do resource cleanup.
What happens when we try to compute an average?
This will create the iterator pipeline twice. If
xs is itself a pipeline, then all the operations in that pipeline will be executed twice. That’s something to keep in mind with IEnumerables: they represent computations that can be executed, not collections that are data. If you want to buffer, you have to explicitly convert your IEnumerable to an array and back. This architecture means that we also get more predictable effectful operations. If we do this:
ys represents a computation that has effects when executed. Those effects are executed in a predictable manner, but one has to keep in mind that IEnumerables represent computations.
The benefits of IEnumerable:
- IEnumerables are pure values.
- IEnumerable takes care of resource handling.
- Effects are executed in a predictable manner.
- No implicit memory leaks.
- No memory allocation for iteration.
The main disadvantage of IEnumerable is that like lazy lists and iterators, they do not support parallelism nor push collections nor efficient drop/dropWhile.
The third party C# library Rx provides the equivalent of IEnumerable/IEnumerator for push collections. Push collections are collections where the producer rather than the consumer is in charge of when elements go through the pipeline. An example is a GUI button with an event handler that handles clicks. You can view the event stream of clicks as a collection. Since we can’t control when a user will click the button, the producer is in charge of pushing elements through the pipeline. Compare with IEnumerable where the consumer such as
Sum() is responsible for pulling elements out of the pipeline. When you’ve built up an IObservable pipeline, you can use it by subscribing an event handler on it. Here’s an example:
This creates an event stream of clicks, takes only the left clicks, then maps each click to its position, and subscribes an event handler that prints each position to the console.
IObservable has all the advantages and disadvantages of IEnumerable, except it works on push collections rather than pull collections.
The new hotness is Clojure’s reducers and transducers. Reducers and transducers make use of an isormophism between lists and folds. I’m going to use Haskell syntax to explain them, because it’s easier to understand that way.
For (lazy) lists we have three types of things we are working with:
- Lists producers:
[a], these can be constructed from a collection, or by some generating function like
range n k.
- List transformers:
[a] -> [b], like
scanl f xand
- List consumers:
[b] -> r, like
fold f x.
So we have:
[a] [a] -> [b] [b] -> r producers transformers consumers
List consumers can be written as a fold:
sum = fold (+) 0,
length = fold (+1) 0 etc. The
fold has an accumulation function as an argument of type
(state -> b -> state), and an initial
state. This brings us to the idea to represent consumers as a fold:
For some fold operations the state is not actually what we want to return. For example to compute the
average function, we may want to accumulate a pair of
(sum, length) as the state:
But eventually we want to return the average
sum / length. To do that we create a type
Fold that represents the final fold with one additional operation
state -> r:
However, there is no reason to expose the state in the type. We can hide it with an existential:
We can represent a fold that computes the average as follows:
We can use this to compute the average of a list:
That’s the consumer side, how about producers? By analogy with lists: if
[a] is a list producer, then
[a] -> r is a list consumer. If
Fold a r is a consumer, then
Fold a r -> r is a producer! Lets call this type Unfold:
What about the transformers? I’ll call the transformers Refold, and they transform FoldFn’s:
Note that in Clojure terminology, Unfolds are reducers, and Refolds and Folds are transducers (combining them into one is a design mistake if you ask me).
We can define composition functions that compose Refolds, Unfolds and Folds:
I’ll leave this as an exercise for the reader. To complete the analogy with lists:
Unfold a Refold a b Fold b r producers transformers consumers
We can define
Though it would of course be more appropriate to define flatmap like this, to stay true to our new sequence abstraction:
See also Tekmo’s foldl library.
Advantages & Limitations
This abstraction has several advantages:
- It supports writing
averagecompositionally and without memory leaks.
- It supports parallelism: a Refold operation can be executed in parallel.
- It supports both push and pull collections: Refold can work for both push and pull
- It executes pipelines without any memory allocation.
- You can execute effects deterministically (at least in an impure language).
It also has several disadvantages:
- It does not support automatic resource handling.
- It does not support
dropWhilein an efficient way.
- It does not support
scanl, or stateful transformers in general.
- It does not support
zip: there’s no efficient way to combine
Unfold a -> Unfold b -> Unfold (a,b).
Some of these are easily addressed: resource handling can be added. Efficient
takeWhile can be added: you can add early stopping to FoldFn: instead of
state -> x -> state you do
state -> x -> Maybe state, where
Nothing indicates that we want to terminate. You can add a way to keep state accross elements in Refolds, to support
scanl. More generally you can upgrade Fold to support any monadic effect. In Clojure, native mutable state is used for stateful transducers.
Once you add these features you can no longer support parallelism however. You need to make a choice: either you have left folds and stateful transformers, and not parallelism, or you have associative folds and stateless transducers but you get parallelism. (you could have an associative version of scan that does a parallel prefix sum)
There are many other options, like pipes or conduits, iteratees, enumeratees, etcetera. These all support some subset of the operations with a certain level of efficiency, but none are perfect on all points.
We would like to support all operations in all contexts (push, pull, parallel, and others), without any allocation, and all the other desiderate we’ve talked about. This is impossible: a stateful transformer is fundamentally incompatible with parallelism, for example. Zipping is incompatible with push collections & no memory allocation, and unzipping is incompatible with pull collections & no memory allocation. It seems that we’ll have to implement different abstractions for different contexts. We need push collections, with all the operations they support, pull collections with all the operations they support, parallel collections with all the operations they support, and maybe more for different circumstances.
This is unfortunate, because even though there are differences, there are also a lot of commonalities. All three support
filter, for example. Moreover, we’d like to support
map in a broader context than just sequences. There are many types that can be mapped (i.e. are an instance of Functor):
Either a b
- An infinite lazy binary tree
a -> b
We would like to be able to map pipelines
map f . map g over infinite binary trees without building an infinite intermediate lazy binary tree in the middle. We should be able to do filter pipelines
filter p . filter q without intermediate allocations too. Even mixes
map f . filter p . map g should be no problem. You could try to be clever here, and flatten the binary tree to a sequence by interleaving the left & right subtrees, and then at the opposite end of the pipeline “uninterleave” the sequence back into a binary tree. That would make it terminate, but it would still be very inefficient. What if we only needed one path down the resulting binary tree? Then we would need to skip over many sequence elements. No, the reality is a universal sequence abstraction isn’t going to cut it.
Glimmer of hope
You probably already got where we’re going here. Instead of composing producers, like iterators/enumerables, and instead of composing consumers, like folds, we compose transformers. We use laws such as:
as the definition of
filter. To support multiple different operations we need to find a more general representation that can represent all such compositions:
We compose transformers, and the type of the composition is as specific as possible, so that it can be used in as many contexts as possible. Any context that supports
map can support a composed pipeline of
map’s. Any context that supports
filter can support a composed pipeline of
filters. Any context that supports
filterMap can support a composed pipeline of
More on this in part 3.