PHOTOOG Photography writings by Olivier Giroux

15Nov/082

map_vector and multimap_vector

Earlier this year we had the honor to host Scott Meyers at Nvidia for a one-day class. One of Scott’s key messages was that people often abuse associative containers when they could simply use a “sequential container + algorithms” combo. One example of this was dubbed the “map-like vector”, which is a vector layered with std::sort/equal_range.

Introduction

The map-like sequence pattern, henceforth map_vector, has a few key advantages that make it appealing in a variety of situations. First, its memory is much more compact, particularly for simple types. Map_vector fits all objects in a single linear piece of memory to which it has random-access internally, and the iterators that it exports boil down to basic C pointers. Second, it has the option to defer the work to order its contents until you make queries that actually require it. In this case insertions are O(amortized-1), the first query following an insertion is O(n*log-n) and all subsequent queries are O(log-n).

You can get clever in all sorts of ways, but basically the map_vector has the same worst-case as the map, while it improves the best-case and lowers constant factors drastically. Constant factors are not to be underestimated!!

This pattern is particularly suitable in the case where you don’t interleave insertions and queries, as you can imagine. For example, your program might build up a dictionary during startup and then refer to the dictionary as an immutable object the rest of the time. The pattern’s pathology is clearly in the case where you perfectly interleave insertions and queries – the program can become O(n*n) or O(n*n*log-n) very easily depending on the naïveté of the implementation.

The bottom-line here is that this is a specialized container with an imbalanced performance curve. It is better in a few scenarios, enough to warrant developing it, but it is not something you can just go and use without thinking. (is anything like that?)

Zoom to November. I just spent time this week to implement map_vector<> and multimap_vector<> in a way that would make it palatable to the ISO. Here I report my design decisions I had to make, and their impact on usefulness of the pattern.

Interface

The public definitions are the following (verboseness edited for email):

namespace stdext {
template< class Key ,
class Value ,
class Predicate = less< Key > ,
class Allocator = allocator< pair< Key , Value > > ,
class Container = vector< pair< Key , Value > , Allocator > ,
class Sequencer = unique_value_sequence< ... > >
struct map_vector : public map_like_vector< ... >;
template< class Key ,
class Value ,
class Predicate = less< Key > ,
class Allocator = allocator< pair< Key , Value > > ,
class Container = vector< pair< Key , Value > , Allocator> ,
class Sequencer = multi_value_sequence< ... >
struct multimap_vector : public map_like_vector< ... >;
}

The complete interface of associative containers is supported, with the following extensions:

  1. Iterators are random-access by default, they’re a typedef to the Container policy’s iterator.
  2. Keys are mutable so you can change them while iterating. The caveat here being that if you abuse this extension then all bets are off, you’re responsible for not breaking order.
  3. The back-end Container policy is user-replaceable. You can use std::deque, for example, which could be suitable for very large maps. You could write your own too, of course, or get one from Boost.
  4. Sequencer policy is user-replaceable. This policy object manages the Predicate policy and performs insertions and queries on the Container policy.

Notes: map_vector

The short of it is that nobody’s going to like map_vector. It blows chunks.

One of the first things I realized when I set out to write the library is that the map_vector class itself cannot in fact achieve O(1) insertions and still retain drop-in compatibility with std::map. The problem is that std::map promises that each key will only be associated with one value (at all times!). For us, this promise implies either an O(n) search followed by an O(amortized-1) insertion, or an O(log-n) search followed by an O(n) insertion.

The uniqueness promise is not something I was willing to weaken (even if temporarily) to achieve my goals, because someone (me, for instance) might rely on map uniqueness to spill a very large amount of duplicate data into a std::map expecting that only a few records will really be created. The coalescing function of the map does matter to some of its users, it’s part of its “soul”.

Hence map_vector is *always* sorted in my implementation. The insertion operation is a sequence of lower_bound followed by a vector insertion with placement. Note that the call to vector<>::insert is O(n) because in general the placement is not end( ).

The conclusion for map_vector is sobering for the excited map-like vector crowd. Unless you relax the guarantees, and make yourself a map-incompatible interface, you will not see superior performance in very many cases. Inserting N distinct elements into this container is basically O(n*n) cost, so for N larger than the constant factor savings the container is rather repulsive. With more implementation effort this cost could be lowered for the specific case where you assign to the container from a range, combining an O(n) range insertion, O(n*log-n) sort, O(n) duplicate-relocation pass and an O(n) erase call.

The performance benefits of the basic map_vector implementation will come only from memory compaction / reduced indirection, and only if one or more of these conditions are met: (1) the number of queries far outweighs the number of unique insertions, (2) the number of duplicate insertions far outweighs the number of unique insertions, (3) the number of elements involved is tiny.

Notes: multimap_vector

This is the real map-like vector. It rocks!

It’s a whole different story for multimap_vector. The multimap interface doesn’t promise uniqueness, and it will accept all values that come along. In this case I could deliver the full promise of the map-like vector idea.

The implementation of insert is trivial, basically the same as push_back. A Boolean flag remembers if the sequence is sorted, which is only invalidated by insertions.

As in the case of map_vector, all queries are expressed in terms of lower_bound. The difference here is that lower_bound requires an explicit sort for the multimap_vector when it is not already sorted.

Before you ask, the “const” version of query functions strip off const-ness and do the same thing. The argument is that ordering the container doesn’t alter its true meaning – it always projects the idea that it is sorted. Const-ness of the container is not the same as const-ness of its contents in this case, and the latter is preserved in all cases where the multimap_vector is const on the client interface.

Performance-wise, the multimap_vector is as you would expect. Insertions are O(1) in all cases. Queries are all O(log-n) except the first one immediately following an insertion. I know there is room for more cleverness here, but I felt simplicity of the algorithms was the key to use the data structure properly.

There are two conclusions that I draw.

First, if you *know* you don’t have any duplicates in your data set then you don’t need the guarantee of map_vector. You should just use multimap_vector and get the better performance for free. Even if you don’t know whether there are duplicates, if you are willing to accept temporary memory bloat then you can massage this container after the fact (same as multimap).

Second, if you are using an std::multimap today, then this is a drop-in replacement in any situation suitable for the map-like vector pattern. No big caveats, just better performance in the right places.

Cheers,

Olivier


Filed under: Uncategorized Leave a comment
Comments (2) Trackbacks (0)
  1. Great article.

    begin and end trigger a sort? All queries and insertions invalidate iterators? If so then then it’s not a complete drop-in.

    What’s the “more on this later” from point 2?

  2. Begin and end do *not* trigger a sort in my implementation, they forward directly to the underlying vector. Only find, lower_bound, upper_bound and equal_range sort prior to executing.

    This does potentially expose another difference where a map or multimap would always appear to be sorted for a begin-to-end traversal. This is a nice side-effect of maps, which I’ve relied on more than once, but not one that I consider to be truly essential to their function as associative containers. I don’t mind that hash_map doesn’t have this property, for instance, and they are *much* more expensive than std::sort all other things being equal.

    Indeed iterators are (potentially) invalidated by both insert and query operations, which does reduce the drop-in-ability. This is the 2nd design call that had to be made. People who keep outstanding iterators for a long time as if they were pointers are – in my opinion – abusing the abstraction. They’re free to do so and the standard totally supports them, but it’s not the most brilliant of ideas.

    The guarantees of the map_vector are thus a combination of the map and the vector guarantees. Does that make sense? It’s in the name. ;^)

    The “soul” of the map-like vector in my mind is that the population phase and use phase for the container are cleanly separated. That’s when the container works well, and why you should use it. In this case the iterator invalidation and sorting issues are moot points, because once you start taking out iterators from the container it has become sorted and immutable.

    So for many of my use-cases it’ll basically be a drop-in replacement. I expect to tweak the code a bit, sure, but I won’t need to spend very much time doing that.

    As for the 2nd section reference… I just forgot to get back to that. ;^)


Leave a comment

No trackbacks yet.