Bjarne Stroustrup, in his GoingNative 2012 keynote, talks about a piece of example code that does some routine things with std::vector and std::list. He meant to present the resulting timing graph, but the chart was missing from his presentation (not that it matters much; everybody should know what the result is.)
A couple of weeks ago, my friend Amir H. Fassihi gave a talk about video game optimization, so he and I sat down and wrote a couple of small programs to measure some simple stuff. The vector vs. list program was one of them. I’m going to present here the results and graphs that I hope are analogous to Stroustrup’s missing one. I should also mention that, while I was at it, I measured some other aspects of the same program, like the performance of Debug vs. Release builds and the effects of Microsoft’s iterator debugging “feature” on performance.
I did all these tests using Microsoft Visual C++ 2012 (the newly released final version, a.k.a. VC++11, a.k.a. Microsoft C/C++ compiler version 17.00.50727.1 for x86) on Windows 7 64-bit, on a machine with an Intel Core i7-920 CPU and a lot of DDR3-1066 memory. All my builds were 32-bit.
The Test Case
The code does two simple things. First, it fills a container (std::vector or std::list) with many thousands of random integers, keeping the numbers in order: it generates a number, finds the place in the container where it should go (and does that in a linear and stable manner,) and inserts it there (which, in the case of vector, means moving everything after it one place over, and possibly a reallocation and copy.)
After that, the code erases the numbers from the container one by one, using the same numbers that were used to fill it, and in the same order. Again it searches for each number linearly; in the case of vector, erasing means moving every element after it one place back (no reallocation this time, though.)
Note that with a sorted vector, I could have easily done a binary search, but I didn’t do that (you know, to be fair to std::list!)
Here are the two functions that do the insertion and deletion of those many numbers:
template <typename ContainerType> // Assuming ContainerType is std::vector<int> or std::list<int>
void Insert (ContainerType & c, unsigned n, unsigned seed) {
    Srand (seed); // Seed the test's RNG (assumed; Rand()/Srand() are helpers defined elsewhere in the file)
    // Insert a sentinel to simplify the logic inside the loop.
    // The sentinel never occurs in the data stream; the Rand() function ensures this.
    c.push_back (1000000);
    for (unsigned i = 0; i < n; ++i) {
        auto d = Rand (); // This function generates an integer in [0, 1'000'000)
        for (auto j = c.begin(), je = c.end(); j != je; ++j)
            if (*j > d) {
                c.insert (j, d); // For vector, this invalidates j and je...
                break;           // ...so bail out of the inner loop immediately.
            }
    }
    c.pop_back (); // Remove the sentinel
}

template <typename ContainerType>
void Delete (ContainerType & c, unsigned n, unsigned seed) {
    Srand (seed); // Re-seed (assumed,) so we erase the exact same sequence Insert() produced
    for (unsigned i = 0; i < n; ++i) {
        auto d = Rand ();
        for (auto j = c.begin(), je = c.end(); j != je; ++j)
            if (*j == d) {
                c.erase (j);
                break;
            }
    }
}
The rest of the file consists of timing code, loops that run these functions many times for different values of “n”, and reporting code.
Here’s the full test source code: vector_vs_list_win32.cpp. Note that you must edit (a couple of lines in) this file to run with different containers and iterator debugging levels. The edits should be self-evident.
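To give you a feel for it, here’s a minimal sketch of that kind of driver (this is illustrative only, not the actual code from the file; the seed value is made up, and it assumes the Insert() and Delete() templates above are in scope):

#include <chrono>
#include <iostream>
#include <vector>

int main () {
    using namespace std::chrono;
    for (unsigned n = 5000; n <= 100000; n += 5000) {
        std::vector<int> c; // change to std::list<int> for the other set of runs
        auto t0 = steady_clock::now ();
        Insert (c, n, 42); // 42: an arbitrary seed, for illustration
        Delete (c, n, 42); // same seed, so the exact same numbers get erased
        auto t1 = steady_clock::now ();
        std::cout << n << ": " << duration_cast<milliseconds>(t1 - t0).count () << "ms\n";
    }
    return 0;
}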
So without further ado, this is the result of two runs of the program (both optimized builds,) for std::vector and std::list, for lengths from 5’000 to 100’000 in increments of 5’000:
Figure 1. Run of optimized program (i.e. Release mode) for std::vector and std::list
The vector out-performs the list by a large margin; it basically blows it out of the water. This result should not surprise any experienced programmer, but unfortunately it does surprise many of us. I want you to think about it a little bit: the test case is almost ideal for linked lists. It does only linear access (no binary searches or other shenanigans) and inserts and deletes a lot of items. Vectors are supposed to perform badly in these situations: there is a lot of shoving elements along, one by one, and quite a bit of copying of the whole vector whenever a resize and reallocation is needed. I could have pre-allocated the vector, but I didn’t. I could have done a binary search, but I didn’t. I could have put all the numbers in and then sorted them, but I didn’t. This is an almost pathologically bad case for using a vector.
The vector still kills the linked list in this test, and that’s all because a vector stores its data in a contiguous block of memory. The vector performs so much better (and the difference becomes much more pronounced as your data structure grows) because it accesses memory linearly.
Memory is so slow when you access it randomly that the penalty for cache misses almost always dominates your CPU time. If you care at all about performance, main memory should be considered a sequential-access device. You have been lied to all your life: RAM is not a random-access memory device!
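If you want to see this effect in isolation, here’s a quick experiment you can run (a sketch of my own, not part of the test program): walk the same array once in sequential order and once in a shuffled order, and compare the times. On typical hardware, the shuffled walk is several times slower, purely because of cache misses.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

long long Walk (std::vector<int> const & data, std::vector<unsigned> const & order) {
    long long sum = 0;
    for (size_t i = 0, n = order.size (); i < n; ++i)
        sum += data[order[i]]; // identical work in both runs; only the access pattern differs
    return sum;
}

int main () {
    unsigned const N = 10000000; // large enough that the data cannot fit in any CPU cache
    std::vector<int> data (N, 1);
    std::vector<unsigned> order (N);
    std::iota (order.begin (), order.end (), 0u);

    using namespace std::chrono;
    auto t0 = steady_clock::now ();
    long long s1 = Walk (data, order); // sequential access
    auto t1 = steady_clock::now ();
    std::shuffle (order.begin (), order.end (), std::mt19937 (42));
    auto t2 = steady_clock::now ();
    long long s2 = Walk (data, order); // random access over the very same data
    auto t3 = steady_clock::now ();

    std::cout << "sequential: " << duration_cast<milliseconds>(t1 - t0).count () << "ms, "
              << "random: " << duration_cast<milliseconds>(t3 - t2).count () << "ms"
              << " (checksums: " << s1 << ", " << s2 << ")\n"; // print the sums so the walks aren't optimized away
    return 0;
}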
This has been true for the last few years, and it will only become more so in the future, barring a true revolution in hardware technology (think ansibles!) or a fundamental change in the computing model (think abandoning the von Neumann model,) of which there is no sign yet (and which would take decades to arrive, even if we had an alternative model.) This particular performance killer will only get worse with time.
By the way, I actually ran the vector vs. list test for much larger numbers of elements. I went up to 800’000 elements, at which point the list took 40 times as long as the vector to do the operations (about 4 and a half hours.) I didn’t include those results because I had changed the code a little bit and didn’t want to spend another 48 hours running the full set of tests up to a million. Besides, the extreme difference in those cases would have made the charts unusable.
A Word About Memory Allocator Pressure
In my test program, I start from small values of N and go successively upwards. Along the way, the program allocates and frees a lot of memory (especially the list version.) It is theoretically possible that after that many (de)allocations, the memory is too fragmented (this shouldn’t happen, because all the allocations in the list test are of the same size,) or the internal data structures of the memory allocator are in such a state that later allocations become slower. This, should it happen, would be a problem specific to this test, and wouldn’t occur if I had just run the test for a single specific (large) size.
Of course, that’s a theoretical problem. I did run the test for single specific sizes and didn’t see any meaningful differences from the numbers presented in the above chart. So, the allocator in VC++ 2012 and Win 7 at least doesn’t have these problems. (The allocator seems quite good actually. I haven’t done any real benchmarks (don’t know how!) but it looks very good.)
The Iterator Debugging Fiasco
Let me give you some background information about iterator debugging first. For years now (at least since VS2005) the Standard Template Library that is shipped with VC++ has had some debugging and security “features”. And I’m not talking about a handful of compile-time and run-time assertions here and a handful of NULL checks there.
Among other things, every instance of an iterator could (and probably did) store a pointer to its parent container (which adds 50%-100% to the size of most iterators,) had to keep this pointer current and correct (which requires some bookkeeping,) and did a lot of checks against it on almost all operations (e.g. to see whether a begin() and end() passed to a function belong to the exact same container, etc.) Let’s call this category of “features” security features.
Also, every container could store a linked list of all iterators created against itself, and had to keep this list up-to-date (oh, the horror!) so that it could (among other things) notify all its iterators that it is being destructed and they are becoming invalid. This also adds to the size of the iterators, in addition to the obvious performance penalty. Let’s call this category of “features” debugging features.
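To make the idea concrete, here’s a very schematic illustration of what a “checked” iterator carries around compared to a plain one. This is just to convey the concept; it is not MSVC’s actual implementation:

struct Container; // forward declaration, for illustration only

struct PlainIterator {
    int * ptr;              // the position, and nothing else
};

struct CheckedIterator {
    int * ptr;              // the position
    Container * parent;     // "security" features: which container owns me?
                            // (validated on comparisons, dereference, insert/erase, ...)
    CheckedIterator * next; // "debugging" features: a node in the container's list of
                            // live iterators, so the container can find and invalidate me
};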
The basic causes of the problems that ensued (which I will describe in a moment) were that having these “features” enabled produced a large and certainly noticeable performance hit, and also it changed the in-memory size and layout of STL containers and iterators. The performance thing is obviously undesirable, but at least most of us are used to trading off some performance to gain some correctness or robustness or reliability or security or whatever (although in this case, it wasn’t always some performance. It’s not hard to find cases that are several hundred or thousand times slower with these “features” enabled.)
The change in size and layout that these “features” brought with them caused a horrible problem: you couldn’t link a module that had either of these sets of features enabled with one that didn’t. I mean you could, but you had to be very, very, very careful, redesign all your interfaces, review all your code, and do some other equally “trivial” things! (You couldn’t link them together because your program would then have violated the “One Definition Rule” of C++, which basically states that entities (functions, structs, classes, etc.) must have one and only one definition, and that most certainly includes size and memory layout. Google “ODR” if you don’t know exactly what it means and why it is important.)
All of this, so far, is pretty much inevitable from the moment you decide to put those “features” in. I mean, sure, Microsoft could have introduced a special mode that would change the size and layout of the data structures but not do the checks and not incur most of the overhead. But hey, how can you expect them to put in a (partial) solution to a problem they created out of the blue?!
And then they went on and added real insult to injury: they enabled those features by default (all of them in Debug mode, the “security” “features” in Release.) I don’t know whether you can imagine how horrible this was. Either you paid the performance penalty (sometimes several orders of magnitude,) or you had to make sure every piece of code you linked together had the same consistent state for these “features”. And that, unfortunately, is a lot harder than it should be, even with open-source software and libraries.
But fear not! Microsoft decided to (partly) do the right thing with VS2010 (VC++10) and disabled these “features” by default in Release mode. The full insanity is still in force in debug builds, unless you explicitly disable it.
The funny thing (funny like “you just cut my stomach open and spilled all my guts on the floor, ha ha!”) is that I have been using all these versions of Visual C++ almost exclusively over the past decade, programming in C++ almost exclusively for all that time, and doing almost nothing but programming as a job and hobby and education, and (wait for it) I have never ever seen an error or exception or assertion or anything else that would mean these “features” had found a semantic or logical bug in my code. NEVER. (This has happened in experimental code I wrote to explore these exact “features”, though. Oh, and I think once I did get a runtime error when I was demonstrating a deliberately wrong use of STL iterators to a colleague. But that’s it.)
Anyways, the way you control these “features” now is through a macro: _ITERATOR_DEBUG_LEVEL. By setting its value to zero, you disable all these “features”. If you set it to 1, you enable the security checks (internally, this sets _SECURE_SCL to 1, if you are interested,) and setting iterator debug level to 2 (which is only possible in Debug builds) enables both “security” and “debugging” “features” (that is, internally sets both _SECURE_SCL and _HAS_ITERATOR_DEBUGGING to 1.)
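For example (a sketch; the /D flag is the standard cl.exe way of defining a macro): you can define it in code before any standard header is included, or better yet, pass it on the compiler command line so it cannot drift between translation units:

// Either in code, before ANY standard header is included:
#define _ITERATOR_DEBUG_LEVEL 0 // 0: everything off, 1: "security" checks, 2: full iterator debugging (Debug builds only)
#include <vector>
#include <list>

// ...or, preferably, on the command line (or in the project settings):
//   cl /D_ITERATOR_DEBUG_LEVEL=0 /O2 vector_vs_list_win32.cpp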
Note that the iterator debugging features cannot be enabled in release builds (thank goodness!); only the security stuff can be. And you must make sure that these settings are consistent (i.e. the same) across all the translation units you link together, or be ready for a horrible and painful death.
Another redeeming step Microsoft has taken is that its linker now actually detects whether the values of these macros differ between object files being linked together, and emits an error at link time. This is done via #pragma detect_mismatch, which you can use too, but only if you really, really, absolutely need it. I don’t recommend its use at all. (Actually, I don’t recommend doing anything that would make you need that pragma, rather than the use of the pragma itself per se.)
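For the record, the mechanism looks like this (the key and value here are made up; the standard library headers do essentially the same thing internally with their own keys):

#pragma detect_mismatch("my_build_flavor", "fast")
// Any two object files that define the same key with different values
// will make the linker fail with an error (LNK2038).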
Let me reiterate: since VC++10, _ITERATOR_DEBUG_LEVEL is at 2 by default in debug builds and at 0 in release builds. So you have fewer worries.
Debug vs. Release, Iterator Debugging Levels and Vector vs. List
I wanted to see how much of an impact building in debug mode has on performance. While I was at it, I figured I’d test the effect of iterator debugging too. Of course, this particular test case might not be representative of the full extent of the impact of debug iterators, since it doesn’t create a lot of them. But it’s a test case, right?!
So, I ran the above code in 10 configurations: 5 times for std::vector and 5 times for std::list. Of each of those 5, two were release builds (with iterator debugging level at 0 and 1) and three were debug builds (with iterator debugging level at 0, 1 and 2.) And here are the results:
Figure 2. The slower tests are charted to 50’000 items only to make it more readable and comparable. Also, “vector (D1)” means a test run with std::vector, in Debug mode and with _ITERATOR_DEBUG_LEVEL set to 1. Others are similar.
Clearly, the results are divided into 4 or 5 groups:
Last and least is std::list(D1), which for some reason is slower than everything else, followed by list(D2), vector(D1) and vector(D2). These are the slowest of the bunch, and excruciatingly so. The surprising aspect is the performance of vector in debug mode with _ITERATOR_DEBUG_LEVEL above 0: it matches the performance of the linked list under the same circumstances perfectly. It might interest you to know that vector(D1) is consistently 350-380 times slower than vector(R0)!
Then there are list(D0) and vector(D0) which, although they do show a visible performance difference between themselves, run in roughly half the time of their (D1) and (D2) counterparts. Which is nice: it shows that Microsoft took something slow (containers in debug builds at D0) and made it twice as slow by adding iterator debugging! Again, note that vector basically turns into a linked list (performance-wise) in debug mode.
Then come the release runs. The slower among them are (not surprisingly) list(R0) and list(R1), which show no significant difference in performance.
Then comes vector(R1) which consistently takes about 4.2 times that of vector(R0) to run.
And vector(R0) is obviously the fastest of them all.
Here are the raw timing numbers, in case you are interested:
Table 1. The raw data from which the above charts were generated.
It is probably always worth it to linearize your data and flatten your data structures. This, instead of hierarchical data structures and pointer spaghetti, should be your default choice. Again: unless you have proven otherwise, a flat vector of data (not pointers) should be your default data structure for everything!
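In code, the distinction looks something like this (the Particle struct is just an illustrative name):

#include <vector>

struct Particle { float x, y, z, mass; };

std::vector<Particle> flat;        // one contiguous block; walking it is linear memory access
std::vector<Particle *> spaghetti; // every element is a pointer-chase to somewhere else on the heap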
Think of vector vs. linked list this way (I’m paraphrasing Stroustrup here): A linked list is a sequential data structure, but when you walk it sequentially, you are randomly jumping around in memory. A vector is a random-access data structure, but when you walk it sequentially, you are accessing memory linearly and very nicely.
Look at Figure 2, at the slowest performers. Two vector runs are in there as well as two list runs, and they have almost exactly the same performance. It shows that it doesn’t take much to ruin a great thing: just some unconscious and uninformed decisions.
I could have made a functionally equivalent version of the test for vector that would be much faster than this, by doing a binary search, preallocating the vector, and employing other sort strategies (instead of essentially doing the worst things possible short of shuffling the data randomly until it is sorted!) And I could have done all this without much more code; maybe 20-50 more lines. None of these optimizations is possible with a list; you have to go for other, far more complicated data structures to accomplish some of them (skip lists, balanced BSTs, etc.) I might do these optimizations one day, to see how a faster vector version compares to std::set, for example.
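As a taste of it, the insertion half could look something like this (a sketch only; Rand() and Srand() are the same assumed helpers as in the test program above):

#include <algorithm>
#include <vector>

template <typename T>
void InsertFast (std::vector<T> & c, unsigned n, unsigned seed) {
    Srand (seed); // assumed RNG helper, as before
    c.reserve (c.size () + n); // one allocation up front; no reallocate-and-copy cycles
    for (unsigned i = 0; i < n; ++i) {
        auto d = Rand ();
        // An O(log n) binary search replaces the linear scan. The insert itself still
        // shifts the tail of the vector, but that is one contiguous, cache-friendly move.
        c.insert (std::lower_bound (c.begin (), c.end (), d), d);
    }
}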
Building in debug mode and using iterator debugging features of VC++ very effectively turns your vector into a linked list!
Both levels of iterator debugging are toxic to STL performance. It’s not 10% or 15%; it’s several hundred times! By the way, as I mentioned before, our particular use of iterators and containers doesn’t even trigger the horrors of full iterator debugging. That’s a whole other level of hell by itself.
Unfortunately, but somewhat unsurprisingly, some programmers “know” without reason or equivocation that STL containers are slow. That is absolutely not true. You need to know what’s going on under the hood and make a case-by-case decision, or at least time your fucking code! Do not automatically assume anything, especially about performance.