Warning: this is something of a rant. Don’t say I didn’t warn you.
Something that continues to frustrate me is the proliferation of poor memory management decisions in tools and libraries for tools. The default is to allocate every single tiny piece of memory with malloc and then free it later.
Case in point: JSON parser libraries. Most of them allocate tiny JSON values individually from the C library heap, which makes absolutely no sense. The way most command-line tools work with JSON data is to slurp in a data file, parse it into a JSON structure, process that data and then exit. They typically don’t sit around holding a pointer to a small fragment of a JSON document; the whole parse tree lives and dies together. Allocating JSON values individually on the CRT heap is a terrible choice in a scenario like that. What we should do instead is exploit that higher-level knowledge about the memory usage pattern.
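To make this concrete, here is a minimal sketch of the kind of page-based linear allocator I mean. The names and the 8-byte alignment choice are illustrative, and it punts on requests larger than a block; it is not the exact code from the test below.

    #include <stdlib.h>

    #define BLOCK_SIZE (1024 * 1024)   /* 1 MB pages, as in the test below */

    typedef struct block {
        struct block *next;            /* blocks form a singly linked list */
        size_t used;                   /* bytes handed out from this block */
        char data[BLOCK_SIZE];
    } block_t;

    typedef struct {
        block_t *head;                 /* current block; bump-allocate from here */
    } linear_allocator_t;

    /* Bump-allocate from the current block, grabbing a fresh block from
       malloc only when the current one is full. Individual frees are no-ops. */
    static void *linear_alloc(linear_allocator_t *a, size_t size)
    {
        size = (size + 7) & ~(size_t)7;     /* keep allocations 8-byte aligned */
        if (size > BLOCK_SIZE)
            return NULL;                    /* oversized requests not handled here */
        if (!a->head || a->head->used + size > BLOCK_SIZE) {
            block_t *b = malloc(sizeof *b);
            if (!b)
                return NULL;
            b->next = a->head;
            b->used = 0;
            a->head = b;
        }
        void *p = a->head->data + a->head->used;
        a->head->used += size;
        return p;
    }

    /* Throw the whole parse result away in one sweep. */
    static void linear_free_all(linear_allocator_t *a)
    {
        while (a->head) {
            block_t *next = a->head->next;
            free(a->head);
            a->head = next;
        }
    }

Allocation is just a pointer bump, and throwing away the whole parse tree is a walk over a handful of blocks instead of millions of free calls.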
A lot of people are now preparing their rebuttal about how fast a general heap manager can be, or how the allocator can be optimized. I call bullshit on all of that, and here’s why.
To prove how much of a difference this makes in practice, I modified the popular json-parser library so I can select at runtime whether to use the CRT heap or a custom page-based linear allocator. The program simply reads a block of JSON data into memory and parses it 50 times, immediately throwing away the parsed data.
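The hookup looks something like the sketch below. This is from memory rather than the literal test code: json-parser’s json_settings struct has mem_alloc/mem_free/user_data fields for overriding the allocator (check json.h for the exact signatures), and use_linear, data and length are assumed to come from argument parsing and file loading.

    #include <string.h>
    #include "json.h"   /* udp/json-parser */

    /* Adapters that route json-parser's allocations into the linear
       allocator sketched above. */
    static void *arena_alloc(size_t size, int zero, void *user_data)
    {
        void *p = linear_alloc((linear_allocator_t *)user_data, size);
        if (p && zero)
            memset(p, 0, size);
        return p;
    }

    static void arena_free(void *ptr, void *user_data)
    {
        (void)ptr;
        (void)user_data;   /* individual frees are no-ops; see linear_free_all */
    }

    static json_value *parse_with_arena(linear_allocator_t *arena,
                                        const char *data, size_t length)
    {
        json_settings settings = { 0 };
        settings.mem_alloc = arena_alloc;
        settings.mem_free  = arena_free;
        settings.user_data = arena;

        char error[json_error_max];
        return json_parse_ex(&settings, (const json_char *)data, length, error);
    }

The benchmark loop is then trivial; in the linear case the whole parse tree is discarded with one bulk free per iteration:

    for (int i = 0; i < 50; ++i) {
        if (use_linear) {
            linear_allocator_t arena = { 0 };
            json_value *v = parse_with_arena(&arena, data, length);
            (void)v;                  /* parse and immediately discard */
            linear_free_all(&arena);  /* one bulk free per iteration */
        } else {
            json_value *v = json_parse((const json_char *)data, length);
            json_value_free(v);       /* walks and frees every node */
        }
    }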
You can check out the code here on github.
Here are the results:
    $ time mmtest game.json heap

    real    0m1.642s
    user    0m1.589s
    sys     0m0.051s

    $ time mmtest game.json linear

    real    0m0.755s
    user    0m0.697s
    sys     0m0.057s
The stock version using the CRT heap is 2.2x slower!!
Running with valgrind to trace the heap activity reveals the (only) difference between these runs:
    heap:   7,024,565 allocs, 7,024,205 frees, 337,671,820 bytes allocated
    linear:       615 allocs,       255 frees, 420,426,770 bytes allocated
No wonder it’s faster! That’s more than 7 million calls each to malloc and free, with all their cache misses and pointless branches, that we’re sidestepping entirely. Sure, we’re wasting some 80 MB of RAM (the difference in bytes allocated above), but do you really care in a command-line program that does its thing and then quits? If you’re worried about it, it can be trimmed down easily. I used a block size of 1 MB in my linear allocator for this test; shrinking it down to 128 kB will still give you almost the same benefit. (The mismatched allocs/frees are from Apple code outside my control, go figure.)
The lesson to take away from this is simple: never make memory management decisions at the lowest level of a system. Always allow the user to control the memory management policy at the highest possible level, where it’s possible to do high-level optimizations safely.
As a bonus bit of C++ bashing, this is why anything based on the STL sucks: it by definition takes microscopic decisions about where memory comes from and how long it will live. The standard response to memory management performance concerns is to “replace the allocator with a more efficient one”. This is the wrong answer, as I hope you can clearly see above. The fastest code is the code you never have to run.
Rant over. 🙂