Fast mmap()able data structures

For version 2 of my build system Tundra, I’m using a DAG data structure that is designed to be mapped straight into memory. The idea is to only regenerate this data (which can be huge for a game-sized project) when needed, such as when tinkering with compiler options or adding files to the project. These things happen much more rarely than actually issuing a rebuild. This way a regular build will only need to spend about a millisecond (mmap is fast!) to get the DAG data and start building.

As I was working on this problem I went back and forth on the data representation in my files. A DAG is typically very “pointery” – there are lots of references to strings, dependencies and other things that either need an offset or a pointer because they’re variable sized and can’t be stored inline.

I’d identified two main contenders:

  1. Use indices for every reference, no pointers.
  2. Use pointers where it makes sense.

Using indices is great because indices take less space than pointers on 64-bit systems. But they’re sometimes more cumbersome to work with because there’s implicit context (the array you’re indexing into) that must be kept around everywhere. Plus it’s harder to debug a large complex data structure glued together with indices. Also the “null reference” typically becomes -1, and that needs to be explicitly handled everywhere or you’ve got a memory error.

Using pointers is great because it’s simple to work with, and works well in the debugger too. But they’re bigger than indices and must be fixed up after the data structure is mapped into ram (unless you can somehow map it to a fixed address which is brittle and something I wanted to avoid.) This means you need to store an additional block of data telling you where the pointers are so you can fix them up – and that’s not free either. The basic way to fix up a pointer is to add the base address of the mmap’d segment to it, a R-M-W operation on random 8 bytes of RAM, which pretty much means a guaranteed cache miss. If you have 100s of thousands of them, it really stats to show as a bottleneck.

So right now I’m using a hybrid approach of these two where pointers are encoded as a signed 32-bit integer, giving you the relative offset of the target byte with respect to the pointer itself. Although I’m usually not a fan of C++ templates, this is one case where they help with type safety and reducing casts/mistakes:

template <typename T>
class FrozenPtr
  int32_t m_Offset;
  const T* Get() const
    uintptr_t base = (uintptr_t) this;
    uintptr_t target = base + m_Offset;
    uintptr_t null_mask = (0 == m_Offset) ? 0 : ~uintptr_t(0);
    return reinterpret_cast<T*>(target & null_mask);
  operator const T* () const { return Get(); }

  // ... other stuff, disabling ctors/dtors and so on

The magic happens inside the Get() member function – we take the address of the pointer itself and add the relative offset in bytes to it. We then mask to see if the offset is zero (that’s how null pointers are encoded) and return the pointer.

This gives us two nice advantages: pointers are only 4 bytes, and you don’t need to fix them up. They are however not debugger-friendly and there’s is a slight cost to dereferencing them. For my use case, it was a pretty good tradeoff. With this pointer template I just mmap the data for my DAG file and immediately cast it to the top-level struct type I care about.



One thought on “Fast mmap()able data structures

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s