Fast mmap()able data structures

For version 2 of my build system Tundra, I’m using a DAG data structure that is designed to be mapped straight into memory. The idea is to only regenerate this data (which can be huge for a game-sized project) when needed, such as when tinkering with compiler options or adding files to the project. These things happen much more rarely than actually issuing a rebuild. This way a regular build will only need to spend about a millisecond (mmap is fast!) to get the DAG data and start building.

As I was working on this problem I went back and forth on the data representation in my files. A DAG is typically very “pointery” – there are lots of references to strings, dependencies and other things that either need an offset or a pointer because they’re variable sized and can’t be stored inline.

I’d identified two main contenders:

  1. Use indices for every reference, no pointers.
  2. Use pointers where it makes sense.

Using indices is great because indices take less space than pointers on 64-bit systems. But they’re sometimes more cumbersome to work with because there’s implicit context (the array you’re indexing into) that must be kept around everywhere. Plus it’s harder to debug a large complex data structure glued together with indices. Also the “null reference” typically becomes -1, and that needs to be explicitly handled everywhere or you’ve got a memory error.

Using pointers is great because it’s simple to work with, and works well in the debugger too. But they’re bigger than indices and must be fixed up after the data structure is mapped into ram (unless you can somehow map it to a fixed address which is brittle and something I wanted to avoid.) This means you need to store an additional block of data telling you where the pointers are so you can fix them up – and that’s not free either. The basic way to fix up a pointer is to add the base address of the mmap’d segment to it, a R-M-W operation on random 8 bytes of RAM, which pretty much means a guaranteed cache miss. If you have 100s of thousands of them, it really stats to show as a bottleneck.

So right now I’m using a hybrid approach of these two where pointers are encoded as a signed 32-bit integer, giving you the relative offset of the target byte with respect to the pointer itself. Although I’m usually not a fan of C++ templates, this is one case where they help with type safety and reducing casts/mistakes:

template <typename T>
class FrozenPtr
{
private:
  int32_t m_Offset;
public:
  const T* Get() const
  {
    uintptr_t base = (uintptr_t) this;
    uintptr_t target = base + m_Offset;
    uintptr_t null_mask = (0 == m_Offset) ? 0 : ~uintptr_t(0);
    return reinterpret_cast<T*>(target & null_mask);
  }
  operator const T* () const { return Get(); }

  // ... other stuff, disabling ctors/dtors and so on
};

The magic happens inside the Get() member function – we take the address of the pointer itself and add the relative offset in bytes to it. We then mask to see if the offset is zero (that’s how null pointers are encoded) and return the pointer.

This gives us two nice advantages: pointers are only 4 bytes, and you don’t need to fix them up. They are however not debugger-friendly and there’s is a slight cost to dereferencing them. For my use case, it was a pretty good tradeoff. With this pointer template I just mmap the data for my DAG file and immediately cast it to the top-level struct type I care about.

 

Loading from Imperfect Data (GDC 2013)

Here are my slides (with notes) from my talk “Loading from Imperfect Data” which I gave at GDC 2013.

Send any feedback my way!

Download slides here (PDF)

Tool/library memory management – You’re doing it wrong

Warning: this is something of a rant. Don’t say I didn’t warn you.

Something that continues to frustrate me is the proliferation of poor memory management decisions in tools and libraries for tools. The default is to allocate every single tiny piece of memory with malloc and then free it later.

Case in point: JSON parser libraries. Most of them allocate tiny JSON values individually from the C library heap. It makes absolutely no sense. The way most command-line tools work with JSON data is to slurp in a data file, parse it to a JSON structure, process that data and then exit. They typically don’t sit around holding a pointer to a small fragment of a JSON document, it’s all managed together. The choice to allocate JSON values on the CRT heap individually is terrible in a scenario like that. What we should do instead is to exploit the higher-level knowledge about the memory usage patterns.

A lot of people are now preparing their rebuttal detailing how fast a general heap manager can be and that we can optimize the allocator — I call bullshit on all that, and here’s why.

To prove how much of a difference this makes in practice, I modified the popular json-parser library so I can select at runtime whether to use the CRT heap or a custom page-based linear allocator. The program simply reads a block of JSON data into memory and parses it 50 times, immediately throwing away the parsed data.

You can check out the code here on github.

Here are the results:

$ time mmtest game.json heap

real 0m1.642s
user 0m1.589s
sys  0m0.051s
$ time mmtest game.json linear

real 0m0.755s
user 0m0.697s
sys  0m0.057s

The stock version using the CRT heap is 2.2x slower!!

Running with valgrind to trace the heap activity reveals the (only) difference between these runs:

heap: 7,024,565 allocs, 7,024,205 frees, 337,671,820 bytes allocated
linear: 615 allocs, 255 frees, 420,426,770 bytes allocated

No wonder it’s faster! That’s a LOT of cache misses and pointless branches we’re sidestepping entirely in the malloc/free functions; more than 7 million calls. Sure, we’re wasting 90 MB of RAM, but do you really care in a command-line program that does its thing and then quits? If you’re worried about it, it can be trimmed down easily. I used a block size of 1 MB in my linear allocator for this test. Shrinking it down to 128 kb will still give you almost the same benfit. (The mismatched allocs/frees are from Apple code outside my control, go figure.)

The lesson to take away from this is simple: Never make memory management decisions at the lowest level of a system. Always allow the user to control the memory management policy on the highest possible level, where its possible to do high-level optimizations safely.

As a bonus C++ bashing, this is why anything based on the STL sucks, as it by definition takes microscopic decicions about where memory comes from and how long it will live. The standard response to MM performance concerns is to ”replace the allocator with a more efficient one”. This is the wrong answer, as I hope you can clearly see above. The fastest code is the one you never have to run.

Rant over. :)

Looking towards Tundra 2.0

If you don’t know already, Tundra is an open-source build system I’ve written and have been maintaining for a couple of years now. The source is available on github (https://github.com/deplinenoise/tundra).

In maintaining the tool, I’ve learned some lessons and rethought some of my initial design decisions. This post is about how I want to evolve the tool with those lessons in mind.

What has worked well

Tundra is up to version 1.2 now and is relatively stable. It’s definitely not all bad. The current tool is fast and portable. Its used in a lot of projects by friends and other random people on Twitter (you know who you are!) Specifically, I think the following design decisions have worked out well:

Multi-threaded build engine in C

The build engine itself has proven to be robust and relatively bug-free since early on. It does parallel dependency and header scanning, so incremental builds run really quickly. It uses simple memory management techniques (just flat arrays mostly) which has turned out to work really well.

Using Lua for configuration

Lua is remarkably fast and flexible. I didn’t expect the front-end to run as quickly as it does, given that it does a ton of string processing and data collection before the actual build engine runs. It will commonly dish out a large DAG for the backend in under 100 ms, which is a great achievement for a interpreted script language that has to hit disk to find file names and other things.

Hard separation between configuration and building

The current design emphasizes the two distinct phases in the tool: configuration and building. Once the Lua front-end has produced a DAG, it is no longer part of the runtime. This makes the build faster, as the C side can run unimpeded by script language performance and multi-threading issues.

Header scanning over implicit tracking

Some new-wave build tools (tup, ninja) depend on tool-specific options or OS/file system-level hooking to automatically track implicit dependencies. This can be convenient when it is supported, but significantly limits the build tool in what platforms it can run on and what tools you can use with it. Tundra has successfully been able to accommodate quirky Amiga toolsets, cross-compilation scenarios (Tundra itself is cross-compiled!) and other non-mainstream scenarios such as custom assemblers by scanning the file system for #include information and keeping a cache of such discovered dependencies.

What could be improved

There are some things I’d really like to fix with version 2.0:

Cleaning up stale build products in the file system

Because the DAG is generated fresh every time, the front-end doesn’t generate a complete DAG including all possible configurations and variants. This means a debug build will know nothing of the release build targets. If the build engine knew about all the possible outputs, it could safely deduce that certain output files are now stale and can be deleted. This keeps the file system nice and tidy.

Targeting the build engine with other types of configuration input

The build engine doesn’t really care about the Lua configuration, but because the tool is a monolithic executable with a single front-end, it’s awkward to try to use it to build something completely unrelated to code. For example, if you’d like to use the build engine to build game assets, it requires a significant amount of Lua scripting within the existing framework (which is code-oriented) to try to express those build rules. If the build engine could be configured in more ways, it would be more useful.

IDE integration

There is some integration with Visual Studio and Xcode in the current tool, but it is relatively limited. It would be cool if the current front-end could be made to generate project files that integrate with these tools more easily.

The Plan

Here’s how I’ve been thinking about fixing some of this, while keeping the best features around.

Completely divorce configuration and build engine

It would be cool to split the tool in three pieces, analogous to how the gcc compiler driver is really multiple executables:

  1. One driver executable (the one you invoke – tundra.exe). This is a light-weight front-end binary that first launches a configuration executable (if needed, see below) and then kicks the build engine.
  2. One configuration executable (tundra-luafrontend.exe). This is a program that encapsulates the task of reading configuration data and producing a DAG as output.
  3. One build engine executable (tundra-buildengine.exe). This program just reads DAG data from a file and executes the build as fast as possible.

This design enables some nice benefits:

  • The front-end binary only needs to run when needed (the build files have changed, glob queries would change results, that sort of thing.) Most builds use an identical input DAG. Therefore the top-level build can just skip running the front-end program entirely and use the cached DAG from the previous run. This will shave of precious milliseconds from every incremental build.
  • The build engine can be targeted directly using a custom front-end tool. This means you can plug in some other DAG generator to build your game assets or run other build rules that are not easily expressible in the Lua front-end. Maybe you already have existing build system data (Visual Studio projects?) that you want to run on the Tundra back-end.
  • Maintenance becomes simpler, as changes to the front-end can be tested in isolation without involving the back-end, and vice versa.

Produce complete DAG data – Clean up file system

If the build-engine sees complete DAG data (all configurations, variants, platforms and so on) it can clean up the file system state before it starts building. The reason this isn’t done today is performance; we don’t want the Lua front-end to generate 8x as many DAG nodes if we’re only going to build the debug config anyway. With the mandatory caching outlined above, this problem goes away–the data will only be regenerated when you change the build files anyway.

Feedback welcome

What would you like to see in Tundra 2.0? Drop me a line and let me know!

The App Store Nightmare

Edit: The specific error message I was getting is a bug in the App Store relating to disabling SpotLight on your Mac. Once this was fixed, apps from my Swedish region again started updating on the machine, so it seems Apple Support’s claim about IP address limitations is bogus. Why would they say there is one though?

I recently moved from Sweden to the US. Now that my bank is here in the US, I switched my Apple account over to the US region.

Doing so made everything under “Purchases” and “Updates” disappear in the Mac App Store. After a long frustrating email exchange I was told by the App Store support that apps are tied to a region, so if you have downloaded an app in one region it’s forever tied to that region. Their message was: You have to keep your account in the Swedish region to receive updates. Note that this applies even to free stuff like Twitter!

So at this point I created another account, to be able to download US apps. Everything was fine for a while, although I would have to sign in and out of both my accounts to get updates going. I was getting updates.

Today however a new problem emerged as I tried to update Xcode (which is on my Swedish account.) It’s no longer showing in “Purchases”, and “Updates”. Navigating to the Xcode page however shows an “Update” button. Clicking that brings up this ominous dialogue

Image

Signing in to my US account, I’m greeted with the same dialog.

Puzzled, I sent an email to the Mac App Store Support explaining I could no longer get updates for my apps, regardless of the account I used. Here’s the takeaway from that email:

Please know that since the app was purchase on your iTunes account in Sweden, you may not be able to update app, now that your IP address is in the US. Content in the iTunes store is country specific. Andreas, for this issue, you may need to call our AppleCare technical support team. A technical Advisor will be able to tell you about Apple’s complimentary and fee-based support. The technical Advisor can also assist you in determining what option might be most helpful to you in this case.

So because I have a US IP address I can no longer update my Swedish apps? I took the advice and called Apple Support. A lady informed me that because I didn’t have active AppleCare for my Macs there was nothing they could do. She offered me to by a one-time pass for $49 or a three year plan for $249. Paying $49 to help Apple debug their DRM? WTF!

I’m so pissed off at Apple right now I don’t know what to do except write it up as a warning to others.

To summarize:

  • If you switch regions, you can no longer update (or even see) your apps in the App Store. This goes even for free apps like Twitter and Xcode. So you need to keep your region set to the one where you purchased the app. But..
  • If you’re in the US, you can’t download updates for your non-US apps due to IP address constraints. (This seems possible again now that SpotLight is enabled.)
  • There’s no way to migrate an app from one account to another (even for freebies such as Twitter)
  • There’s no way to permanently remove an app from an account (to reinstall it on another)
  • App Store Support can do nothing to help you
  • Apple Support requires AppleCare to do anything about it (if they can, that’s unknown)

So this leaves me with the following options:

  1. Completely nuke my machine, create a new US Apple account and repurchase everything (yeah, right)
  2. Pay $49 to have Apple Support look into the issue
  3. Give up and use Linux (I hear Windows 8 will have a locked-down App store too)

If this is any indication of the future of software distribution in general, I think I’m going to just give up and go into farming. AFAIK there are no DRM or geolocation limitations on farming equipment yet.

The Register Pressure Twilight Zone

At Insomniac Games we’re currently doing a performance push which means lots of opportunity to look at compiler output. In this post I want to share a story on fighting a particular compiler optimization that was causing me problems.

One of my areas has been the decal system which is very heavy on 3d data processing — a perfect system for SIMD work. While working on one of the bigger compute loops I noticed something odd about the generated PPC code. All of the actual computation work was being done on the vector unit (as it should!), but there was a lot of stack traffic originating from the integer unit. The compiler was overcommitting integer registers it seemed, in what was essentially a pure-SIMD loop. Puzzled, I dug in and tried to work out what was going on.

Let’s look at the code. The loop I was working on processes 4 triangles at a time and has a structure like this:

Vertex* input = ...;
Vertex* output = ...;
int count = ...;

for (int i = 0; i < count; i += 12)
{
  VecSimd t1a = SimdLoad(&input[0].m_Position);
  VecSimd t1b = SimdLoad(&input[1].m_Position);
  VecSimd t1c = SimdLoad(&input[2].m_Position);
  // ... 9 more loads
  VecSimd n1a = SimdLoad(&input[0].m_Normal);
  VecSimd n1b = SimdLoad(&input[1].m_Normal);
  VecSimd n1c = SimdLoad(&input[2].m_Normal);
  // ... 9 more loads
  
  // (crunching)

  SimdStore(out0, &output[0].m_Position);
  SimdStore(out1, &output[1].m_Position);
  SimdStore(out2, &output[2].m_Position);
  // lots more stores

  // increment input and output pointers
  input += 12;
  output += 12;
}

What I found was that the vector loads were all being done from different (integer) registers and they were all being incremented individually by the compiler where I was expecting a single base register and 12 offsets. Because there are so many loads and stores going on, the compiler would run out of registers and start spilling them to the stack, generating load-hit-stores and memory penalties all over the loop!

Mike Day offered the following insight:

… in the case where there’s a sensible number of pointers – small enough not to incur spilling onto the stack – the ‘create n pointers’ approach is sensible for the compiler as long as it follows through by doing all the loads using the indexed addressing form of the load instruction. [...] The benefit is that instead of incrementing n pointers per pass (or one pointer n times), it only needs to increment a single offset by n times the stride, saving n-1 adds.

It turns out the optimizers for both our PPC compilers were so keen on using the indexed vector load instruction with a single increment that they would blindly overcommit the integer register pool to achieve that particular instruction selection.

The question then became–how do we force the optimizer to stop doing it? The optimizer can see that the input and output pointers are being moved consistently by a certain stride and any attempt on my part to use variations in the indexing expressions just ended up generating slightly different variations of the same bad behavior.

So let’s look at what we’d want if we were writing the loop ourselves in assembly. In this case a better code generation would be to have a single input register and increment it after each SIMD load, using just one register. The key to accomplish this code generation is in C++ to invalidate the optimizer’s assumptions about the pointers and strides so it can’t create an array of independent pointers. But how do we do that? After all, everything here is using linear memory accesses with a stride that’s well known at compile time, enabling the unwanted optimization.

Time for some trickery! The loop structure I ended up using looks like this:

static volatile uintptr_t s_secret_sauce = uintptr_t(ptrdiff_t(-1));
static volatile uint32_t  s_stride       = sizeof(Vertex);

// load our constants in from memory into registers
const uintptr_t secret_sauce             = s_secret_sauce;
const uint32_t  stride                   = s_stride;

Vertex* input = ...;
Vertex* output = ...;
int count = ...;

for (int i = 0; i < count; i += 12)
{
  // establish base pointer for all loads in the loop
  uintptr_t base = (uintptr_t(input) & secret_sauce) +
                   offsetof(Vertex, m_Position);

  // load and bump base pointer with stride
  VecSimd t1a = SimdLoad((void*)base); base += stride;
  VecSimd t1b = SimdLoad((void*)base); base += stride;
  VecSimd t1c = SimdLoad((void*)base); base += stride;

  // ... more loads
 
  // update input pointer for next loop iteration
  input = (Vertex*) base;

  // ... rest of loop  

  // ... stores handled similarily
}

There are two key things going on in this code that breaks the optimization pattern:

  • We’re loading stride and secret_sauce from memory before the loop, thereby preventing any compile-time knowledge of them.
  • We’re breaking any loop-wide analysis of the input and output pointers by masking with a value that is unknown to the optimizer, forcing it to rely on the sequence of statements we’ve laid out exactly.
  • This generated the desired instruction selection and removed all stack spills from the loop. In one particular case I was using as a performance test case this saved over 30k instructions over the loop lifetime, a significant chunk of work. It also removed several hundred load-hit-store penalties.

    This does generate an additional and instruction which is an artifact of the technique, but compared to the much worse stack spilling code this the way better deal. The and could also be replaced with an add of zero or something similar, but it will just amount to the same 1 or 2-cycle overhead in the end.

    This has to have been the first time I’ve used static volatile variables to improve performance. Hopefully it’s useful to someone else encountering this behavior!

    setjmp + goto = error handling bliss

    Oh yes, it’s a post about setjmp. Loved by few, feared by most — setjmp/longjmp is still the best way to write maintainable error handling code in some programs!

    Consider a line-based assembler or compiler where we want to deal with each line as it comes in from a file. We don’t want to stop compiling on the first line with an error, but flag errors and keep going. As soon as we find an error in a line, we want to abort processing that line and go on with the next one.

    The traditional way to solve this is to put a setjmp handler inside the line loop and be prepared to deal with a line going bad. But calling setjmp is expensive as it moves a lot of data from CPU registers to memory (depending on the architecture of course).

    Here’s an interesting way to structure the input loop instead which gives all the benefits of setjmp/longjmp but only incurs a cost when there is an error:

    static jmp_buf error_jmp;
    
    static int process_file(FILE* input)
    {
    	volatile int error_count = 0;
    retry:
    	if (0 != setjmp(error_jmp)) {
    		++error_count;
    		goto retry;
    	}
    
    	while (NULL != fgets(line_buf, sizeof line_buf, input)) {
    		/* process line, calling arbitrary functions */
    	}
    
    	return error_count;
    }
    

    Here’s how it works. We re-establish the setjmp point at the retry symbol rather than inside the line processing loop where we want to get actual work done. Whenever there is an error, the error handling function calls longjmp and we end up back in the if statement following the setjmp, which bumps the error count and resumes by resetting the jump buffer and then running the line loop again!

    To put it all together, here’s how you could design an error reporting function that terminates the processing of a line and resumes with the next one:

    static void parse_error(const char *fmt, ...)
    {
    	char buf[1024];
    	va_list args;
    
    	va_start(args, fmt);
    	vsnprintf(buf, sizeof buf, fmt, args);
    	buf[(sizeof buf)-1] = 0;
    	fprintf(stderr, "error: %s\n", buf);
    	va_end(args);
    
    	longjmp(error_jmp, 1);
    }
    

    You can then write pretty complex handlers in vanilla C without worrying about how to get out of a tight spot when you encounter a parse error:

    	if (!foo_bar(data))
    		parse_error("expected foo bar to be true");
    

    If you don’t like the static jmp_buf, that can of course be solved too by passing around more state in your processing code. If you do that, you can keep it on the stack inside the outer function which makes the code thread-safe as well.

    Follow

    Get every new post delivered to your Inbox.