Does experience slow you down?

This is somewhat of a philosophical post inspired by a good discussion we had in my engine group at Insomniac Games that also spilled out on Twitter. The question I asked was: “Do you think having more experience slows you down? And how do you combat that?”

The question seems illogical at first. Having more experience should make us better and therefore faster at getting results, right? That might certainly be true for a lot of problems, but when we’re trying to innovate and break new ground, having many years of experience can lead to a condition known as “analysis paralysis”, where we find ourselves unable to get started because we’re finding problems with our design before it even exists.

Of course, that type of deep analysis has massive value in itself. We want our designs to be efficient, make good use of memory, have a great UX and scale up to whatever production constraints we might face. But if we worry about those things too early, we might never get to the point where we can tell if an idea has any value, because we never even get started.

In my experience this creeps up in particular when trying to innovate and break new ground. We have an idea that something would be useful and solve some particular problem. But years of experience with similar or related solutions might lead us down the analysis rabbit hole and we start to ask questions like:

  • How can I reduce the memory waste of this thing?
  • How will this thing scale to 10x the number of objects?
  • Will this approach lend itself to SIMD or a GPGPU implementation?
  • How will this work with 100 team members?
  • How is source control going to work?
  • How is it going to be network synchronized?
  • How can I maximize cache throughput for the data?

It’s easy to discard good ideas because we project possible future problems onto them. But if we actually try the idea, we might learn that only a few (or none!) of those concerns are valid, based on what we find while implementing it!

For example, if you’re just starting out and all you know is malloc() and linear searches, then everything you try to program will naturally map to those two concepts. There’s nothing to think about, and the neophyte will start programming right away to try his idea out. But someone with 10+ years of experience tuning custom memory allocators or crafting customized data structures will naturally tend to think much more deeply about a possible design in those two areas, and disqualify every possible solution based on some theoretical problem it might encounter.

After two weeks the neophyte has a working prototype of his new thing, while the battle-scarred veteran has four sheets of paper with half-formed plans for memory optimizations and data structure sketches, network synchronization dead ends, or whatever his particular domain is.

At this point, what is the better position to be in? Being able to judge the merits of a naive implementation that actually runs? Or not having a prototype implementation at all because we don’t see how we can get past some future problem we might run into if we ever get there?

I’d argue that the naive start is way better, and cheaper too. If the solution has value and actually solves some problem (however poorly), we can apply optimization experience much more easily to improve the design incrementally. Furthermore, all the hard data dependencies will naturally have been shaken out by actually taking an idea from start to finish, which means the space is constrained and optimizations fall out much more naturally.

My advice: don’t be afraid to write “bad” code to test an idea out. Violate your own basic principles. Use malloc, or even std::vector. Use a slow scripting language. Use scalar code full of doubles. Use a single lane of a GPU wavefront. Use Excel and a bunch of awful awk scripts. Whatever the shortest path is to getting some output that will validate the idea. Only then start worrying about all the real world constraints.

If things don’t work out, then all the other constraints are moot points anyway. If things do work out, we know we have the know-how to get them sorted out.

Here are a few concrete tips that might help:

  • Write down all your concerns. Cross off anything that relates to an area where you have deep experience already. Solve what’s left using a prototype. Come back to your concerns when you’re satisfied the real unknowns are solved.
  • Start new files with the comment: “Do the simplest possible thing that can work. You can fix it later.”
  • Keep prototype work off to the side to remove any idea of peer pressure early in the discovery process.
  • Practice. Participate in demo scene blitz events or game jams where the whole point is getting something working right away and then iterating on it.

Finally I want to share this fantastic piece of advice I got on Twitter from Branimir Karadžić (@bkaradzic):

 Just do it, hack! I approach code like games, rush deep into room, trigger all NPCs, die, after respawn I know where NPCs are.

When trying to break new ground, that’s what it’s all about.


Git version tagging Tundra builds

Sometimes you want to feed the current git version into a program built with Tundra and have it automatically regenerate as needed. This is neat because embedding the latest commit in a program is a lot more predictable and robust than maintaining some random version number you might or might not remember to bump when you cut a release.

We’re going to generate a C source file with a #define that just states the git commit.

Here’s how to do it in Tundra, step by step.

1. Make sure your build has a code generation pass that runs before your compilation pass:

 Passes = {
   CodeGen = { Name = "CodeGen", BuildOrder = 1 },
 },

2. Write a shell script that generates the header file from git information. We’ll put this in a small script next to the build files (called gitversion.sh here; it receives the output filename as its first argument):

#! /bin/sh
outfn=$1
vers=`git rev-parse --short HEAD`
echo "#define GIT_VERSION \"$vers\"" > $outfn

If you’re on Windows, you could do the same thing with a Perl script or whatever your favorite script environment is. If you want to support both UNIX and Windows in the same build, make the command an environment variable and customize it based on the config. (Or make a portable script/tool, your choice!)

3. Add a DefRule describing to Tundra how to generate the header with your script:

 require "tundra.syntax.glob"

 DefRule {
   Name       = "GitVersion",
   Command    = "./gitversion.sh $(@)",
   Pass       = "CodeGen",
   Annotation = "GitVersion $(@)",
   Blueprint  = {
     OutputFile = { Type = "string", Required = true },
   },
   Setup      = function (env, data)
     local inputs = Glob {
       Dir = ".git/refs/heads",
       Extensions = { "" },
     }
     inputs[#inputs+1] = ".git/HEAD"
     return {
       InputFiles  = inputs,
       OutputFiles = { data.OutputFile },
     }
   end,
 }

This thing looks complicated, but it’s pretty straightforward. We’re saying we have a rule to make some file out of thin air, and then in the Setup function we glob the git branches and tack on the main file git uses to track the current branch. We don’t actually use these inputs in our script, but their presence (and the glob) informs Tundra that this rule has to re-run whenever those files change, which gives us the desired property of having the latest commit in a source file.

Note how the rule definition references the pass we set up earlier, so we’re guaranteed it will run with a barrier between it and everything that could possibly include it. Passes is the mechanism Tundra uses to make implicit dependency scanning safe in multi-threaded builds with a mix of code generation and code compilation going on.

4. Invoke the rule in your source list:

 Sources = {
   GitVersion { OutputFile = "$(OBJECTDIR)/gitversion.h" },
   "other.c", "main.c", ...
 },

If you want to use the rule more than once, make it its own target with a local variable and add that variable to all source lists. Including the same rule above multiple times would generate an error, because Tundra refuses to have multiple build rules target the same output files (for good reasons!)

5. Include the generated header in one or more source files & enjoy

You’ll have to add $(OBJECTDIR) to CPPPATH as we put the generated header there.
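To illustrate the consumer side, here’s a minimal sketch of a source file using the generated define. The fallback #define and the BuildVersionString() helper are made up for this example; in the real build the value comes from the generated gitversion.h:

```cpp
#include <cstring>

// Normally this comes from the generated $(OBJECTDIR)/gitversion.h:
//   #include "gitversion.h"
// The fallback below only exists so this sketch compiles standalone.
#if !defined(GIT_VERSION)
#define GIT_VERSION "0000000"
#endif

// Hypothetical helper that bakes the commit hash into a banner string.
inline const char* BuildVersionString()
{
  return "mytool (git " GIT_VERSION ")";
}
```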

That’s it. Hopefully this helps someone trying to do more advanced Tundra setups with code generation. Some properties of this solution:

  • The gitversion.h file will only be updated when you commit something, or add a branch, so generating it won’t slow every build down.
  • The file is generated per-config, but you could set it to ConfigInvariant to generate one shared file for all configs if desired.
  • We have to call out to a script to make the actual file, because Tundra will only re-run the Lua scripts when they have changed (or when the result of a glob has changed.) This is part of why Tundra is fast, but it also means that the setup is more complex than, say, Make.

Side note: When I was making this post I found a bug in the handling of filename extensions in the code, so to get the Glob handler to return the git refs, you’ll have to grab the latest Tundra version as of 6/21/2014. Hey, it’s not often you ask for the extension of “.git/foo/bar” and expect an empty string!

Using C++11 Capturing Lambdas w/ Vanilla C API Functions

Here’s a C++11 specific idea I’ve been applying a lot recently.

When designing libraries, I don’t want to expose implementation details as templates. I want hidden types and vanilla C ABI functions on the link level. The reasons are for compile/link speed and to be able to keep a clean binary interface between modules. (I have a separate post about this subject you might like or hate.)

For example, at Insomniac Games, our general sorting routine is defined like this:

typedef int64_t (*QuickCompareFunc)(const void*, const void*);
typedef int64_t (*QuickCompareFuncReentrant)(const void*, const void*, void* arg);

void QuickSort(void* __restrict elems, int64_t elem_count,
               size_t elem_size, QuickCompareFunc cf);
void QuickSort(void* __restrict elems, int64_t elem_count,
               size_t elem_size, QuickCompareFuncReentrant cf,
               void* arg);

Note the second overload; it allows you to pass through some opaque state to the compare function. This is essential if you’re trying to sort an array of indices or something like that, where the elements in the array themselves aren’t enough to compare them.

While these functions are good for compile & link speeds, it’s easy to screw up calls while making changes under maintenance because of the typeless interface. So I went looking for a way to combine the standard calling style of these functions with more type safe lambda objects that allow closures.

It turns out it’s pretty easy to do, and it generates good code. You add a template function that gets inlined into the call site. That template function accepts any type of callback object and funnels it through the user_data pointer. Here’s the type-safe, lambda-friendly overload of QuickSort():

template <typename T, typename Func>
void QuickSort(T* __restrict elems, int64_t elem_count, Func f)
{
  auto compare_closure = [](const void* ll, const void* rr,
                            void* context) -> int64_t
  {
    const T* l = (const T*) ll;
    const T* r = (const T*) rr;
    return (*(Func*)context)(l, r);
  };

  QuickSort(elems, elem_count, sizeof elems[0], compare_closure, &f);
}

What’s going on here? Well, the function object we’re getting in (the lambda expression) is not going to be a match for the expected C function interface. So we create a new anonymous function (compare_closure) which is a match. We then pass a pointer to our function object through the user_data machinery, cast it to the right type, and call through the pointer.

What does the codegen look like? It looks good. The reason is that the compiler is smart enough to inline your entire lambda in the compare_closure, and there’s still only a single function call. One drawback I’ve found is that in unoptimized builds there’s an additional function call to jump through. Another potential drawback is if you have tons of these functions and you’re calling them a lot. Some compilers generate redundant copies of these template function bodies that the linker will have to clean up.

Armed with this template we can easily sort an array of Frobs using an index table:

Frob* frobs = ...;
size_t frob_count = ...;
int16_t* indices = ...;

QuickSort(indices, frob_count, [&](const int16_t* l, const int16_t* r) {
  return frobs[*l].m_SortKey - frobs[*r].m_SortKey;
});

What makes this better than std::sort or similar sorting templates? There’s still only one copy of the entire sorting algorithm, which has huge implications for code bloat if you’re calling QuickSort() a lot.

Since discovering this technique we’ve been applying it to various things (it’s jokingly called “Fredriksson’s Reverse”.) It’s pretty easy to paper over a separately-compiled routine with a typeless interface and turn it into a convenient API at low cost. All it requires is that you have some opaque pointer you can use to recover the function object.
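To make the pattern concrete outside the sorting example, here’s a self-contained sketch applying the same trick to a made-up C-style “find first match” routine. FindFirst and MatchFunc are illustrative names, not engine API:

```cpp
#include <cstdint>
#include <cstddef>

// C-style core with a typeless callback; one copy of the loop in the binary.
typedef bool (*MatchFunc)(const void* elem, void* user_data);

inline int64_t FindFirst(const void* elems, int64_t count,
                         size_t elem_size, MatchFunc mf, void* user_data)
{
  const char* base = (const char*) elems;
  for (int64_t i = 0; i < count; ++i)
    if (mf(base + i * elem_size, user_data))
      return i;
  return -1;
}

// Type-safe wrapper: the capturing lambda travels through user_data.
template <typename T, typename Func>
int64_t FindFirst(const T* elems, int64_t count, Func f)
{
  auto thunk = [](const void* elem, void* ctx) -> bool {
    return (*(Func*)ctx)((const T*) elem);
  };
  return FindFirst(elems, count, sizeof(T), thunk, &f);
}
```

A capturing lambda then works directly at the call site, e.g. FindFirst(values, n, [&](const int* v) { return *v > threshold; }).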

To wrap up, here’s another example where I’m using a separately-compiled zlib state machine pump that wants to pull in I/O data. The I/O data comes from some HTTP endpoint. Normally that would be a pain in the neck to write and adapt to a C-style callback interface, but with our trusty reverse-lambda hack, it’s a breeze:

bool ok = ZlibHelpers::DecompressFile(buffer, u_sz, d_sz, scratch,
  [&](void* buf, size_t size) -> bool {
    return DevHttpClient::ReadChunkedBlock(&m_HttpState, buf, size);
  });

Note that you can also access member variables (m_HttpState) and everything works as expected. My function signatures look like this (formatted for display in web browser..):

typedef bool (ZReadFn)(
  void* target_buffer,
  size_t read_size,
  void* user_data);

bool DecompressFile(
  void* buffer,
  size_t compressed_size,
  size_t decompressed_size,
  MemAllocStack* scratch,
  void* user_data,
  ZReadFn* read_fn);

template <typename Fn>
bool DecompressFile(
  void* buffer,
  size_t cs,
  size_t ds,
  MemAllocStack* scratch,
  Fn fn)
{
  auto closure = [](void* target_buffer,
                    size_t read_size,
                    void* user_data) -> bool
  {
    return (*(Fn*)user_data)(target_buffer, read_size);
  };

  return DecompressFile(buffer, cs, ds, scratch, &fn, closure);
}

Use the type system

This post isn’t very advanced, just a reminder about basics that help our lives.

The value you can squeeze out of the C and C++ type systems (for all their deficiencies) is massive.

Consider an asset loading API with multiple clients spread around a big engine. The goal is to allow code to request some asset data to be streamed in. Assume the asset data can also be split into a few distinct types, to support things like texture streaming (splitting base mips away from high mips and keeping two distinct data blocks around.)

One common approach is this API at the lowest level:

bool BeginAssetLoad(const char* path);

You’ll then maybe expose a couple of helper functions to build paths, plus assorted other conveniences.

But a better API is:

bool BeginAssetLoad(AssetId asset_id, AssetSubType sub_type);

Here are a few reasons why I think this style of API is better:

  1. If we decide to change how we represent asset ids or subtypes, we get compile errors for every single call site and we can fix them up statically.
  2. If we add a new required parameter, we again get compile errors and see the scope of the change.
  3. If we change our minds on how assets are represented on the file system/archive side, we have a single point of change as no one but this low-level function actually deals with the translation from asset ids to filenames or archive hashes, or whatever loading approach we use.
  4. We avoid questions of lifetime for the passed in character string.
  5. Client code can’t easily pass in invalid data.
  6. If our AssetId type is based on a simple numerical type, we can use conditional breakpoints to break when a particular asset is loaded.

Some very nice advantages. The basic idea is to retain type information for as long as possible through the data flow, and transform to a less precise form as late as possible. It’s easy to say, but sometimes hard to enforce!
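As a sketch of what that can look like, here are hypothetical, minimal versions of AssetId and AssetSubType (the names match the signature above, but the bodies are invented for illustration):

```cpp
#include <cstdint>

// Wrapping the raw integer in a distinct struct keeps asset ids from being
// mixed up with other integers or strings at call sites.
struct AssetId { uint64_t value; };

enum class AssetSubType : uint32_t { BaseMips, HighMips };

// Illustrative stand-in for the low-level entry point: the translation from
// id to filename/archive hash would live here, in exactly one place.
inline bool BeginAssetLoad(AssetId id, AssetSubType sub_type)
{
  (void) sub_type;
  return id.value != 0;  // pretend 0 is the invalid id
}
```

Call sites now read BeginAssetLoad(AssetId{ hash }, AssetSubType::BaseMips), and changing the representation of AssetId breaks every call site loudly at compile time.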

Optimizable Code

You don’t need to be a low-level programming expert to write fast code. The trick is to write code that CAN be optimized. (And the trick to doing that is to worry about memory accesses.)

For example, consider this piece of code:

class GameObject {
  float m_Pos[2];
  float m_Velocity[2];
  char m_Name[32];
  Model* m_Model;
  // ... other members ...
  float m_Foo;

public:
  void UpdateFoo(float f)
  {
    float mag = sqrtf(
      m_Velocity[0] * m_Velocity[0] +
      m_Velocity[1] * m_Velocity[1]);
    m_Foo += mag * f;
  }
};

Let’s say you have 100 or so of these game objects, and they’re spread out randomly in memory. Every frame you do this:

for each game object pointer p:
  p->UpdateFoo(f)

It’s starting to show up in the profiler as a hot-spot. What’s going on?

First, we need to understand where the cost is coming from. Let’s have a look at the UpdateFoo() function again. What’s expensive here?

void UpdateFoo(float f)
{
  float mag = sqrtf(                  // <= Square root
    m_Velocity[0] * m_Velocity[0] +   // <= Memory read
    m_Velocity[1] * m_Velocity[1]);   // <= Float math
  m_Foo += mag * f;                   // <= Memory r/m/w, float math
}

OK, so we have a mix of things going on here. Your instinct might be to think of the square root as the expensive part of the update, and 15 years ago you might have been right. But nowadays, memory accesses that miss cache are usually at least an order of magnitude more expensive than a square root.

A useful metric when eyeballing some code is to guess at the worst case execution time in terms of L2 cache misses.

Assume some x86-based platform with a typical cache line size of 64 bytes. Let’s say an L2 cache miss is 150 cycles (this is pretty typical.) The square root will likely be between 7 and 30 cycles depending on the microarchitecture (technically that’s latency and not “wait cycles”, but for our purposes here that will do.) The floating point math on a chip like this usually has a latency of 1-3 cycles per operation, but you can issue a bunch of them at once, so for this exercise let’s assume 10 cycles for the entire function (it’ll probably be less.)

Here’s our breakdown then:

  • Access m_Velocity – 150 cycles
  • Access m_Foo – 150 cycles (it’s on a separate cache line)
  • Float math: 10 cycles
  • Square root: 30 cycles

So we’re waiting 300 cycles on memory to complete 40 cycles of real work.

How can we optimize this?

Well, we’ve established that 88% (300 of 340 cycles) is spent just waiting for memory to come in. Even if we get rid of the square root, or replace it with some approximation, we haven’t really moved the needle at all.

We can get rid of the cache miss for m_Foo by moving it closer to m_Velocity. That saves us 150 cycles, and brings us down from 340 cycles to 190. But we’re still spending 79% (150 of 190 cycles) of the time waiting for memory.

To really speed this up, we have to rearrange the data. From the CPU’s point of view, the ideal case is a tightly packed, sequential block of RAM that we can stream through. Let’s change our design:

struct FooUpdateIn {
  float m_Velocity[2];
  float m_Foo;
};

struct FooUpdateOut {
  float m_Foo;
};

void UpdateFoos(const FooUpdateIn* in, size_t count, FooUpdateOut* out, float f)
{
  for (size_t i = 0; i < count; ++i) {
    float mag = sqrtf(
      in[i].m_Velocity[0] * in[i].m_Velocity[0] +
      in[i].m_Velocity[1] * in[i].m_Velocity[1]);
    out[i].m_Foo = in[i].m_Foo + mag * f;
  }
}

Notice that the actual code to compute the output has not changed. The number of memory accesses are the same. The key reason this routine will perform better than calling the original UpdateFoo() on objects scattered in memory is memory locality. The FooUpdateIn structure is 12 bytes. That means we can fit 5+ (5.33) of these into a single cache line. If we go back to our “worst case” analysis method we find that only every fifth iteration through the loop will pay for a memory access for the input structure.

But in practice most x86 CPUs have h/w prefetching units which will detect strided memory accesses like this and prefetch the upcoming lines into the cache, so the cost is actually much lower. We’re now doing roughly 5 x 40 = 200 cycles of work per cache line of input data. This completely overlaps the latency the hardware prefetcher needs to pull in the next line from main memory into the cache. So by transforming our update problem into a loop that plays well with the hardware we’ve essentially REMOVED the cache miss cost, which according to our eyeballing was around 90% of the time. That’s a 10x improvement, with no SIMD, assembly or other tricks anywhere to be found!

(Automatic h/w prefetchers typically only load into the L2 cache, and we’ll still be stuck with a L1 miss for each new cache line. But that’s easy to amend with a L1 prefetch hint when and if you need it.)

This code is now also suitable for further low-level optimization. We can much more easily reorder the input and outputs to use SIMD, SOA data layout and prefetch instructions to get the last bit of performance out of this. But that’s maybe an additional 2-4x speedup compared to the 10x speedup we got from rearranging the data structures and we can probably push that off knowing that we can reach for that hammer later.

This transformation didn’t touch the inner computation itself, but it affected every single design choice around it. That’s why it’s so important to write optimizable code from the start, keeping the slowest part of the machine in mind: the memory subsystem.

As a project is coming to a close and you need a few extra ms of optimization to hit the frame rate, you’re going to want functions like UpdateFoos() above that are easily adapted to low-level techniques like SIMD and prefetching. If the entire engine is designed around the first example’s “one at a time” approach, there’s just not enough work to sandwich in between the memory accesses, and optimizing is going to be very time-consuming as the code needs to be rewritten.

To boil this down into a few takeaways:

  • Memory latency is the most important design constraint when designing for performance
  • Use “number of memory accesses to distinct cache lines” as a rough indicator of worst case performance.
  • Worrying about memory layout right away gets you 90% of the way, without low-level coding
  • Forget about “optimal” or advanced code. Just get the basics right so you can optimize later if you need to.
  • Strive to design your routines as loops working on more than one item in memory order.
  • Try to feed each such routine with only the data it needs to do its job, as tightly packed as possible. This might involve splitting up large logical objects into distinct pieces.
  • When updates are happening in memory order with minimal waste, take another look at your profiler data to see if SIMD or prefetching is in order. (But most likely it’s already plenty fast.)
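One practical question the UpdateFoos() loop above raises is where the packed arrays come from. A common answer is to keep the hot fields in arrays from the start rather than gathering them per frame; here’s a minimal SOA sketch of that idea (FooState and UpdateAllFoos are invented names, not from the text above):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hot data stored packed from the start; cold data (name, model, ...) lives
// elsewhere, keyed by the same index.
struct FooState {
  std::vector<float> m_VelocityX;
  std::vector<float> m_VelocityY;
  std::vector<float> m_Foo;
};

// Same math as UpdateFoos() above, but walking SOA arrays in memory order.
void UpdateAllFoos(FooState& s, float f)
{
  const size_t count = s.m_Foo.size();
  for (size_t i = 0; i < count; ++i) {
    float mag = sqrtf(s.m_VelocityX[i] * s.m_VelocityX[i] +
                      s.m_VelocityY[i] * s.m_VelocityY[i]);
    s.m_Foo[i] += mag * f;
  }
}
```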

Optimize for Iteration Time

Our tools are not getting any faster. Linkers suck. Compilers bog down under loads of templates and inline functions. What can be done?

We can try to optimize for iteration time.  Separate compilation is the best iteration time tool we have in our arsenal when programming vanilla C/C++.

Take inline functions as an example. My advice: while developing a system, prefer to not make anything inline. Inlining functions forces you to expose your data structures in header files. Data structures change continually as you’re working on a system. Design away that iteration time pain from the beginning.

(If you’re now thinking – but my GetFoo() function will now be slower! – you’re probably worrying about the wrong thing. Do you really have just ONE Foo? Why don’t you have a function that works on all the Foos?)

Let’s take a look at the venerable “system singleton” design. We have some global system we need to expose to clients. There’s only one of these. A lot of engine code will expose that as a singleton class with all its state in the class declaration, accompanied by a lot of related classes and structs with inline functions.

If you want to change your internal state, you have to change the header, and needlessly recompile all client code.

A better way:

  • One header file with opaque types and function declarations only.
  • One or more private header files that expose data type details to implementation files.
  • One or more implementation files that define data structures and transforms on them.

Here’s what that can look like:

// Header file
namespace FooSystem {
  struct Frob;

  void Init(size_t max_frobs, Allocator* alloc);
  void Shutdown();

  void CreateFrobs(
       Frob** out_array,
       size_t count,
       const void* data, ...);

  void DestroyFrobs(Frob** frobs, size_t count);

  void UpdateFrobs(float delta);
  void RenderFrobs();
}

// Implementation file
struct FooSystem::Frob {
  // Members go here
  // -or- you might choose to make Frob just an
  //      integer handle to SOA state stored in
  //      the system instead
};

// Static "singleton" state of the system,
// named "S" for less typing.
static struct {
  // Space for frobs
  FooSystem::Frob* m_Pool;
  size_t m_PoolSize;
  // Whatever else you need
} S;

void FooSystem::Init(size_t max_frobs, Allocator* alloc)
{
  S.m_PoolSize = max_frobs;
  S.m_Pool = (Frob*) alloc->Alloc(sizeof(Frob) * max_frobs);
  // And so on
}

What does this buy us over the traditional “singleton class”?

  • You can change the system’s internal data types without recompiling client code – optimizing for turnaround on data design
  • It enforces entirely opaque types which cannot easily be tampered with (accidentally or not)
  • It minimizes surface area between systems and so makes debugging and trouble-shooting easier
  • It leads to a “more than one” thinking which is good for efficiency
  • It’s easier to make into a DLL/shared object, because client code doesn’t have to worry about the storage of the “system object” – if they can call the functions, the data is there too.
  • It’s possible to write in pure C as well, which is nice for compile speeds!
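Here’s a compilable mini-version of the pattern, with std::malloc standing in for the Allocator so it’s self-contained. Capacity() is invented here just to give clients something observable:

```cpp
#include <cstddef>
#include <cstdlib>

// --- public header: opaque type + free functions only ---
namespace FooSystem {
  struct Frob;  // clients never see the members

  void Init(size_t max_frobs);
  void Shutdown();
  size_t Capacity();
}

// --- implementation file: data types live here ---
struct FooSystem::Frob {
  float pos[2];
};

// File-scope "singleton" state; zero-initialized at startup.
static struct {
  FooSystem::Frob* m_Pool;
  size_t m_PoolSize;
} S;

void FooSystem::Init(size_t max_frobs)
{
  S.m_PoolSize = max_frobs;
  S.m_Pool = (Frob*) std::malloc(sizeof(Frob) * max_frobs);
}

void FooSystem::Shutdown()
{
  std::free(S.m_Pool);
  S.m_Pool = nullptr;
  S.m_PoolSize = 0;
}

size_t FooSystem::Capacity()
{
  return S.m_PoolSize;
}
```

Changing the members of Frob or the layout of S never touches the public header, so client code doesn’t recompile.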

How to simplify cross-platform builds

Just a few guidelines I think are helpful if your codebase targets multiple platforms:

Build your code using a cross-platform external build system rather than relying on a specific IDE. Or at least generate the IDE-specific data if you’re OK with a lower degree of quality in your build. Many IDEs don’t track dependencies correctly and don’t clean up stale output files when they are no longer produced by a build, leading to lost productivity and programmers needlessly cleaning everything as a cure-all at the first sight of problems.

Avoid checking in and maintaining IDE-specific files. Instead generate them as needed from your build system metadata. There are plenty of options for generating Visual Studio and Xcode project files. This way you keep your build data in one place for all platforms where it can be maintained more easily.

Use the file system for configuration. Instead of excluding and including files manually, put platform-specific files in their own subdirectories. If you do this, a single recursive glob of the source directory together with a few filters will give you what you want without hundreds of exceptions. This also makes it trivial to add or remove files from the build, and you’ll avoid having stale files lying around that can confuse things later. Don’t feel bad about deleting source files, you can always get them back from version control later if you change your mind.
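As a sketch of what that filtering can look like (using C++17 std::filesystem; the platform names and directory layout are made up for this example):

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical sketch of "configuration by file system": glob everything
// under a source root, then drop files that live in a subdirectory named
// after some *other* platform.
std::vector<std::string> GlobSources(const fs::path& root,
                                     const std::string& platform)
{
  const std::vector<std::string> all_platforms = { "win32", "osx", "linux" };
  std::vector<std::string> result;
  for (const auto& entry : fs::recursive_directory_iterator(root)) {
    if (!entry.is_regular_file())
      continue;
    bool excluded = false;
    for (const auto& part : entry.path())
      for (const auto& p : all_platforms)
        if (p != platform && part == p)
          excluded = true;  // lives under another platform's subdirectory
    if (!excluded)
      result.push_back(entry.path().filename().string());
  }
  std::sort(result.begin(), result.end());
  return result;
}
```

With a layout like src/main.c, src/linux/io_posix.c and src/win32/io_win.c, globbing for "linux" picks up everything except the win32 subtree, with no per-file exception lists.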

Avoid tool-specific settings if you can. Instead define higher-level concepts that you can map to IDE settings across platforms more easily. For example, don’t define your warning levels in terms of Visual Studio’s /W0-4, but instead say “This project uses our strict warning settings”. You can define elsewhere for each toolset what that means (preferably in one place.)

Avoid per-file options in configuration. If you need to disable a MSVC-specific warning, do so with a pragma in the file. If you’re making your own code generators, design them so that options controlling their operation go in the source files themselves rather than in command line options. That means files can be globbed up to a simple list from the file system without relying on lots of tedious settings that must be kept in sync between various build files and platforms. It also means the settings are in one place and will be correctly integrated with the file during maintenance.

Build your custom tools from source instead of relying on prebuilt binaries. This is friendlier to version control systems that don’t like binaries (git) and also helps you with cross-platform bootstrapping. The tools are also always in sync with the source files and you’ll see immediately if you broke something while changing what you thought was an unrelated file.