Walking Set Bits

Sometimes you want to walk the set bits of an integer and do something for each bit. For example, consider printing a human-readable version of a flag field (which could have, say, up to 32 unique flags set).

The naive way to accomplish this is of course to visit every bit and check whether it's set. But we’re cooler than that.

A better approach is to repeatedly pick off the lowest set bit until there’s nothing left. Here’s what that looks like:

while (bits) {
  uint32_t lsb = bits & ~(bits - 1);
  // do something with lsb here..
  bits &= ~lsb; // mask off lsb and keep going
}

This gives you the value of each bit as you’re walking it (i.e. bit 7 will have the value 0x80 if we’re counting from 0). If you need the bit’s position as well, it’s convenient to use the Intel BSF (bit scan forward) instruction via compiler intrinsics:

  int bitpos;
  while (0 != (bitpos = __builtin_ffs(bits))) {
    uint32_t lsb = 1u << (bitpos - 1); // unsigned literal avoids UB when bit 31 is set
    // use bitpos, lsb as required..
    // __builtin_ffs is 1-based on GCC/Clang
    bits &= ~lsb;
  }

It sort of sucks that we still have to compute the LSB value via a shift just so we can remove it and move on to the next bit. If your CPU has a fast CTZ instruction (count trailing zeroes), we can flip the computation around and derive the bit position from the LSB instead:

while (bits) {
  uint32_t lsb = bits & ~(bits - 1);
  int bitpos = __builtin_ctz(lsb);
  // use bitpos, lsb as required..
  bits &= ~lsb; // mask off lsb and keep going
}
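Putting this together, here's a self-contained sketch of the flag-field printing idea from the top of the post. It assumes GCC/Clang (for __builtin_ctz), and the flag names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical flag names, just for illustration (covers bits 0..3 only).
static const char* const kFlagNames[] = { "VISIBLE", "DIRTY", "LOCKED", "CACHED" };

// Writes a '|'-separated list of set flag names into out; returns flag count.
static int DecodeFlags(uint32_t bits, char* out, size_t out_size)
{
  int count = 0;
  size_t pos = 0;
  out[0] = '\0';
  while (bits) {
    uint32_t lsb = bits & ~(bits - 1);  // isolate lowest set bit
    int bitpos = __builtin_ctz(lsb);    // its position, counting from 0
    pos += std::snprintf(out + pos, out_size - pos, "%s%s",
                         count ? "|" : "", kFlagNames[bitpos]);
    bits &= ~lsb;                       // clear it and keep going
    ++count;
  }
  return count;
}
```

Decoding 0x5 with this walks bits 0 and 2 and produces "VISIBLE|LOCKED".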

The intrinsics shown here work on GCC and Clang. There are similar ones for MSVC but I’m too lazy to write Windows examples!

Does experience slow you down?

This is somewhat of a philosophical post inspired by a good discussion we had in my engine group at Insomniac Games that also spilled out on Twitter. The question I asked was: “Do you think having more experience slows you down? And how do you combat that?”

The question seems illogical at first. Having more experience should make us better and therefore faster at getting results, right? That might certainly be true for a lot of problems, but when we’re trying to innovate and break new ground having many years of experience can lead to a condition known as “analysis paralysis” where we find ourselves unable to get started because we’re finding problems with our design before it even exists.

Of course that type of deep analysis has massive value itself. We want our designs to be efficient, make good use of memory, have a great UX and scale up to whatever production constraints we might face. But if we worry about those things too early we might never get to the points where we can know if an idea has any value because we never even get started.

In my experience this crops up in particular when trying to innovate and break new ground. We have an idea that something would be useful and solve some particular problem. But years of experience with similar or related solutions might lead us down the analysis rabbit hole, and we start to ask questions like:

  • How can I reduce the memory waste of this thing?
  • How will this thing scale to 10x the number of objects?
  • Will this approach lend itself to SIMD or a GPGPU implementation?
  • How will this work with 100 team members?
  • How is source control going to work?
  • How is it going to be network synchronized?
  • How can I maximize cache throughput for the data?

It’s easy to discard good ideas because we project possible future problems onto them. But if we were actually to try the idea we might learn that only a few (or none!) of those concerns are valid, based on what we find while implementing it!

For example, if you’re just starting out and all you know is malloc() and linear searches, then everything you try to program will naturally map to those two concepts. There’s nothing to think about, and the neophyte will start programming right away to try his idea out. But someone with 10+ years of experience tuning custom memory allocators or crafting customized data structures will naturally tend to think much more deeply about a possible design in those two areas and disqualify every possible solution based on some theoretical problem it might encounter.

After two weeks the neophyte has a working prototype of his new thing, while the battle-scarred veteran has four sheets of paper with half-formed plans for memory optimization, data structure sketches, network synchronization dead ends, or whatever his particular domain is.

At this point, what is the better position to be in? Being able to judge the merits of a naive implementation that actually runs? Or not having a prototype implementation at all because we don’t see how we can get past some future problem we might run into if we ever get there?

I’d argue that the naive start is way better, and cheaper too. If the solution has value and actually solves some problem (however poorly), we can apply optimization experience much more easily to improve the design incrementally. Furthermore, all the hard data dependencies will naturally have been shaken out by actually taking an idea from start to finish, which means the space is constrained and optimizations fall out much more naturally.

My advice: don’t be afraid to write “bad” code to test an idea out. Violate your own basic principles. Use malloc, or even std::vector. Use a slow scripting language. Use scalar code full of doubles. Use a single lane of a GPU wavefront. Use Excel and a bunch of awful awk scripts. Whatever the shortest path is to getting some output that will validate the idea. Only then start worrying about all the real world constraints.

If things don’t work out, then all the other constraints are moot points anyway. If things do work out, we know we have the know-how to get them sorted out.

Here are a few concrete tips that might help:

  • Write down all your concerns. Cross off anything that relates to an area where you have deep experience already. Solve what’s left using a prototype. Come back to your concerns when you’re satisfied the real unknowns are solved.
  • Start new files with the comment: “Do the simplest possible thing that can work. You can fix it later.”
  • Keep prototype work off to the side to remove any idea of peer pressure early in the discovery process.
  • Practice. Participate in demo scene blitz events or game jams where the whole point is getting something working right away and then iterating on it.

Finally I want to share this fantastic piece of advice I got on Twitter from Branimir Karadžić (@bkaradzic):

 Just do it, hack! I approach code like games, rush deep into room, trigger all NPCs, die, after respawn I know where NPCs are.

When trying to break new ground, that’s what it’s all about.

Git version tagging Tundra builds

Sometimes you want to feed the current git version into a program built with Tundra and have it automatically regenerate as needed. This is neat because embedding the latest commit in a program is a lot more predictable and robust than maintaining some random version number you might or might not remember to bump when you cut a release.

We’re going to generate a C source file with a #define that just states the git commit.

Here’s how to do it in Tundra, step by step.

1. Make sure your build has a code generation pass that runs before your compilation pass:

 Passes = {
   CodeGen = { Name = "CodeGen", BuildOrder = 1 },
 },

2. Write a shell script that generates the header file from git information. We’ll put this in gengitversion.sh:

#!/bin/sh
outfn="$1"
vers=$(git rev-parse --short HEAD)
echo "#define GIT_VERSION \"$vers\"" > "$outfn"

If you’re on Windows, you could do the same thing with a Perl script or whatever your favorite script environment is. If you want to support both UNIX and Windows in the same build, make the command an environment variable and customize it based on the config. (Or make a portable script/tool, your choice!)

3. Add a DefRule describing to Tundra how to generate the header with your script:

 require "tundra.syntax.glob"

 DefRule {
   Name       = "GitVersion",
   Command    = "./gengitversion.sh $(@)",
   Pass       = "CodeGen",
   Annotation = "GitVersion $(@)",
   Blueprint  = {
     OutputFile = { Type="string", Required = true },
   },
   Setup      = function (env, data) 
     local inputs = Glob {
       Dir = ".git/refs/heads",
       Extensions = { "" },
     }
     inputs[#inputs+1] = ".git/HEAD"
     return {
       InputFiles = inputs,
       OutputFiles = { data.OutputFile },
     }
   end,
 }

This thing looks complicated, but it’s pretty straightforward. We’re saying we have a rule to make some file out of thin air, and then in the Setup function we glob the git branches and tack on the main file git uses to track the current branch. We don’t actually use these inputs in our script, but their presence (and the glob) informs Tundra that this rule has to re-run whenever those files change, which gives us the desired property of having the latest commit in a source file.

Note how the rule definition references the pass we set up earlier, so we’re guaranteed it will run with a barrier between it and everything that could possibly include it. Passes is the mechanism Tundra uses to make implicit dependency scanning safe in multi-threaded builds with a mix of code generation and code compilation going on.

4. Invoke the rule in your source list:

 Sources = {
   GitVersion { OutputFile = "$(OBJECTDIR)/gitversion.h" },
   "other.c", "main.c", ...
 }

If you want to use the rule more than once, make it its own target with a local variable and add that variable to all source lists. Including the same rule above multiple times would generate an error, because Tundra refuses to have multiple build rules target the same output files (for good reasons!)

5. Include the generated header in one or more source files & enjoy

You’ll have to add $(OBJECTDIR) to CPPPATH as we put the generated header there.
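For illustration, consuming the generated define might look like this (the fallback value is only there so the sketch compiles outside the Tundra build, which would provide the real gitversion.h):

```cpp
#include <cstring>

// In the real build this comes from the generated $(OBJECTDIR)/gitversion.h;
// the fallback keeps this sketch self-contained.
#ifndef GIT_VERSION
#define GIT_VERSION "0000000"
#endif

// Handy for about boxes, crash reports, log headers, etc.
inline const char* BuildVersion()
{
  return GIT_VERSION;
}
```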

That’s it. Hopefully this helps someone trying to do more advanced Tundra setups with code generation. Some properties of this solution:

  • The gitversion.h file will only be updated when you commit something, or add a branch, so generating it won’t slow every build down.
  • The file is generated per-config, but you could set it to ConfigInvariant to generate one shared for every config if desired.
  • We have to call out to a script to make the actual file, because Tundra will only re-run the Lua scripts when they have changed (or when the result of a glob has changed.) This is part of why Tundra is fast, but it also means that the setup is more complex than say Make.

Side note: When I was making this post I found a bug in the handling of filename extensions in the code, so to get the Glob handler to return the git refs, you’ll have to grab the latest Tundra version as of 6/21/2014. Hey, it’s not often you ask for the extension of “.git/foo/bar” and expect an empty string!

Using C++11 Capturing Lambdas w/ Vanilla C API Functions

Here’s a C++11 specific idea I’ve been applying a lot recently.

When designing libraries, I don’t want to expose implementation details as templates. I want hidden types and vanilla C ABI functions on the link level. The reasons are for compile/link speed and to be able to keep a clean binary interface between modules. (I have a separate post about this subject you might like or hate.)

For example, at Insomniac Games, our general sorting routine is defined like this:

typedef int64_t (*QuickCompareFunc)(const void*, const void*);
typedef int64_t (*QuickCompareFuncReentrant)(const void*, const void*, void* arg);

void QuickSort(void* __restrict elems, int64_t elem_count,
               size_t elem_size, QuickCompareFunc cf);
void QuickSort(void* __restrict elems, int64_t elem_count,
               size_t elem_size, QuickCompareFuncReentrant cf,
               void* arg);

Note the second overload; it allows you to pass through some opaque state to the compare function. This is essential if you’re trying to sort an array of indices or something like that, where the elements in the array themselves aren’t enough to compare them.

While these functions are good for compile & link speeds, it’s easy to screw up calls while making changes under maintenance because of the typeless interface. So I went looking for a way to combine the standard calling style of these functions with more type safe lambda objects that allow closures.

It turns out it’s pretty easy to do, and it generates good code. You add a template function that gets inlined into the call site. That template function accepts any type of callback object and funnels it through the user_data pointer. Here’s the type-safe, lambda-friendly QuickSort() overload:

template <typename T, typename Func>
void QuickSort(T* __restrict elems, int64_t elem_count, Func f)
{
  auto compare_closure = [](const void* ll, const void* rr,
                            void* context) -> int64_t
  {
    const T* l = (const T*) ll;
    const T* r = (const T*) rr;
    return (*(Func*)context)(l, r);
  };

  QuickSort(elems, elem_count, sizeof elems[0], compare_closure, &f);
}

What’s going on here? Well, the function object we’re getting in (the lambda expression) is not going to be a match for the expected C function interface. So we create a new anonymous function (compare_closure) which is a match. We then pass a pointer to our function object through the user_data machinery, cast it to the right type, and call through the pointer.

What does the codegen look like? It looks good. The reason is that the compiler is smart enough to inline your entire lambda in the compare_closure, and there’s still only a single function call. One drawback I’ve found is that in unoptimized builds there’s an additional function call to jump through. Another potential drawback is if you have tons of these functions and you’re calling them a lot. Some compilers generate redundant copies of these template function bodies that the linker will have to clean up.

Armed with this template we can easily sort an array of Frobs using an index table:

Frob* frobs = ...;
size_t frob_count = ...;
int16_t* indices = ...;

QuickSort(indices, frob_count, [&](const int16_t* l, const int16_t* r)
{
  return frobs[*l].m_SortKey - frobs[*r].m_SortKey;
});

What makes this better than std::sort or similar sorting templates? There’s still only one copy of the entire sorting algorithm, which has huge implications for code bloat if you’re calling QuickSort() a lot.

Since discovering this technique we’ve been applying it to various things (it’s jokingly called “Fredriksson’s Reverse”.) It’s pretty easy to paper over a separately-compiled routine with a typeless interface and turn it into a convenient API at low cost. All it requires is that you have some opaque pointer you can use to recover the function object.
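To see the whole pattern in isolation, here's a self-contained sketch. A toy insertion sort stands in for the real separately-compiled routine; the names (SortRaw, Sort) are made up for the example:

```cpp
#include <cstdint>
#include <cstring>

typedef int64_t (*CompareFn)(const void*, const void*, void* arg);

// Stand-in for the separately-compiled, typeless sorting routine.
static void SortRaw(void* elems, int64_t count, size_t size,
                    CompareFn cf, void* arg)
{
  char* base = (char*)elems;
  for (int64_t i = 1; i < count; ++i) {
    for (int64_t j = i;
         j > 0 && cf(base + j * size, base + (j - 1) * size, arg) < 0;
         --j) {
      char tmp[64]; // fine for this sketch; assumes element size <= 64
      memcpy(tmp, base + j * size, size);
      memcpy(base + j * size, base + (j - 1) * size, size);
      memcpy(base + (j - 1) * size, tmp, size);
    }
  }
}

// Type-safe wrapper: the captureless lambda converts to the C callback type,
// and the capturing function object rides along in the arg pointer.
template <typename T, typename Func>
void Sort(T* elems, int64_t count, Func f)
{
  auto thunk = [](const void* ll, const void* rr, void* ctx) -> int64_t {
    return (*(Func*)ctx)((const T*)ll, (const T*)rr);
  };
  SortRaw(elems, count, sizeof elems[0], thunk, &f);
}
```

Sorting an index table by an external key array then works just like the Frob example above.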

To wrap up, here’s another example where I’m using a separately-compiled zlib state machine pump that wants to pull in I/O data. The I/O data comes from some HTTP endpoint. Normally that would be a pain in the neck to write and adapt to a C-style callback interface, but with our trusty reverse-lambda hack, it’s a breeze:

bool ok = ZlibHelpers::DecompressFile(buffer, u_sz, d_sz, scratch,
  [&](void* buf, size_t size) -> bool
  {
    return DevHttpClient::ReadChunkedBlock(&m_HttpState, buf, size);
  });

Note that you can also access member variables (m_HttpState) and everything works as expected. My function signatures look like this:

typedef bool (ZReadFn)(void* target_buffer,
                       size_t read_size,
                       void* user_data);

bool DecompressFile(void* buffer,
                    size_t compressed_size,
                    size_t decompressed_size,
                    MemAllocStack* scratch,
                    void* user_data,
                    ZReadFn* read_fn);

template <typename Fn>
bool DecompressFile(void* buffer, size_t cs, size_t ds,
                    MemAllocStack* scratch, Fn fn)
{
  auto closure = [](void* target_buffer, size_t read_size,
                    void* user_data) -> bool
  {
    return (*(Fn*)user_data)(target_buffer, read_size);
  };

  return DecompressFile(buffer, cs, ds, scratch, &fn, closure);
}

Use the type system

This post isn’t very advanced, just a reminder about basics that help our lives.

The value you can squeeze out of the C and C++ type systems (for all their deficiencies) is massive.

Consider an asset loading API with multiple clients spread around a big engine. The goal is to allow code to request that some asset data be streamed in. Assume the asset data can also be split into a few distinct types, to support things like texture streaming (splitting base mips away from high mips and keeping two distinct data blocks around.)

One common approach is this API at the lowest level:

bool BeginAssetLoad(const char* path);

You’ll then maybe expose a couple of helper functions to build paths and so on.

But a better API is:

bool BeginAssetLoad(AssetId asset_id, AssetSubType sub_type);

Here are a few reasons why I think this style of API is better:

  1. If we decide to change how we represent asset ids or subtypes, we get compile errors for every single call site and we can fix them up statically.
  2. If we add a new required parameter, we again get compile errors and see the scope of the change.
  3. If we change our minds on how assets are represented on the file system/archive side, we have a single point of change as no one but this low-level function actually deals with the translation from asset ids to filenames or archive hashes, or whatever loading approach we use.
  4. We avoid questions of lifetime for the passed in character string.
  5. Client code can’t easily pass in invalid data.
  6. If our AssetId type is based on a simple numerical type, we can use conditional breakpoints to break when a particular asset is loaded.
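A minimal sketch of what such types could look like (the representation here is hypothetical, and the stub body only exists to make the sketch complete; the real id would be whatever your asset pipeline assigns):

```cpp
#include <cstdint>

// Strong wrapper so a raw integer or a string can't be passed by accident.
struct AssetId {
  uint64_t value; // e.g. a hash of the source path, assigned at data-build time
};

enum class AssetSubType : uint8_t { BaseMips, HighMips };

// Single point of translation from ids to whatever loading scheme we use.
inline bool BeginAssetLoad(AssetId asset_id, AssetSubType sub_type)
{
  // A real implementation would kick off streaming here; this stub just
  // rejects the obviously invalid zero id.
  (void)sub_type;
  return asset_id.value != 0;
}
```

Because AssetId is its own type, `BeginAssetLoad(42, ...)` or `BeginAssetLoad("path", ...)` won't compile, which is exactly the point.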

Some very nice advantages. The basic idea is to retain type information for as long as possible through the data flow, and transform to a less precise form as late as possible. It’s easy to say, but sometimes hard to enforce!