Walking Set Bits

Sometimes you want to walk the set bits of an integer and do something for each bit. For example, consider printing a human-readable version of a flag field (which could have, say, up to 32 unique flags set).

The naive way to accomplish this is of course to visit every bit and check whether it's set. But we’re cooler than that.

A better approach is to repeatedly pick off the lowest set bit until there’s nothing left. Here’s what that looks like:

while (bits) {
  uint32_t lsb = bits & ~(bits - 1);
  // do something with lsb here..
  bits &= ~lsb; // mask off lsb and keep going
}

This gives you the value of each bit as you’re walking it (i.e. bit 7 will have the value 0x80 if we’re counting from 0). If you need the bit’s position as well, it’s convenient to use the Intel BSF (bit scan forward) instruction via compiler intrinsics:

  int bitpos;
  while (0 != (bitpos = __builtin_ffs(bits))) {
    uint32_t lsb = 1u << (bitpos - 1); // unsigned literal avoids UB when bit 31 is set
    // use bitpos, lsb as required..
    // __builtin_ffs is 1-based on GCC/Clang
    bits &= ~lsb;
  }

It sort of sucks that we still have to compute the LSB value via a shift just so we can remove it and move on to the next bit. If your CPU has a fast CTZ instruction (count trailing zeroes), we can flip the computation around and derive the bit position from the LSB instead:

while (bits) {
  uint32_t lsb = bits & ~(bits - 1);
  int bitpos = __builtin_ctz(lsb);
  // use bitpos, lsb as required..
  bits &= ~lsb; // mask off lsb and keep going
}
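Putting this together, here's a self-contained sketch of the flag-field printing idea from the top of the post. It assumes GCC/Clang (for __builtin_ctz), and the flag names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical flag names, just for illustration (covers bits 0..3 only).
static const char* const kFlagNames[] = { "VISIBLE", "DIRTY", "LOCKED", "CACHED" };

// Writes a '|'-separated list of set flag names into out; returns flag count.
static int DecodeFlags(uint32_t bits, char* out, size_t out_size)
{
  int count = 0;
  size_t pos = 0;
  out[0] = '\0';
  while (bits) {
    uint32_t lsb = bits & ~(bits - 1);  // isolate lowest set bit
    int bitpos = __builtin_ctz(lsb);    // its position, counting from 0
    pos += std::snprintf(out + pos, out_size - pos, "%s%s",
                         count ? "|" : "", kFlagNames[bitpos]);
    bits &= ~lsb;                       // clear it and keep going
    ++count;
  }
  return count;
}
```

Decoding 0x5 with this walks bits 0 and 2 and produces "VISIBLE|LOCKED".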

The intrinsics shown here work on GCC and Clang. There are similar ones for MSVC but I’m too lazy to write Windows examples!

Does experience slow you down?

This is somewhat of a philosophical post inspired by a good discussion we had in my engine group at Insomniac Games that also spilled out on Twitter. The question I asked was: “Do you think having more experience slows you down? And how do you combat that?”

The question seems illogical at first. Having more experience should make us better and therefore faster at getting results, right? That might certainly be true for a lot of problems, but when we’re trying to innovate and break new ground having many years of experience can lead to a condition known as “analysis paralysis” where we find ourselves unable to get started because we’re finding problems with our design before it even exists.

Of course that type of deep analysis has massive value itself. We want our designs to be efficient, make good use of memory, have a great UX and scale up to whatever production constraints we might face. But if we worry about those things too early we might never get to the points where we can know if an idea has any value because we never even get started.

In my experience this crops up in particular when trying to innovate and break new ground. We have an idea that something would be useful and solve some particular problem. But years of experience with similar or related solutions might lead us down the analysis rabbit hole, and we start to ask questions like:

  • How can I reduce the memory waste of this thing?
  • How will this thing scale to 10x the number of objects?
  • Will this approach lend itself to SIMD or a GPGPU implementation?
  • How will this work with 100 team members?
  • How is source control going to work?
  • How is it going to be network synchronized?
  • How can I maximize cache throughput for the data?

It’s easy to discard good ideas because we project possible future problems onto them. But if we were actually to try the idea we might learn that only a few (or none!) of those concerns are valid, based on what we find while implementing it!

For example, if you’re just starting out and all you know is malloc() and linear searches, then everything you try to program will naturally map to those two concepts. There’s nothing to think about, and the neophyte will start programming right away to try his idea out. But someone with 10+ years of experience tuning custom memory allocators or crafting customized data structures will naturally tend to think much more deeply about a possible design in those two areas and disqualify every possible solution based on some theoretical problem it might encounter.

After two weeks the neophyte has a working prototype of his new thing, while the battle-scarred veteran has four sheets of paper with half-formed plans for memory optimization, data structure sketches, network synchronization dead ends, or whatever his particular domain is.

At this point, what is the better position to be in? Being able to judge the merits of a naive implementation that actually runs? Or not having a prototype implementation at all because we don’t see how we can get past some future problem we might run into if we ever get there?

I’d argue that the naive start is way better, and cheaper too. If the solution has value and actually solves some problem (however poorly), we can apply optimization experience much more easily to improve the design incrementally. Furthermore, all the hard data dependencies will naturally have been shaken out by actually taking an idea from start to finish, which means the space is constrained and optimizations fall out much more naturally.

My advice: don’t be afraid to write “bad” code to test an idea out. Violate your own basic principles. Use malloc, or even std::vector. Use a slow scripting language. Use scalar code full of doubles. Use a single lane of a GPU wavefront. Use Excel and a bunch of awful awk scripts. Whatever the shortest path is to getting some output that will validate the idea. Only then start worrying about all the real world constraints.

If things don’t work out, then all the other constraints are moot points anyway. If things do work out, we know we have the know-how to get them sorted out.

Here are a few concrete tips that might help:

  • Write down all your concerns. Cross off anything that relates to an area where you have deep experience already. Solve what’s left using a prototype. Come back to your concerns when you’re satisfied the real unknowns are solved.
  • Start new files with the comment: “Do the simplest possible thing that can work. You can fix it later.”
  • Keep prototype work off to the side to remove any idea of peer pressure early in the discovery process.
  • Practice. Participate in demo scene blitz events or game jams where the whole point is getting something working right away and then iterating on it.

Finally I want to share this fantastic piece of advice I got on Twitter from Branimir Karadžić (@bkaradzic):

 Just do it, hack! I approach code like games, rush deep into room, trigger all NPCs, die, after respawn I know where NPCs are.

When trying to break new ground, that’s what it’s all about.

Git version tagging Tundra builds

Sometimes you want to feed the current git version into a program built with Tundra and have it automatically regenerate as needed. This is neat because embedding the latest commit in a program is a lot more predictable and robust than maintaining some random version number you might or might not remember to bump when you cut a release.

We’re going to generate a C source file with a #define that just states the git commit.

Here’s how to do it in Tundra, step by step.

1. Make sure your build has a code generation pass that runs before your compilation pass:

 Passes = {
   CodeGen = { Name = "CodeGen", BuildOrder = 1 },
 },

2. Write a shell script that generates the header file from git information. We’ll put this in gengitversion.sh:

#!/bin/sh
outfn="$1"
vers=$(git rev-parse --short HEAD)
echo "#define GIT_VERSION \"$vers\"" > "$outfn"

If you’re on Windows, you could do the same thing with a Perl script or whatever your favorite script environment is. If you want to support both UNIX and Windows in the same build, make the command an environment variable and customize it based on the config. (Or make a portable script/tool, your choice!)

3. Add a DefRule describing to Tundra how to generate the header with your script:

 require "tundra.syntax.glob"

 DefRule {
   Name       = "GitVersion",
   Command    = "./gengitversion.sh $(@)",
   Pass       = "CodeGen",
   Annotation = "GitVersion $(@)",
   Blueprint  = {
     OutputFile = { Type="string", Required = true },
   },
   Setup      = function (env, data) 
     local inputs = Glob {
       Dir = ".git/refs/heads",
       Extensions = { "" },
     }
     inputs[#inputs+1] = ".git/HEAD"
     return {
       InputFiles = inputs,
       OutputFiles = { data.OutputFile },
     }
   end,
 }

This thing looks complicated, but it’s pretty straightforward. We’re saying we have a rule to make some file out of thin air, and then in the Setup function we glob the git branches and tack on the main file git uses to track the current branch. We don’t actually use these inputs in our script, but their presence (and the glob) informs Tundra that this rule has to re-run whenever those files change, which gives us the desired property of having the latest commit in a source file.

Note how the rule definition references the pass we set up earlier, so we’re guaranteed it will run with a barrier between it and everything that could possibly include it. Passes is the mechanism Tundra uses to make implicit dependency scanning safe in multi-threaded builds with a mix of code generation and code compilation going on.

4. Invoke the rule in your source list:

 Sources = {
   GitVersion { OutputFile = "$(OBJECTDIR)/gitversion.h" },
   "other.c", "main.c", ...
 }

If you want to use the rule more than once, make it its own target with a local variable and add that variable to all source lists. Including the same rule above multiple times would generate an error, because Tundra refuses to have multiple build rules target the same output files (for good reasons!)

5. Include the generated header in one or more source files & enjoy

You’ll have to add $(OBJECTDIR) to CPPPATH as we put the generated header there.
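For illustration, consuming the generated define might look like this (the fallback value is only there so the sketch compiles outside the Tundra build, which would provide the real gitversion.h):

```cpp
#include <cstring>

// In the real build this comes from the generated $(OBJECTDIR)/gitversion.h;
// the fallback keeps this sketch self-contained.
#ifndef GIT_VERSION
#define GIT_VERSION "0000000"
#endif

// Handy for about boxes, crash reports, log headers, etc.
inline const char* BuildVersion()
{
  return GIT_VERSION;
}
```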

That’s it. Hopefully this helps someone trying to do more advanced Tundra setups with code generation. Some properties of this solution:

  • The gitversion.h file will only be updated when you commit something, or add a branch, so generating it won’t slow every build down.
  • The file is generated per-config, but you could set it to ConfigInvariant to generate one shared for every config if desired.
  • We have to call out to a script to make the actual file, because Tundra will only re-run the Lua scripts when they have changed (or when the result of a glob has changed.) This is part of why Tundra is fast, but it also means that the setup is more complex than say Make.

Side note: When I was making this post I found a bug in the handling of filename extensions in the code, so to get the Glob handler to return the git refs, you’ll have to grab the latest Tundra version as of 6/21/2014. Hey, it’s not often you ask for the extension of “.git/foo/bar” and expect an empty string!

Using C++11 Capturing Lambdas w/ Vanilla C API Functions

Here’s a C++11 specific idea I’ve been applying a lot recently.

When designing libraries, I don’t want to expose implementation details as templates. I want hidden types and vanilla C ABI functions on the link level. The reasons are for compile/link speed and to be able to keep a clean binary interface between modules. (I have a separate post about this subject you might like or hate.)

For example, at Insomniac Games, our general sorting routine is defined like this:

typedef int64_t (*QuickCompareFunc)(const void*, const void*);
typedef int64_t (*QuickCompareFuncReentrant)(const void*, const void*, void* arg);

void QuickSort(void* __restrict elems, int64_t elem_count,
               size_t elem_size, QuickCompareFunc cf);
void QuickSort(void* __restrict elems, int64_t elem_count,
               size_t elem_size, QuickCompareFuncReentrant cf,
               void* arg);

Note the second overload; it allows you to pass through some opaque state to the compare function. This is essential if you’re trying to sort an array of indices or something like that, where the elements in the array themselves aren’t enough to compare them.

While these functions are good for compile & link speeds, it’s easy to screw up calls while making changes under maintenance because of the typeless interface. So I went looking for a way to combine the standard calling style of these functions with more type safe lambda objects that allow closures.

It turns out it’s pretty easy to do, and it generates good code. You add a template function that gets inlined into the call site. That template function accepts any type of callback object and funnels it through the user_data pointer. Here’s the type-safe, lambda-friendly QuickSort() overload:

template <typename T, typename Func>
void QuickSort(T* __restrict elems, int64_t elem_count, Func f)
{
  auto compare_closure = [](const void* ll, const void* rr,
                            void* context) -> int64_t
  {
    const T* l = (const T*) ll;
    const T* r = (const T*) rr;
    return (*(Func*)context)(l, r);
  };

  QuickSort(elems, elem_count, sizeof elems[0], compare_closure, &f);
}

What’s going on here? Well, the function object we’re getting in (the lambda expression) is not going to be a match for the expected C function interface. So we create a new anonymous function (compare_closure) which is a match. We then pass a pointer to our function object through the user_data machinery, cast it to the right type, and call through the pointer.

What does the codegen look like? It looks good. The reason is that the compiler is smart enough to inline your entire lambda in the compare_closure, and there’s still only a single function call. One drawback I’ve found is that in unoptimized builds there’s an additional function call to jump through. Another potential drawback is if you have tons of these functions and you’re calling them a lot. Some compilers generate redundant copies of these template function bodies that the linker will have to clean up.

Armed with this template we can easily sort an array of Frobs using an index table:

Frob* frobs = ...;
size_t frob_count = ...;
int16_t* indices = ...;

QuickSort(indices, frob_count, [&](const int16_t* l, const int16_t* r)
{
  return frobs[*l].m_SortKey - frobs[*r].m_SortKey;
});

What makes this better than std::sort or similar sorting templates? There’s still only one copy of the entire sorting algorithm, which has huge implications for code bloat if you’re calling QuickSort() a lot.

Since discovering this technique we’ve been applying it to various things (it’s jokingly called “Fredriksson’s Reverse”.) It’s pretty easy to paper over a separately-compiled routine with a typeless interface and turn it into a convenient API at low cost. All it requires is that you have some opaque pointer you can use to recover the function object.
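To see the whole pattern in isolation, here's a self-contained sketch. A toy insertion sort stands in for the real separately-compiled routine; the names (SortRaw, Sort) are made up for the example:

```cpp
#include <cstdint>
#include <cstring>

typedef int64_t (*CompareFn)(const void*, const void*, void* arg);

// Stand-in for the separately-compiled, typeless sorting routine.
static void SortRaw(void* elems, int64_t count, size_t size,
                    CompareFn cf, void* arg)
{
  char* base = (char*)elems;
  for (int64_t i = 1; i < count; ++i) {
    for (int64_t j = i;
         j > 0 && cf(base + j * size, base + (j - 1) * size, arg) < 0;
         --j) {
      char tmp[64]; // fine for this sketch; assumes element size <= 64
      memcpy(tmp, base + j * size, size);
      memcpy(base + j * size, base + (j - 1) * size, size);
      memcpy(base + (j - 1) * size, tmp, size);
    }
  }
}

// Type-safe wrapper: the captureless lambda converts to the C callback type,
// and the capturing function object rides along in the arg pointer.
template <typename T, typename Func>
void Sort(T* elems, int64_t count, Func f)
{
  auto thunk = [](const void* ll, const void* rr, void* ctx) -> int64_t {
    return (*(Func*)ctx)((const T*)ll, (const T*)rr);
  };
  SortRaw(elems, count, sizeof elems[0], thunk, &f);
}
```

Sorting an index table by an external key array then works just like the Frob example above.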

To wrap up, here’s another example where I’m using a separately-compiled zlib state machine pump that wants to pull in I/O data. The I/O data comes from some HTTP endpoint. Normally that would be a pain in the neck to write and adapt to a C-style callback interface, but with our trusty reverse-lambda hack, it’s a breeze:

bool ok = ZlibHelpers::DecompressFile(buffer, u_sz, d_sz, scratch,
  [&](void* buf, size_t size) -> bool
  {
    return DevHttpClient::ReadChunkedBlock(&m_HttpState, buf, size);
  });

Note that you can also access member variables (m_HttpState) and everything works as expected. My function signatures look like this:

typedef bool (ZReadFn)(void* target_buffer,
                       size_t read_size,
                       void* user_data);

bool DecompressFile(void* buffer,
                    size_t compressed_size,
                    size_t decompressed_size,
                    MemAllocStack* scratch,
                    void* user_data,
                    ZReadFn* read_fn);

template <typename Fn>
bool DecompressFile(void* buffer, size_t cs, size_t ds,
                    MemAllocStack* scratch, Fn fn)
{
  auto closure = [](void* target_buffer, size_t read_size,
                    void* user_data) -> bool
  {
    return (*(Fn*)user_data)(target_buffer, read_size);
  };

  return DecompressFile(buffer, cs, ds, scratch, &fn, closure);
}

Use the type system

This post isn’t very advanced, just a reminder about basics that help our lives.

The value you can squeeze out of the C and C++ type systems (for all their deficiencies) is massive.

Consider an asset loading API with multiple clients spread around a big engine. The goal is to allow code to request that some asset data be streamed in. Assume the asset data can also be split into a few distinct types, to support things like texture streaming (splitting base mips away from high mips and keeping two distinct data blocks around.)

One common approach is this API at the lowest level:

bool BeginAssetLoad(const char* path);

You’ll then maybe expose a couple of helper functions to build paths and so on.

But a better API is:

bool BeginAssetLoad(AssetId asset_id, AssetSubType sub_type);

Here are a few reasons why I think this style of API is better:

  1. If we decide to change how we represent asset ids or subtypes, we get compile errors for every single call site and we can fix them up statically.
  2. If we add a new required parameter, we again get compile errors and see the scope of the change.
  3. If we change our minds on how assets are represented on the file system/archive side, we have a single point of change as no one but this low-level function actually deals with the translation from asset ids to filenames or archive hashes, or whatever loading approach we use.
  4. We avoid questions of lifetime for the passed in character string.
  5. Client code can’t easily pass in invalid data.
  6. If our AssetId type is based on a simple numerical type, we can use conditional breakpoints to break when a particular asset is loaded.
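A minimal sketch of what such types could look like (the representation here is hypothetical, and the stub body only exists to make the sketch complete; the real id would be whatever your asset pipeline assigns):

```cpp
#include <cstdint>

// Strong wrapper so a raw integer or a string can't be passed by accident.
struct AssetId {
  uint64_t value; // e.g. a hash of the source path, assigned at data-build time
};

enum class AssetSubType : uint8_t { BaseMips, HighMips };

// Single point of translation from ids to whatever loading scheme we use.
inline bool BeginAssetLoad(AssetId asset_id, AssetSubType sub_type)
{
  // A real implementation would kick off streaming here; this stub just
  // rejects the obviously invalid zero id.
  (void)sub_type;
  return asset_id.value != 0;
}
```

Because AssetId is its own type, `BeginAssetLoad(42, ...)` or `BeginAssetLoad("path", ...)` won't compile, which is exactly the point.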

Some very nice advantages. The basic idea is to retain type information for as long as possible through the data flow, and transform to a less precise form as late as possible. It’s easy to say, but sometimes hard to enforce!