Optimizable Code

You don’t need to be a low-level programming expert to write fast code. The trick is to write code that CAN be optimized. (And the trick to doing that is to worry about memory accesses.)

For example, consider this piece of code:

#include <math.h> // for sqrtf()

struct Model; // defined elsewhere

class GameObject {
public:
  float m_Pos[2];
  float m_Velocity[2];
  char m_Name[32];
  Model* m_Model;
  // ... other members ...
  float m_Foo;

  void UpdateFoo(float f)
  {
    float mag = sqrtf(
      m_Velocity[0] * m_Velocity[0] +
      m_Velocity[1] * m_Velocity[1]);
    m_Foo += mag * f;
  }
};

Let’s say you have 100 or so of these game objects, and they’re spread out randomly in memory. Every frame you do this:

for each game object pointer p:
 p->UpdateFoo(f)
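
Spelled out in C++, that per-frame update might look something like the sketch below (the container and function names are made up for illustration); the important part is that each pointer leads to a separately allocated object somewhere on the heap:

#include <vector>

// Hypothetical storage: objects are allocated individually over the lifetime
// of the game, so the pointers end up scattered all over the heap.
std::vector<GameObject*> g_GameObjects;

void UpdateAllFoos(float f)
{
  for (GameObject* p : g_GameObjects)
    p->UpdateFoo(f); // each call chases a pointer to a (likely) cold object
}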

It’s starting to show up in the profiler as a hot-spot. What’s going on?

First, we need to understand where the cost is coming from. Let’s have a look at the UpdateFoo() function again. What’s expensive here?

void UpdateFoo(float f)
{
  float mag = sqrtf(                  // <= Square root
    m_Velocity[0] * m_Velocity[0] +   // <= Memory read
    m_Velocity[1] * m_Velocity[1]);   // <= Float math
  m_Foo += mag * f;                   // <= Memory r/m/w, Float math
}

OK, so we have a mix of things going on here. Your instinct might be to think of the square root as the expensive part of the update, and 15 years ago you might have been right. But nowadays, memory accesses that miss the cache are usually at least an order of magnitude more expensive than a square root.

A useful metric when eyeballing some code is to guess at the worst case execution time in terms of L2 cache misses.

Assume some x86-based platform, with a typical cache line size of 64 bytes. Let’s say an L2 cache miss is 150 cycles (this is pretty typical). The square root will likely be between 7 and 30 cycles depending on the microarchitecture (technically that’s latency and not “wait cycles”, but for our purposes here that will do). The floating point math on a chip like this usually has a latency of 1-3 cycles per operation, but you can issue a bunch of them at once, so for this exercise let’s assume they’re 10 cycles for the entire function (it’ll probably be less).

Here’s our breakdown then:

  • Access m_Velocity – 150 cycles
  • Access m_Foo – 150 cycles (it’s on a separate cache line)
  • Float math: 10 cycles
  • Square root: 30 cycles

So we’re waiting 300 cycles on memory to complete 40 cycles of real work.

How can we optimize this?

Well, we’ve established that 88% (300 of 340 cycles) is spent just waiting for memory to come in. Even if we got rid of the square root, or replaced it with some approximation, we wouldn’t really have moved the needle at all.

We can get rid of the cache miss for m_Foo by moving it closer to m_Velocity. That saves us 150 cycles, and brings us down from 340 cycles to 190. But we’re still spending 79% (150 of 190 cycles) of the time waiting for memory.
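
Concretely, that smaller fix is just a reshuffle of the member order so the fields UpdateFoo() touches share a cache line; something like this (assuming the object itself doesn’t happen to straddle a line boundary):

class GameObject {
public:
  float m_Pos[2];
  float m_Velocity[2];
  float m_Foo;       // moved up next to m_Velocity: one cache line covers both
  char m_Name[32];
  Model* m_Model;
  // ... other members ...

  void UpdateFoo(float f); // unchanged
};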

To really speed this up, we have to rearrange the data. From the CPU’s point of view, the ideal case is a tightly packed, sequential block of RAM that we can stream through. Let’s change our design:

struct FooUpdateIn {
  float m_Velocity[2];
  float m_Foo;
};

struct FooUpdateOut {
  float m_Foo;
};

void UpdateFoos(const FooUpdateIn* in, size_t count, FooUpdateOut* out, float f)
{
  for (size_t i = 0; i < count; ++i) {
    float mag = sqrtf(
      in[i].m_Velocity[0] * in[i].m_Velocity[0] +
      in[i].m_Velocity[1] * in[i].m_Velocity[1]);
    out[i].m_Foo = in[i].m_Foo + mag * f;
  }
}

Notice that the actual code to compute the output has not changed. The number of memory accesses is the same. The key reason this routine will perform better than calling the original UpdateFoo() on objects scattered in memory is memory locality. The FooUpdateIn structure is 12 bytes. That means we can fit 5+ (5.33) of these into a single cache line. If we go back to our “worst case” analysis method, we find that only every fifth iteration through the loop will pay for a memory access for the input structure.
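
To make that concrete, here is a sketch of what the calling side could look like, with the inputs kept in one contiguous, tightly packed array (the containers and the rewritten UpdateAllFoos() are illustrative, not part of the original design):

#include <vector>

static_assert(sizeof(FooUpdateIn) == 12, "three floats, tightly packed");
// 64-byte cache line / 12 bytes per input = 5.33 inputs per line fetched.

std::vector<FooUpdateIn> g_FooIn;   // kept packed and in update order
std::vector<FooUpdateOut> g_FooOut; // resized to match g_FooIn

void UpdateAllFoos(float f)
{
  g_FooOut.resize(g_FooIn.size());
  UpdateFoos(g_FooIn.data(), g_FooIn.size(), g_FooOut.data(), f);
}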

But in practice most x86 CPUs have h/w prefetching units which will detect strided memory accesses like this and prefetch the upcoming lines into the cache, so the cost is actually much lower. We’re now doing roughly 5 x 40 = 200 cycles of work per cache line of input data. That completely overlaps the latency the hardware prefetcher needs to pull the next line from main memory into the cache. So by transforming our update problem into a loop that plays well with the hardware, we’ve essentially REMOVED the cache miss cost, which according to our eyeballing was around 90% of the time. That’s a 10x improvement, with no SIMD, assembly or other tricks anywhere to be found!

(Automatic h/w prefetchers typically only load into the L2 cache, and we’ll still be stuck with an L1 miss for each new cache line. But that’s easy to amend with an L1 prefetch hint when and if you need it.)
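
If that L1 miss ever matters, the amendment can be as small as a prefetch hint issued a little ahead of the current element. Here is a rough sketch using the SSE prefetch intrinsic; the look-ahead distance of 16 elements is a guess you would tune against a profiler:

#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T0

void UpdateFoos(const FooUpdateIn* in, size_t count, FooUpdateOut* out, float f)
{
  for (size_t i = 0; i < count; ++i) {
    // Ask for a line a few iterations ahead to be pulled into L1.
    if (i + 16 < count)
      _mm_prefetch((const char*)&in[i + 16], _MM_HINT_T0);

    float mag = sqrtf(
      in[i].m_Velocity[0] * in[i].m_Velocity[0] +
      in[i].m_Velocity[1] * in[i].m_Velocity[1]);
    out[i].m_Foo = in[i].m_Foo + mag * f;
  }
}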

This code is now also suitable for further low-level optimization. We can much more easily reorder the inputs and outputs to use SIMD, an SoA data layout and prefetch instructions to get the last bit of performance out of this. But that’s maybe an additional 2-4x speedup compared to the 10x speedup we got from rearranging the data structures, so we can probably push that off knowing that we can reach for that hammer later.
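
For completeness, here is roughly what that later hammer could look like: a hypothetical SoA variant where the velocities and foo values live in separate 16-byte aligned arrays and count is padded to a multiple of 4. None of this is required for the 10x win above:

#include <stddef.h>
#include <xmmintrin.h>

void UpdateFoosSoA(const float* vel_x, const float* vel_y,
                   const float* foo_in, float* foo_out,
                   size_t count, float f)
{
  const __m128 vf = _mm_set1_ps(f);
  for (size_t i = 0; i < count; i += 4) {
    __m128 x   = _mm_load_ps(vel_x + i);          // four x velocities
    __m128 y   = _mm_load_ps(vel_y + i);          // four y velocities
    __m128 mag = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                        _mm_mul_ps(y, y)));
    __m128 foo = _mm_add_ps(_mm_load_ps(foo_in + i),
                            _mm_mul_ps(mag, vf));
    _mm_store_ps(foo_out + i, foo);               // four results at once
  }
}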

That data-layout transformation didn’t touch the inner computation itself, but it affected every single design choice around it. That’s why it’s so important to write optimizable code from the start, keeping the slowest part of the machine in mind: the memory subsystem.

As a project is coming to a close and you need a few extra ms of optimization to hit the frame rate, you’re going to want functions like UpdateFoos() above that are easily adapted to low-level techniques like SIMD and prefetching. If the entire engine is designed around the first example’s “one at a time” approach, there’s just not enough work to sandwich in between the memory accesses, and optimizing is going to be very time-consuming as the code needs to be rewritten.

To boil this down into a few takeaways:

  • Memory latency is the most important constraint when designing for performance.
  • Use “number of memory accesses to distinct cache lines” as a rough indicator of worst case performance.
  • Worrying about memory layout right away gets you 90% of the way, without any low-level coding.
  • Forget about “optimal” or advanced code. Just get the basics right so you can optimize later if you need to.
  • Strive to design your routines as loops working on more than one item in memory order.
  • Try to feed each such routine with only the data it needs to do its job, as tightly packed as possible. This might involve splitting up large logical objects into distinct pieces.
  • When updates are happening in memory order with minimal waste, take another look at your profiler data to see if SIMD or prefetching is in order. (But most likely it’s already plenty fast.)