The Register Pressure Twilight Zone

At Insomniac Games we’re currently doing a performance push which means lots of opportunity to look at compiler output. In this post I want to share a story on fighting a particular compiler optimization that was causing me problems.

One of my areas has been the decal system which is very heavy on 3d data processing — a perfect system for SIMD work. While working on one of the bigger compute loops I noticed something odd about the generated PPC code. All of the actual computation work was being done on the vector unit (as it should!), but there was a lot of stack traffic originating from the integer unit. The compiler was overcommitting integer registers it seemed, in what was essentially a pure-SIMD loop. Puzzled, I dug in and tried to work out what was going on.

Let’s look at the code. The loop I was working on processes 4 triangles at a time and has a structure like this:

Vertex* input = ...;
Vertex* output = ...;
int count = ...;

for (int i = 0; i < count; i += 12)
{
  VecSimd t1a = SimdLoad(&input[0].m_Position);
  VecSimd t1b = SimdLoad(&input[1].m_Position);
  VecSimd t1c = SimdLoad(&input[2].m_Position);
  // ... 9 more loads
  VecSimd n1a = SimdLoad(&input[0].m_Normal);
  VecSimd n1b = SimdLoad(&input[1].m_Normal);
  VecSimd n1c = SimdLoad(&input[2].m_Normal);
  // ... 9 more loads
  
  // (crunching)

  SimdStore(out0, &output[0].m_Position);
  SimdStore(out1, &output[1].m_Position);
  SimdStore(out2, &output[2].m_Position);
  // lots more stores

  // increment input and output pointers
  input += 12;
  output += 12;
}

What I found was that the vector loads were all being done from different (integer) registers and they were all being incremented individually by the compiler where I was expecting a single base register and 12 offsets. Because there are so many loads and stores going on, the compiler would run out of registers and start spilling them to the stack, generating load-hit-stores and memory penalties all over the loop!

Mike Day offered the following insight:

… in the case where there’s a sensible number of pointers – small enough not to incur spilling onto the stack – the ‘create n pointers’ approach is sensible for the compiler as long as it follows through by doing all the loads using the indexed addressing form of the load instruction. […] The benefit is that instead of incrementing n pointers per pass (or one pointer n times), it only needs to increment a single offset by n times the stride, saving n-1 adds.

It turns out the optimizers for both our PPC compilers were so keen on using the indexed vector load instruction with a single increment that they would blindly overcommit the integer register pool to achieve that particular instruction selection.

The question then became–how do we force the optimizer to stop doing it? The optimizer can see that the input and output pointers are being moved consistently by a certain stride and any attempt on my part to use variations in the indexing expressions just ended up generating slightly different variations of the same bad behavior.

So let’s look at what we’d want if we were writing the loop ourselves in assembly. In this case a better code generation would be to have a single input register and increment it after each SIMD load, using just one register. The key to accomplish this code generation is in C++ to invalidate the optimizer’s assumptions about the pointers and strides so it can’t create an array of independent pointers. But how do we do that? After all, everything here is using linear memory accesses with a stride that’s well known at compile time, enabling the unwanted optimization.

Time for some trickery! The loop structure I ended up using looks like this:

static volatile uintptr_t s_secret_sauce = uintptr_t(ptrdiff_t(-1));
static volatile uint32_t  s_stride       = sizeof(Vertex);

// load our constants in from memory into registers
const uintptr_t secret_sauce             = s_secret_sauce;
const uint32_t  stride                   = s_stride;

Vertex* input = ...;
Vertex* output = ...;
int count = ...;

for (int i = 0; i < count; i += 12)
{
  // establish base pointer for all loads in the loop
  uintptr_t base = (uintptr_t(input) & secret_sauce) +
                   offsetof(Vertex, m_Position);

  // load and bump base pointer with stride
  VecSimd t1a = SimdLoad((void*)base); base += stride;
  VecSimd t1b = SimdLoad((void*)base); base += stride;
  VecSimd t1c = SimdLoad((void*)base); base += stride;

  // ... more loads
 
  // update input pointer for next loop iteration
  input = (Vertex*) base;

  // ... rest of loop  

  // ... stores handled similarily
}

There are two key things going on in this code that breaks the optimization pattern:

  • We’re loading stride and secret_sauce from memory before the loop, thereby preventing any compile-time knowledge of them.
  • We’re breaking any loop-wide analysis of the input and output pointers by masking with a value that is unknown to the optimizer, forcing it to rely on the sequence of statements we’ve laid out exactly.
  • This generated the desired instruction selection and removed all stack spills from the loop. In one particular case I was using as a performance test case this saved over 30k instructions over the loop lifetime, a significant chunk of work. It also removed several hundred load-hit-store penalties.

    This does generate an additional and instruction which is an artifact of the technique, but compared to the much worse stack spilling code this the way better deal. The and could also be replaced with an add of zero or something similar, but it will just amount to the same 1 or 2-cycle overhead in the end.

    This has to have been the first time I’ve used static volatile variables to improve performance. Hopefully it’s useful to someone else encountering this behavior!

    Advertisements

    Leave a Reply

    Fill in your details below or click an icon to log in:

    WordPress.com Logo

    You are commenting using your WordPress.com account. Log Out / Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out / Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out / Change )

    Google+ photo

    You are commenting using your Google+ account. Log Out / Change )

    Connecting to %s