Slides: SIMD at Insomniac Games (GDC 2015)

Here are my slides with presenter notes from my session “SIMD at Insomniac Games – How we do the shuffle” presented at GDC 2015. Enjoy!

GDC2015_Afredriksson_SIMD

3 thoughts on “Slides: SIMD at Insomniac Games (GDC 2015)”

  1. Hello, Andreas. I attended this talk at GDC, and despite being pretty familiar with SIMD programming myself, I found it very useful. In particular, pushing the idea of rearranging data to SoA was enough for me to try that approach wherever possible.

     However, one thing I figured I’d bring up: you mentioned that using dpps for an AoS dot product isn’t faster than the five-op multiply, shuffle, add, shuffle, add trick (for vector4). That’s been bugging me for the last few weeks, because I could have sworn I ran a test a couple of years back and confirmed dpps was slightly better. Based on Agner Fog’s Jaguar tables, dpps has an 11-cycle latency, while the shuffle sequence should take 12 cycles (2+2+3+2+3). I rewrote the little unit test on Durango and confirmed it for myself: it is indeed 11 and 12 cycles respectively.

     Try the code below if you’re curious. Change the #if 1 to #if 0 to use shuffles rather than dpps. Of course, if you’re always using SoA anyway, this wouldn’t be all that useful 🙂 Cheers, and thanks for the very helpful slides!

    #include <windows.h>    // QueryPerformanceCounter, SetThreadPriority
    #include <stdio.h>
    #include <smmintrin.h>  // SSE4.1, for _mm_dp_ps

    __forceinline __m128 DotProd(const __m128& a, const __m128& b)
    {
    #if 1
        // SSE4.1 dot product: mask 255 (0xFF) multiplies all four lanes
        // and broadcasts the sum to all four lanes of the result.
        return _mm_dp_ps(a, b, 255);
    #else
        // Multiply, then two shuffle+add steps to sum the four products.
        __m128 v0, v1;
        v0 = _mm_mul_ps(a, b);
        v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));
        v0 = _mm_add_ps(v0, v1);
        v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));
        v0 = _mm_add_ps(v0, v1);
        return v0;
    #endif
    }

    const float g_v0[4] = { 3, 4, 5, 6 };
    const float g_v1[4] = { 7, 8, 9, 10 };

    int main()
    {
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

        const unsigned loopCount = 1 << 28;

        LARGE_INTEGER tmStart, tmEnd;

        const __m128 v0 = _mm_loadu_ps(g_v0);
        const __m128 v1 = _mm_loadu_ps(g_v1);
        __m128 vv;
        vv = _mm_dp_ps(v0, v1, 255);  // warm up

        QueryPerformanceCounter(&tmStart);
        for (unsigned i = 0; i < loopCount; ++i)
            vv = DotProd(vv, vv);
        QueryPerformanceCounter(&tmEnd);

        printf("%f\n", (tmEnd.QuadPart - tmStart.QuadPart) / (double)loopCount);
        printf("%f\n", vv.m128_f32[0]);  // force the compiler to keep the loop result

        return 0;
    }

  2. Loved this talk. I would love to hear more about applying SIMD workloads to scenarios where SoA layouts are not a clear win: for example, an application that involves a lot of random accesses/writes of properties followed by a batch computation over all entities. The batch computation is ripe for an SoA layout and SIMD code, but the random accesses/writes, which touch a majority of the fields (sometimes multiple cache lines), perform terribly on SoA layouts, costing up to a cache miss per field access.

     Sometimes batching these random accesses and then using software prefetch to hide some of the latency helps, but in a collection of, say, a million entities, batching even a thousand might not yield enough shared cache lines between the entities, and getting the prefetching just right is pretty daunting too, since the policy depends on how many cache lines are needed during the actual processing.

     I am always a bit perplexed about how you actually update the fields in your examples. In your 100-players-and-n-doors problem, I am guessing that the positions of the 100 players change often. Don’t you get up to a cache miss on each of the x, y, z fields when updating the positions, since you are possibly writing to completely random indices? Maybe with 100 players it’s not so bad, but what about 100k entities?
