Color interpolation w/ 64-bit multiplications

This is a little trick which can be useful when you want to interpolate between RGBA colors kept in 32-bit integers. Maybe you’re setting up an old-school vertex stream with a few colors here and there. I’m sure there are hundreds of descriptions of the algorithm, but here is my implementation in all its glory:

u32 mix_colors(u32 a, u32 b, u32 factor)
{
	u32 af = 256 - factor;
	u32 bf = factor;

	// u32 abgr -> 0a0g0b0r

	u64 al = (a & 0x00ff00ff) | (((u64)(a & 0xff00ff00)) << 24);
	u64 bl = (b & 0x00ff00ff) | (((u64)(b & 0xff00ff00)) << 24);

	// -> a_g_b_r_
	u64 mix = (al * af + bl * bf);

	// shift & combine back into 32-bit quantity
	u32 result = ((mix >> 32) & 0xff00ff00) | ((mix & 0xff00ff00) >> 8);

	return result;
}

Basically you call the function with two packed 32-bit RGBA values you want to interpolate between, and a factor in the range 0-256 that determines how much of A and B you get (except for some slight precision loss due to truncation). The cool trick is to do the four additions and multiplications in parallel by exploiting the wider registers on 64-bit architectures.

This generates pretty decent code on x86-64 which has 64-bit mul:

        mov     %edi, %edi
        mov     %esi, %esi
        movq    %rsi, %rax
        andl    $4278255360, %eax
        salq    $24, %rax
        andl    $16711935, %esi
        orq     %rsi, %rax
        mov     %edx, %ecx
        imulq   %rcx, %rax
        movl    $256, %ecx
        subl    %edx, %ecx
        movq    %rdi, %rdx
        andl    $4278255360, %edx
        salq    $24, %rdx
        andl    $16711935, %edi
        orq     %rdi, %rdx
        imulq   %rdx, %rcx
        addq    %rcx, %rax
        movq    %rax, %rdx
        andl    $4278255360, %edx
        shrq    $8, %rdx
        shrq    $32, %rax
        andl    $-16711936, %eax
        orl     %edx, %eax

On the basic x86 instruction set, it’s not such a good idea:

        xorl    %edx, %edx
        movl    %eax, -48(%ebp)
        movl    %edx, -44(%ebp)
        movl    12(%ebp), %eax
        movl    %eax, %ecx
        andl    $-16711936, %ecx
        xorl    %ebx, %ebx
        shldl   $24, %ecx, %ebx
        sall    $24, %ecx
        andl    $16711935, %eax
        xorl    %edx, %edx
        movl    %ecx, %esi
        orl     %eax, %esi
        movl    %esi, -24(%ebp)
        movl    %ebx, %esi
        orl     %edx, %esi
        movl    %esi, -20(%ebp)
        movl    16(%ebp), %eax
        movl    %eax, -56(%ebp)
        movl    $0, -52(%ebp)
        movl    %esi, %ecx
        imull   -56(%ebp), %ecx
        movl    -24(%ebp), %eax
        mull    -56(%ebp)
        movl    %eax, -32(%ebp)
        addl    %ecx, %edx
        movl    %edx, -28(%ebp)
        movl    $256, %eax
        subl    16(%ebp), %eax
        movl    %eax, -40(%ebp)
        movl    $0, -36(%ebp)
        movl    -48(%ebp), %eax
        andl    $-16711936, %eax
        xorl    %edx, %edx
        shldl   $24, %eax, %edx
        sall    $24, %eax
        movl    -48(%ebp), %esi
        andl    $16711935, %esi
        xorl    %edi, %edi
        orl     %esi, %eax
        orl     %edi, %edx
        movl    -40(%ebp), %ecx
        imull   %edx, %ecx
        mull    -40(%ebp)
        leal    (%ecx,%edx), %edx
        movl    -32(%ebp), %ebx
        movl    -28(%ebp), %esi
        addl    %eax, %ebx
        adcl    %edx, %esi
        movl    %ebx, %edx
        andl    $-16711936, %edx
        xorl    %ecx, %ecx
        shrdl   $8, %ecx, %edx
        shrl    $8, %ecx
        movl    %esi, %ebx
        xorl    %esi, %esi
        movl    %ebx, %eax
        andl    $-16711936, %eax
        orl     %edx, %eax

Here it would have been better to rewrite it as two 32-bit operations. Morale of the story? Always look at the disassembly to verify assumptions!

Aside: It would be possible to make an even better performing implementation that utilized SIMD instructions if we could arrange batched data that was layed out properly. That’s left as an exercise for the reader ๐Ÿ™‚

Advertisements

2 thoughts on “Color interpolation w/ 64-bit multiplications

  1. 32-bit version is faster on both x86 and x64, as compiled with MSVC2010 /Ox (Core i7):

    u32 mix_colors_32(u32 a, u32 b, u32 factor)
    {
    u32 af = 256 – factor;
    u32 bf = factor;

    // u32 abgr -> 0a0g0b0r

    u32 al = (a & 0x00ff00ff);
    u32 ah = (a & 0xff00ff00) >> 8;
    u32 bl = (b & 0x00ff00ff);
    u32 bh = (b & 0xff00ff00) >> 8;

    // -> a_g_b_r_
    u32 ml = (al * af + bl * bf);
    u32 mh = (ah * af + bh * bf);

    // shift & combine back into 32-bit quantity
    u32 result = (mh & 0xff00ff00) | ((ml & 0xff00ff00) >> 8);

    return result;
    }

    So, I guess, the morale of the story should be different ๐Ÿ˜‰

    • Yeah I suspected as much (64-bit muls are usually slow). Didn’t say anything about performance on the scalar version, just that it’s good to check the disassembly to see that you’re getting roughly what you think. Morale still holds.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s