» parallel hex conversion

Sunday, 30 September A.D. 2012 @ 8:46 PM

Today's hack is implementing conversion of octet strings to hex without branches via the wonders of vectorization:

	.type phex,@function
	.globl phex
	movupd (%rdi), %xmm1		/* Get 16 octets.  */
	movdqa %xmm1, %xmm2
	pand low_nibble_mask, %xmm1	/* Low nibbles in %xmm1.  */
	psrlw $4, %xmm2			/* High nibbles in %xmm2.  */
	pand low_nibble_mask, %xmm2

	/* Implement:

           d = nibble + ((nibble > 9) ? 'a' - 10 : '0')

	   vectorly.  */
	movdqa %xmm1, %xmm0		/* Low nibbles first.  */
	pcmpgtb nines_mask, %xmm0	/* Determine addends.  */
	movdqa ascii_zero, %xmm5
	pblendvb ascii_a, %xmm5
	movdqa %xmm2, %xmm0		/* Do the same for high nibbles.  */
	pcmpgtb nines_mask, %xmm0
	movdqa ascii_zero, %xmm6
	pblendvb ascii_a, %xmm6
	paddb %xmm5, %xmm1		/* Add ASCII codes to nibbles.  */
	paddb %xmm6, %xmm2

	/* Re-order and store hex digits.  */
	movdqa %xmm2, %xmm7
	punpckhbw %xmm1, %xmm7
	punpcklbw %xmm1, %xmm2
	movupd %xmm2, (%rsi)
	movupd %xmm7, 16(%rsi)
	.size phex,.-phex

	.section ".rodata"
	.p2align 4
	.fill 16, 1, 0x0f
	.fill 16, 1, 0x09
	.fill 16, 1, '0'
	.fill 16, 1, 'a' - 10

pblendvb is available in SSE 4.1; the core of the function can be done with pure SSE2 like so:

	movdqa %xmm1, %xmm3
	movdqa %xmm2, %xmm4
	pcmpgtb nines_mask, %xmm3
	pcmpgtb nines_mask, %xmm4
	movdqa %xmm3, %xmm5
	movdqa %xmm4, %xmm6
	pand ascii_a, %xmm5
	pand ascii_a, %xmm6
	pandn ascii_zero, %xmm3
	pandn ascii_zero, %xmm4
	por %xmm3, %xmm5
	por %xmm4, %xmm6
	paddb %xmm5, %xmm1
	paddb %xmm6, %xmm2

Fortunately, the pblendvb version is slightly faster (~5%). All in all, this is about 3.5x faster than the obvious looping method with a lookup table. Doing it with branches is a bad idea, as the branches predict poorly.

I don't know of any applications that are constrained by their ability to convert octet strings to hex, but doing a branchless vectorized algorithm sure was fun.