» Thursday, 1 October A.D. 2009

base32

Faré was asking me the other day on #lisp whether Ironclad supported Tiger Tree Hashes. I said that I didn't even know what those were, and he pointed me to the Tree Hash Exchange Format; Tiger is commonly used for the underlying hash algorithm. I thought it would be easy enough to implement, so I fiddled around with an implementation.

When I went to try to verify my implementation against the test vectors given in the above specification, however, I ran into trouble. The test vectors are given as some weird jumble of letter and numbers and it's not indicated exactly how they are encoded. Apparently hex was considered too obvious. Googling for other implementations indicated that the encoding was probably Base32, so I set out to implement at least a Base32 encoder in Lisp.

If you've ever tried to implement Base32 (or close relatives, such as Base64 or uuencode), you know there there's a lot of bit-fiddling. I attempted to make it as pretty as possible with ldb and dpb, but midway through the main encoding loop, I found myself unable to read my own code. Debugging all those magical constants wasn't going to be any fun either.

I sat back and reconsidered the problem. What you're really doing in Base32 and similar encodings is combining bytes into a big integer and then slicing that integer up in a slightly different way. Judging from implementations I have seen, stuffing all the bytes into one big integer is done for efficiency reasons, avoiding extraneous operations, and all that. But I didn't really care about efficiency (that could come later); I just wanted to make sure my implementation worked so I could get to the business of comparing to oddly-specified test vectors. And this is Lisp, anyway; we always say we can tune things for efficiency later.

After a moment of savoring the thought of giving myself permission to do things in a reasonably inefficient way and avoiding lots of bit-fiddling, I turned out:

(defun octets-to-base32 (octets)
  (let ((base32-chars "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"))
    (with-output-to-string (stream)
      (flet ((gather (start n)
               (loop with bits = 0
                  for j from 0 below n
                  do (setf bits (logior (ash bits 8)
                                        (aref octets (+ start j))))
                  finally (return bits)))
             (output (blob n-bits)
               (loop for i from (- n-bits 5) downto 0 by 5
                  do (write-char (aref base32-chars (ldb (byte 5 i) blob))
                                 stream))))
        (loop with length = (length octets)
           for i from 0 by 5
           for j from 0 below (truncate length 5)
           do (output (gather i 5) 40)
           finally
           (unless (= i length)
             (let* ((remaining-octets (- length i))
                    (remaining-bits (* remaining-octets 8))
                    (blob (gather i remaining-octets))
                    (n-bits (* (ceiling remaining-bits 5) 5)))
               (output (ash blob (- n-bits remaining-bits)) n-bits))))))))

which was about three times shorter than other Base32 implementations I'd seen. Who cares if it was three times slower? (I didn't benchmark it to check.)

Next time: implementing the tree hasher with Ironclad and some thoughts on problems with Ironclad's API design.

posted by Nate @ 8:37PM