» Saturday, 10 October A.D. 2009

tthsum

I promised recently that I would discuss an implementataion of tree hashing. Without further ado, here is the version I whipped up as a candidate for inclusion into Ironclad:

(defparameter *leaf-byte* (make-array 1 :element-type '(unsigned-byte 8)
                                      :initial-element 0))
(defparameter *internal-byte* (make-array 1 :element-type '(unsigned-byte 8)
                                          :initial-element 1))

(defun tree-hash (digest-name stream block-size)
  (let ((digest (crypto:make-digest digest-name))
        (buffer (make-array block-size :element-type '(unsigned-byte 8)))
        (stack nil)
        (n-blocks 0))
    (labels ((leaf-hash (amount)
               (reinitialize-instance digest)
               (crypto:update-digest digest *leaf-byte*)
               (crypto:update-digest digest buffer :end amount)
               (crypto:produce-digest digest))
             (internal-hash (b1 b2)
               (reinitialize-instance digest)
               (crypto:update-digest digest *internal-byte*)
               (crypto:update-digest digest b1)
               (crypto:update-digest digest b2)
               (crypto:produce-digest digest))
             (create-internal-node ()
               (let* ((b2 (pop stack))
                      (b1 (pop stack)))
                 (push (internal-hash b1 b2) stack))))
    (loop for amount = (read-sequence buffer stream)
       while (plusp amount)
       do (let ((leaf (leaf-hash amount)))
            (incf n-blocks)
            (push leaf stack)
            (loop for b = n-blocks then (truncate b 2)
               while (evenp b)
               do (create-internal-node)))
       finally
         (return
           (if (null stack)
               (crypto:digest-sequence digest-name *leaf-byte*)
               (loop until (null (cdr stack))
                  do (create-internal-node)
                  finally (return (car stack)))))))))

I leave it as an exercise for the reader to improve the function in several ways:

Along with some of the obvious improvements given above, the reason this isn't going into Ironclad immediately is because it exposes a flaw in Ironclad's API that should be fixed.

In the beginning, Ironclad had its digest interface. You can either do “one-shot” digests from various sources (files, streams, or sequences), or you can create digest objects and incrementally compute digests by feeding in sequences, as done above. These functions are straightforward enough and are consistent with the interfaces for crypto packages in other languages. A minor quibble is that the “one-shot” functions (DIGEST-SEQUENCE, DIGEST-FILE, and DIGEST-STREAM) could be combined into a generic function with different methods.

Along the way, Ironclad picked up an HMAC implementation. The HMAC functions (MAKE-HMAC, UPDATE-HMAC, and PRODUCE-HMAC) are remarkably similar to the digest interface (MAKE-DIGEST, UPDATE-DIGEST, and PRODUCE-DIGEST). This similarity suggests that there's some underlying generic functions that should be used for both interfaces--at least for the update/produce bits. I didn't change the interface because I suspected the HMAC functionality would be somewhat less used than the digest functionality. The same argument was used when the CMAC bits were added.

But I found as I added other bits and pieces that missing those generic functions was unfortunate. I've also found that having the digest functions be so specific produced slightly wonky code: you can see above where I've written (update-digest digest ...). I don't think that reads particularly well. So I've been considering the idea of unifying the HMAC and digest interfaces for some time.

I think it was in Martin Fowler's Refactoring that I heard this rule: if you need to do something twice, you cut-and-paste and grit your teeth about it. If you wind up needing the same thing a third time, then you should refactor. The tree hash stuff above is very similar to the existing digest interface and we already have the *MAC interfaces that could benefit from similar treatment. That's the three cases staring us in the face. So we should do some renaming/reworking of the existing interface.

But what should the names of the new generic functions be? I'd prefer to not have them be just UDPATE and PRODUCE. When I named the functions originally, I was attempting to follow the guidelines in the Tutorial on Good Lisp Programming Style (which you should read about once a year, if not more often), specifically the bit on using verb-object conventions for functions. While I think using UPDATE and PRODUCE would result in readable code--especially for UPDATE--it's not so obvious what you're producing. We need some kind of object for our generic function name. I have not thought about the problem intensely, so the best I've been able to come up with is something lame like UPDATE-STATE, which gets the idea across, but doesn't translate nicely to PRODUCE-STATE. The underlying functions in the digest interface for the individual digests use FINALIZE-FOO-STATE, which conveys the idea that the digest can't be used anymore after calling this function (correct), but it sounds like you're not getting anything back from the function. So I'm not overly fond of it for an external interface.

Suggestions welcome. Naming is hard! Maybe I should just be calling things “hashes” and then I could follow Python's hashlib module and use UPDATE and simply DIGEST, with no concern for mixing my metaphors, as it were. (If I'm going to modify the external API, I'd likely also change things so that &optional arguments get used instead of &key arguments where appropriate; UPDATE-DIGEST really acts more like subseq than, say find. But that's orthogonal to the naming discussion above.)

posted by Nate @ 8:49PM