» Wednesday, 19 March A.D. 2008
new trees release
Long ago, when I was still tweaking my blog software, I decided I wanted to eschew databases. Instead, I wrote a simple binary tree package for an in-memory database of sorts (my blog was getting to large to deal with linear lists and I wanted things sorted by date at the very least). It worked for me and didn't seem to erase my hard drive, so I released it to the world as TREES. I marveled at my contribution to Lisp software.
However, over the years, bug reports came trickling in, mostly saying, “Hey, I tried AVL trees and deleting elements from AVL trees goes into an infinite loop.” Like a negligent maintainer, I paid little attention to the reports, as I used red-black trees and they Worked For Me (tm). Not the greatest contribution to Lisp, eh?
I recently decided that life is too short to have buggy software with one's name floating around. Over the past month or so, I have been rewriting TREES and released version 0.10 yesterday, with improved interfaces, more efficient code, and a real testsuite. I thought I'd share a few lessons from my grand rewrite:
- Follow established Common Lisp conventions. Erik Naggum wrote a post about this sort of thing a while back, and I have tried to follow his suggestions. But I also applied existing conventions in other ways too: the lambda list for making a binary tree prior to the rewrite was something ugly like (tree-kind &key compfun eqfun keyfun). After thinking about established convention and how binary trees really work, it's now (tree-kind pred &key key test), which seems more obvious to my mind, as well as saying, “an ordering for binary trees is not an optional feature.”
- Also, if you're going to write DO-style macros, make sure the body is surrounded by tagbody. I learned about this nifty feature recently--that one can place go tags in the body of such macros and use them to implement things like C's continue. Very handy.
- Randomized testing is a good thing. I built several million random trees of varying sizes while testing the whole thing and doing so quickly ferreted out bugs in my initial implementations. In later stages, I'd leave my computer churning away overnight to ensure I hadn't broken anything.
- As a corollary, make sure you have some way of replaying the building of a randomized thing so you can debug your code when things go pear-shaped.
- I discovered a practical use for non-standard method combinations: I implemented a VALIDATE-TREE function that checks invariants for different kinds of binary trees. But when checking, say, the invariants applying to an AVL tree, one should also check the invariants applying to basic binary trees, preferably without code duplication. Solution? An AND method combination: both the basic- and AVL-specific methods get run and their return values are combined for you.
- Profile your test suite. The faster your test suite goes, the more testing you can do--especially when doing randomized testing. I sped my tests (at an early stage) up by a factor of 30 after profiling. A quirk in how SBCL implements type declarations for WITH clauses in loop inspired me to rewrite how I kept track of some things in my testing, leading to a ~20x speedup. And using (lambda (x y) (< x y)) is faster than (function <) when you know you're going to be dealing with fixnum data; that SBCL-specific tweak lead to another factor of ~1.5x.
- In the same vein: structures are faster than classes. Really. I knew this already, but I didn't have a good feel for how much those accesses-as-function calls cost me. Tests sped up by ~20% when I reimplemented the tree nodes as structures rather than classes. I'm still on the fence whether to reimplement the tree objects themselves as structures.
- Functional programming can be a great substitute for generic
programming. The previous version of TREES tried to be
too clever, handling insertion into an empty tree as a special case
and delegating a good chunk of work to a method called TREE-INSERT-NONEMPTY or somesuch, which handled rebalancing. A
similar tack was done for deletion. I didn't understand my intentions
for TREE-INSERT-NONEMPTY after coming back to the code, so
I started cutting and pasting, hoping idioms would jump out at me after
I had done two or three binary tree variants.
And they did. What emerged was having generic code handle insertion and deletion and then called a passed-in function to handle rebalancing. It's quite a bit cleaner and more understandable than doing things with methods, in my opinion.
I still have things to fix; I took out DO-TREE-RANGE and WITH-TREE-ITERATOR because my code didn't need them, but they should probably go back in--especially WITH-TREE-ITERATOR. But I wanted to get the code out the door; two years between releases is long enough!
posted by Nate @ 8:44PM