» vm tricks

Friday, 1 February A.D. 2008 @ 10:28 AM

I'd like to share two nifty language implementation tricks I've come across recently. The first is about doing allocation safely and quickly. It's helpful to have read Atomic heap transactions and fine-grain interrupts as background, but the first paragraph of the abstract sets out the issues, so I'll quote it here:

Languages such as Java, ML, Scheme, and Haskell provide automatic storage management, that is, garbage collection. The two fundamental operations performed on a garbage-collected heap are “allocate” and “collect.” Because the heap is in an inconsistent state during these operations, they must be performed atomically. Otherwise, a heap client might access the heap during a time when its fundamental invariants do not hold, corrupting the heap.

When the paper talks about “atomic” operations, they aren't talking about the usual atomic memory references accomplished by low-level CPU instructions. They're talking about atomic-ness with regard to interrupts and such--if you're in the middle of an allocation sequence, and you receive an interrupt and attempt to run a handler that wants to do some allocation of its own, you're going to be in trouble. So you have to find some way to make the first allocation uninterruptible. (C chooses to punt on this; you're simply expected to not call certain functions--such as malloc--while you're in signal handlers. I think having conventions like that can be appropriate in other contexts, but asking people to avoid allocation in a high-level GC'd language is a bit harder.)

One way to do this is the way SBCL does it. A bit is set prior to beginning an allocation; call this the pseudo-atomic bit. The allocation then takes place in the usual pointer-bumping manner. After the allocation has been performed, GC'ing if necessary, and the object has been properly initialized, the pseudo-atomic bit is cleared and another bit, the pseudo-atomic interrupted bit is checked. This bit is set by the C runtime if a signal comes in while the pseudo-atomic bit is set; information about the signal is then saved for later use.

If the pseudo-atomic interrupted bit is set at the end of the allocation sequence, then we need to handle all the pending signals that have accumulated while we were doing the allocation. You can see this come up if you use SB-SPROF; a good number of samples will probably point at the instruction at the end of a pseudo-atomic sequence, because that's when the profiling signal finally got handled.

Code is helpful here. A typical allocation for a cons cell on x86 SBCL looks like:

; 0ACCD727:       800DB403100104   OR BYTE PTR [#x11003B4], 4
;       2E:       B808000000       MOV EAX, 8
;       33:       030544AB0708     ADD EAX, [#x807AB44]       ; boxed_region
;       39:       3B0548AB0708     CMP EAX, [#x807AB48]
;       3F:       7607             JBE L0
;       41:       E8563F39FD       CALL #x806169C             ; alloc_overflow_eax
;       46:       EB09             JMP L1
;       48: L0:   890544AB0708     MOV [#x807AB44], EAX       ; boxed_region
;       4E:       83E808           SUB EAX, 8
;       51: L1:   8D4003           LEA EAX, [EAX+3]
;       54:       8035B403100104   XOR BYTE PTR [#x11003B4], 4
;       5B:       7402             JEQ L2
;       5D:       CC09             BREAK 9                    ; pending interrupt trap
;       5F: L2:   

Let's step through this:

; 0ACCD727:       800DB403100104   OR BYTE PTR [#x11003B4], 4

We first set the pseudo-atomic bit.

;       2E:       B808000000       MOV EAX, 8
;       33:       030544AB0708     ADD EAX, [#x807AB44]       ; boxed_region

We need to allocate 8 bytes for a cons cell, so we load up 8 and add it to some memory address. This memory address corresponds to the C symbol boxed_region, which is a structure that describes an allocation region inside SBCL. The first member is the allocation pointer and the second member is the end of the allocation region. EAX now holds the address of the end of the object, which must not lie outside of the nursery...

;       39:       3B0548AB0708     CMP EAX, [#x807AB48]
;       3F:       7607             JBE L0
;       41:       E8563F39FD       CALL #x806169C             ; alloc_overflow_eax
;       46:       EB09             JMP L1

...which is precisely what these instructions handle. We compare EAX to the end of the allocation region and call into the garbage collector if necessary. After the garbage collector has been called, the runtime will ensure that 8 bytes are allocated to EAX, which means that we can skip over the next bit of code:

;       4E:       83E808           SUB EAX, 8
;       51: L1:   8D4003           LEA EAX, [EAX+3]

Remember, EAX held the address of the end of the object, so we have to subtract off 8 bytes to get back to the beginning of the object. It'd be nice if we could combine the SUB and the LEA, but that's not possible, since the LEA is a jump target. What's the LEA doing, anyway? It's tagging the pointer in EAX as a cons cell; discussing tags in SBCL is outside the scope of this blog entry.

;       54:       8035B403100104   XOR BYTE PTR [#x11003B4], 4
;       5B:       7402             JEQ L2
;       5D:       CC09             BREAK 9                    ; pending interrupt trap
;       5F: L2:   

Finally, the XOR clears the pseudo-atomic bit and checks via flags whether the pseudo-atomic interrupted bit was set. If it was, then we need to trap and handle pending signals; otherwise, we can proceed with normal execution.

If you've read about how fast allocation can be in garbage collected languages, you may be thinking that this looks pretty slow (although it's still shorter than the fast path through malloc, I'd imagine). You've got multiple memory reads and writes, a couple conditional jumps...it looks pretty ugly. The situation is better on register-rich architectures where the allocation pointer is held in a register, but there's quite a bit of overhead.

However, there are two advantages of this scheme: signals can be handled at more-or-less arbitrary points in the program, so there's no checking for safepoints and the latency for handling signals is minimized. The latter property comes in particularly handy for dealing with native threads; stopping the world for garbage collection is handled entirely with signals. I'm not entirely sure if this is the right way to go about things--the signal handling code in the runtime is fairly complicated and allocations are slower than they could be--but that's the way the system is.

A different way of doing it is shown by the OCaml runtime. Its allocation sequence on x86 for 8 bytes is much shorter. From asmrun/i386.S in the OCaml source tree:

G(caml_alloc1):
        PROFILE_CAML
        movl    G(caml_young_ptr), %eax
        subl    $8, %eax
        movl    %eax, G(caml_young_ptr)
        cmpl    G(caml_young_limit), %eax
        jb      LBL(100)
        ret

This is quite a bit shorter! This is your basic pointer bumping algorithm; caml_young_ptr is the allocation pointer and caml_young_limit serves as the end of the nursery. If we overflow the nursery, we jump to LBL(100), which calls into the garbage collector and then retries the allocation But there are two tricks that contribute to making this shorter than SBCL's allocation sequence.

The first trick employed here is that OCaml organizes the nursery slightly differently. SBCL's nursery grows from low to high addresses, which requires some extra legwork to determine whether an allocation overflows the nursery or not. OCaml's, on the other hand, grows from high to low addresses and requires less fiddling. (This is similar to why you want to try to write loops that count down towards 0, rather than count up towards N.)

But our experiences with SBCL and signals should have us asking: what about atomicness? What happens with signals here? And this leads us to our second trick: when a signal arrives, OCaml records whatever information it needs to handle the signal and then sets caml_young_ptr to caml_young_limit. So the next allocation that comes along will find that there is no space left in the nursery and call into the garbage collector, which also handles processing any deferred signals.

It is left as an exercise for the reader to verify that signals occurring during the allocation sequence do not corrupt the heap.

According to the terminology of the paper, OCaml uses safepoints; the end of every allocation is a safepoint. Allocations don't have to worry about pseudo-atomic, cutting instructions, and since OCaml is a mostly-functional language, allocations should occur fairly often, thus delivering reasonable latency on handling signals. The disadvantage is that signals are going to be delayed, so tricks like interrupt-driven profiling need to be baked into the runtime, rather than being written directly in OCaml.

Obviously, the authors of OCaml and the authors of SBCL (and CMUCL) envisioned their systems being used for very different things. CMUCL was probably designed to do all sorts of grotty systems programming directly and therefore required a system that could handle interrupts at virtually any instruction with minimal latency. OCaml's designers probably expected it to be used less for systems programming and more for application development, so they optimized their runtime along those lines instead.

It is left as an exercise for the reader to convert SBCL to use OCaml's system. Doing OCaml's first trick of organizing the nursery differently should be straightforward; doing the second trick of removing pseudo-atomic somewhat less so. I pointed out above that SBCL's thread support relies heavily on signals. Since signals could be delayed indefinitely with the OCaml scheme (more likely in Common Lisp than in OCaml), something needs to be done to ensure that all threads rendezvous for garbage collection in a timely manner.

This observation leads us to the second implementation trick.

I mentioned that OCaml uses safepoints. These are sometimes called yieldpoints, most notably in the Jikes RVM (nee Jalapeno) system. The idea is similar, but instead of waiting for the next allocation, which could be delayed indefinitely, put a check at every backward branch and function return to see if a signal has arrived. We will always hit one of these fairly quickly--even in a non-allocating loop. The heap's state is also guaranteed to be consistent at the point and we can safely run the garbage collector, signal handlers, etc.

Reading about this several years ago, I thought, “You're going to add an extra memory read and a conditional jump to every backwards branch? This is going to kill performance!” But many language implementers think this is a reasonable strategy. (The compiler, could, of course, choose to elide the yieldpoint checks in speed-critical code.) A number of native code JVMs use this strategy. I believe SML/NJ used to use this system, although for a different reason: at backwards branches, it recorded which registers contained pointer data to assist the garbage collector in discovering roots.

I discovered that the HotSpot Java compiler uses yieldpoints today, while reading an excellent blog post. But HotSpot applies a small optimization; a yieldpoint at the end of an infinite loop looks like (on x86-64):

28C26E6   test         dword ptr [160000h],eax
28C26EC   jmp          28C2690

Wait, wait, where's the comparison to see if we need to go handle signals? I'll let the aforementioned post explain:

This seemingly useless test is part of a mechanism used by the VM to suspend the thread at this instruction (a safepoint). When the VM wants to suspend all threads (for a GC) it unmaps the safepoint polling memory page (in this case at 0x160000) and waits for all threads to suspend. Each thread running compiled Java code will eventually run this instruction and cause a page fault, inside the page fault handler it is detected that a safepoint thread suspend is requested and the thread calls the VM to suspend itself.

What initially looked to be a test and branch (and possibly more code to properly jump into the signal dispatcher) added to every backwards branch has been reduced to a single test through the wonder of virtual memory and page protection. Very cute.