All programs work with information. To the programmer, data takes many forms: a list, a table, an image, an essay, an album. To the computer, it all looks basically alike: it’s just a bunch of bytes in memory. Imagine a single column of a spreadsheet, each of the cells filled with a number from 0 to 255. All your stuff is scattered across that column, along with bookkeeping to track what lives where. And every interaction with the computer goes through that memory. Pressed a key on your keyboard? Stick a 97 in cell 5,342,387,264, please.
The basic bookkeeping is handled by a program called a “malloc” (for memory allocator). Malloc marks parts of the column as available or in use (using more entries in the same column!). If you need, say, 10 bytes of scratch space for something, call ptr = malloc(10), and receive a pointer to those ten bytes – an index into the column. If the pointer is row 1000, you may put whatever you please into cells 1000 to 1009 and read them back later. When you’re done, call free(ptr) and that region can be re-used somewhere else.
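To make that round trip concrete, here’s a minimal C sketch (the values are arbitrary, and real code would do something more useful with the bytes):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Ask the allocator for ten bytes of scratch space.
    unsigned char *ptr = malloc(10);
    if (ptr == NULL) return 1;          // the allocator can say no

    // We now own ptr[0] through ptr[9] and can read them back later.
    for (int i = 0; i < 10; i++)
        ptr[i] = 97;                    // 97 is the byte for 'a'
    printf("%c\n", ptr[0]);

    // Done: hand the region back so it can be reused somewhere else.
    free(ptr);
    return 0;
}
```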
In a low level language like C, this is all you get. It’s still pretty far from dictionaries and tables, and if you want those things you have to build them up yourself using pointers that point to other pointers. High level languages hide those details: a Python list conceptually contains other Python objects, not pointers telling you how to find those objects, even if that’s how the list works internally. Allocating memory for objects is pretty easy: we can just call malloc during creation. The hard part is knowing when to free that memory: you don’t tell Python “hey, I’m done with this list”; you just stop using it at some point, and someone has to clean up after you.
(It’s possible but dangerous to free everything yourself. If you forget to release an object – or use one after saying you were done with it – the consequences can be hard to trace. Most serious security bugs are faults of this kind. Providing protection more-or-less amounts to the same problem as automatic cleanup. So high level languages just track objects for you.)
There are basically two major approaches to this problem: (tracing) garbage collection, or GC, and reference counting, or RC.¹ GC is like having a butler visit you every day to pick up empty crisp packets and dirty laundry: he’ll start by freezing you in action (so you can’t grab something he’s cleaning), make a note of everything you’re actively using (so as not to interrupt your work) and tidy everything else away. That’s called “mark and sweep”. Aside from the occasional pauses – which modern GCs make hardly noticeable – you are free to litter as you please with little impact. RC, in contrast, is more like clearing up as you go along, taking a trip to the bin each time your packet empties.
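To give a feel for the bookkeeping RC implies, here is a minimal sketch in C; the Object type and function names are made up for illustration, not anything from Raven’s runtime:

```c
#include <stdlib.h>

// A hypothetical counted object: the count travels with the data.
typedef struct {
    long refcount;
    // ... the object's actual contents would go here ...
} Object;

Object *obj_new(void) {
    Object *o = malloc(sizeof(Object));
    o->refcount = 1;            // the creator holds the first reference
    return o;
}

void obj_retain(Object *o) {
    o->refcount++;              // someone else now points at the object
}

void obj_release(Object *o) {
    if (--o->refcount == 0)     // the last reference just went away...
        free(o);                // ...so take the packet to the bin now
}
```

Every copy of a reference implies a retain, and every reference going out of scope implies a release; much of the rest of this post is about how little of that work you can get away with doing.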
Naive RC has its downsides: clearing everything in one batch can be quicker than doing lots of small tasks. Nonetheless RC feels like the better premise to me, because it’s so much more resource-efficient. Imagine you only have two mugs in your kitchen: under GC, you’ll either have to invite your butler over more often or stick to two drinks a day. If you instead clean up immediately you can have as many as you please. RC also lets you reuse things in-place. Finished your drink and want a new one? Skip cleaning and reuse the cup! This is helpful for a functional language, which would otherwise make a lot of copies. (The more typical approach of explicit mutation/reuse is arguably a kind of manual memory management, and comes with analogous problems – but that’s a topic for another time.)
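The reuse trick falls straight out of the count: if it reads 1, nobody else can tell whether you copied or mutated. A rough illustration in C, using a hypothetical refcounted array type (error handling omitted):

```c
#include <stdlib.h>
#include <string.h>

// A hypothetical refcounted array of doubles.
typedef struct {
    long refcount;
    size_t length;
    double *data;
} Array;

// Functional-style "update": conceptually returns a new array with one
// element changed. The caller's reference to `a` is consumed.
Array *array_set(Array *a, size_t i, double value) {
    if (a->refcount == 1) {          // sole owner: nobody can observe the
        a->data[i] = value;          // mutation, so reuse the allocation
        return a;
    }
    Array *copy = malloc(sizeof(Array));   // shared: make a private copy
    copy->refcount = 1;
    copy->length = a->length;
    copy->data = malloc(a->length * sizeof(double));
    memcpy(copy->data, a->data, a->length * sizeof(double));
    copy->data[i] = value;
    a->refcount--;                   // drop our reference to the original
    return copy;
}
```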
Developers are used to treating memory as infinite – even phones regularly have over eight billion cells in their spreadsheet – but computing has plenty of coffee cups that must be used sparingly, including file handles and GPU memory. Because GC handles these poorly, many languages fall back to unreliable manual management. This seems a poor state of affairs: I’m not convinced CPU memory is such a special case, and don’t think a resource-management solution that only works for one kind of resource is truly solving the problem.
Despite its elegance, RC gets a bad rap, often spoken about almost as a failed approach. I think this is mostly for historical reasons. RC is an undeniably poor fit for the object-oriented vogue of the nineties, which inspired the languages still used by the vast majority of programmers today: Java, JavaScript, Python, C#. Those systems encourage lots of short-lived allocations for small objects (say, complex numbers), which exacerbates any bookkeeping overhead. Python’s performance has been hamstrung by its use of RC, and it is notably the slowest of the bunch (though it also shows the benefits: Python became the language of machine learning partly because it better suits GPU memory).
So GCs had a head start, and got steadily better as companies worked around the language problems. For example, modern collectors are often “generational”, which means a region called the “nursery” is reserved for fast allocation of all those small objects, in effect treating them as a unit. Since these baby allocations are short-lived, you can destroy the whole nursery at once rather than individually… ok, hold on, can I have a quick chat with whoever chose this metaphor? But you get the idea, it’s faster.
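For intuition, a nursery is usually little more than a bump allocator over one contiguous block. This toy C version (not any particular VM’s implementation) shows why both allocation and the wholesale reset are so cheap:

```c
#include <stddef.h>

// A toy nursery: one contiguous region plus a cursor. Allocating is just
// bumping the cursor; "collecting" the nursery (after any survivors have
// been moved elsewhere) is resetting it.
typedef struct {
    unsigned char *start;
    unsigned char *cursor;
    unsigned char *end;
} Nursery;

// Returns NULL when the nursery is full, signalling a minor collection.
void *nursery_alloc(Nursery *n, size_t size) {
    if ((size_t)(n->end - n->cursor) < size) return NULL;
    void *p = n->cursor;
    n->cursor += size;
    return p;
}

void nursery_reset(Nursery *n) {
    n->cursor = n->start;   // every baby allocation dies at once
}
```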
The point is that GC’s advantage isn’t necessarily inherent: critics compare a concurrent, generational, tri-colour, moving, mark-and-sweep collector to the kind of RC you’d make in school. Naive GC is not that great either! The burden of proof is of course on RC proponents, but I think we can do a lot better.
For example: those OO languages are typically a poor fit for static analysis, but even if they weren’t, it would be little help to GC. Conversely, Raven is designed for analysis, and that’s useful for RC. That starts with inferring types and stack-allocating most objects (you’ll spend less time clearing up if you’re less of a litterbug to begin with). More generally we can make pretty good guesses about object lifetimes, and use that to do some of the bookkeeping at compile time. Think of Rust’s borrow checker, akin to a compile-time RC where the count cannot exceed 1.² You can imagine making this check implicit, and where it fails you fall back to RC (more or less as Rust programmers do by hand anyway).
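To make “bookkeeping at compile time” concrete, here is hand-written C standing in for what a compiler might emit, reusing the hypothetical obj_new/obj_retain/obj_release sketch from earlier; the functions and the analysis are purely illustrative, not Raven’s actual strategy:

```c
double use_once(Object *o);     // assumed not to keep a reference to `o`

// Case 1: analysis proves `o` never escapes this function, so no counting
// is emitted at all, and the free happens at the statically-known last use.
double scratch_work(void) {
    Object *o = obj_new();
    double result = use_once(o);
    free(o);                    // count is known to be 1 here: no check
    return result;
}

// Case 2: `o` is stored somewhere the analysis can't see through, so the
// compiler falls back to ordinary runtime counting.
void publish(Object **slot, Object *o) {
    obj_retain(o);              // the slot now holds a reference of its own
    *slot = o;
}
```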
This approach is explored by a language called Lobster, by fellow GC-basher Wouter van Oortmerssen (“If I’d have to make a top-10 of ‘least elegant algorithms in computer science’, the #1 would be an easy pick”). Lobster implemented a relatively simple ownership heuristic which removed 95% of runtime counts. Ownership annotations then allow for a stricter Rust-like experience where needed. There are still some thorny problems: Lobster specialises functions on ownership status, which might lead to code blowup, and optimises for as many borrowed references as possible (since they don’t need counting at all) when this may make mutation optimisations harder (you need to prove something owned and not borrowed in order to modify it, just as Rust makes explicit). But it seems promising.
The other Achilles’ heel of RC is multithreading. Bumping the count means reading the current value, adding 1, and then writing it to memory again. If you have multiple threads doing this at once, you can end up finding that 2 + 2 = 3. To avoid this, naive RC uses atomic operations, slowing things down further. But you can use a technique that this paper calls “biased reference counting”, in which the count is split between the owning thread (unsynchronised) and all others (synchronised). Since most objects live primarily on a single thread, this significantly reduces the number of atomic operations, resulting in much better performance. The paper Down for the Count explores other relatively simple techniques that can make RC competitive with tracing GC.
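A rough C11 sketch of where the atomics go under biasing; the real algorithm also needs a protocol for merging the two halves and deciding when the object is actually dead, which is elided here:

```c
#include <stdatomic.h>
#include <threads.h>

// A hypothetical biased count: the owning thread gets a plain integer,
// every other thread shares an atomic one.
typedef struct {
    thrd_t owner;              // the thread this object is biased towards
    long biased_count;         // only ever touched by the owner
    atomic_long shared_count;  // touched by everyone else
} BiasedRef;

void retain(BiasedRef *r) {
    if (thrd_equal(thrd_current(), r->owner))
        r->biased_count++;                        // cheap, unsynchronised add
    else
        atomic_fetch_add(&r->shared_count, 1);    // the slower atomic path
}

void release(BiasedRef *r) {
    if (thrd_equal(thrd_current(), r->owner))
        r->biased_count--;
    else
        atomic_fetch_sub(&r->shared_count, 1);
    // Combining the two halves to know when to free the object safely is
    // where the paper's real machinery lives; it is omitted in this sketch.
}
```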
Hybrid approaches are possible too. Wasm is getting GC support which inherits from the already-excellent JavaScript VMs. If throughput is more important than memory overhead, you can use GC allocation and collection, with refcounts only for expensive resources (eg file handles) and for opportunistic mutation. Raven’s type system also gives us a precise idea of what can appear in a data structure, unlike typical OO languages. We could reserve counts for data structures containing file handles (I/O being less performance sensitive anyway) but lean entirely on the GC for numeric arrays, for example. Combined with other performance tooling – eg explicitly-mutable data structures – it should always be possible to reach peak performance in program hotspots, if the compiler can’t do it for you.
This is all relatively vague. Raven’s memory management design is far from fully realised. As it stands, the language uses both fairly naive refcounting and the simplest malloc possible, because getting the basics working is more important than winning microbenchmarks. On the other hand, “we’ll fix it later” is a dangerous statement in system design, and even more so when it comes to performance, so it pays to consider these things early on. I think there is good reason to believe that reference counting will work for Raven – and at the least it’s well worth exploring.
¹ I know some people treat RC as a kind of GC. I’ll refer specifically to the tracing kind, and use “(automatic) memory management” if I need to generalise.

² Naive RC would count both owners and borrows, which is overly restrictive. But proven borrows can be ignored, leaving only owners in the count.