Picture by Mike Bitzenhofer. License:
Because I’m just as cool as Luis.
For some time now, I’ve been considering the possibility of tightening the bolts on GHashTable, turning the idea over and over in my mind but not having had a chance to put it into practice. This can be pretty frustrating, since it’s hard for me to stop thinking about a salient idea like that without at least investing the work needed to convince myself that it’s not worth doing. It’s like an unidentified glimmer at the bottom of a lake. A silver hoard or an old bicycle? Probably the latter. But who knows?
So today I finally dove in and brought back some results.
First off, why optimize GHashTable? A hash table is a trivial data structure, so it should already be pretty optimal, right?
Let’s take a look.
As we know, any hash table will have collisions, where different keys hash to the same value. GHashTable resolves such collisions using chaining; each bucket points to the head of a linked list of elements that hash to that location. This is simple and robust, but it also has some disadvantages: One is poor locality of reference (oh, the locality!) and another is that you have to allocate and free memory, contributing to memory fragmentation over time. You also lose some memory to the extra pointers stored (one per item plus one per chain).
To improve on this, we can use open addressing: Instead of storing pointers to lists of items, we store the items in the table itself. Whenever there’s a collision, we use an algorithm to determine another suitable slot for the item, and if that slot is also occupied, we run the algorithm again until an empty slot is found. This is called probing and is obviously a more complicated business, considering that we want a) a relatively simple algorithm, b) a predictably low amount of probing and c) the ability to quickly remove items without affecting the probe sequences for other items in the table.
The simplest possible probing algorithm is called linear probing, and consists of just looking at the next slot in the table (modulo the table size) until you find what you’re looking for. Since this also has the best locality, it’s the first approach I tried. Unfortunately, due to clustering, the performance degrades too much before you reach a satisfactory load factor, even with good locality.
Quadratic probing is the next level of complexity and the one I settled with. Here, you use a quadratic polynomial to calculate the next location, which means you take bigger and bigger steps, avoiding clustering. This also affects cache performance somewhat, but considering that the few first steps are small, it’s still much better than chaining.
Since an element can be part of many probe sequences, you have to leave a special marker - commonly called a tombstone - when you remove it, so lookups don’t terminate prematurely. This isn’t a big problem, since tombstones can be reused, but you have to clean them up now and then so you don’t get too many - according to my benchmarks, this will almost never happen in GNOME apps, so shouldn’t affect performance much.
A bigger problem is that in order to guarantee that a probe will find an empty slot with load factors greater than 0.5, the table’s size has to be a power of two, which means the modulo operation will simply discard the high bits of the computed hash values. GHashTable currently uses prime-sized tables, which yields a better distribution with poor hash functions. On the other hand, taking a modulo of a power of two is a logical AND operation with a mask, which is much faster than the implied division required by prime modulos. Anyway - again, I couldn’t find any existing use cases where this caused performance problems.
Which brings me to the fact that this exercise would be meaningless without proper benchmarking. Now, I admit that due a lack of time and good tools, I haven’t been able to measure the effect on fragmentation, but it’s fairly obvious that allocating one big chunk of memory is better than one big chunk and a variable number of small chunks.
Apart from reducing fragmentation, which was my primary goal, it would be nice if speed and memory use wasn’t affected too badly. And that at least, I’ve measured.
Since synthetic tests are rarely representative, my test case was “Starting up Evolution offline with 3 gigabytes of local mail and 13 vfolders, waiting for it to finish setting up the vfolders, then quitting by closing the window with no interaction otherwise”. I measured memory use before quitting, using pmap, and total CPU time spent, using time. I realize you can’t repeat this test, it being tied to my local mail setup, but tests are rarely repeatable across computers anyway, since they have differing specs and configurations. Get the patch from the bug I filed, and run your own tests. I’d love to hear about your results!
Comparing the best out of 5 Evolution runs with old and new hash table code yielded the following numbers:
- Time on CPU (user+sys): 956.3s, down from 1023.1s (6.5% improvement)
- Private, writable memory: 623.5MiB, down from 645.8MiB (3.5% improvement)
- Heap size ([heap] mapping): 310.6MiB, down from 350.0MiB (11.3% improvement)
It looks modest, but considering hash tables account for only a small part of Evolution’s resource usage, I’m rather pleased with this. It cuts memory use of most GNOME processes by a slight amount - for instance, the GNOME calculator in scientific mode gave up 140KiB of private, writable memory. You may see greater returns in apps that use Glade, because it keeps widget lookup tables around.