Hash Table Shootout!
It’s gratifying to see GHashTable faring not-too-badly against an assortment of other hash table implementations in Nick Welch’s excellent benchmarks. I did some work on this previously, but there’s at least one thing that can still be done to reduce its memory footprint on 64-bit architectures. The table is a big array of structs that look like this:
struct _GHashNode
{
gpointer key;
gpointer value;
guint key_hash;
};
The key_hash is useful, as it lets us skip over mismatches and resize the hash table quickly - and on a 32-bit arch, it adds only 4 bytes to every entry, for a total of 12 bytes. However, on a 64-bit arch, it causes entries to be padded so that each entry starts on a multiple of 8 bytes. That wastefulness can be remedied by packing key_hash values in a separate array - as a bonus, entry size becomes a power of two, which means offsets into the array can be calculated using a single shift. On the downside, we’ll incur the added overhead of maintaining two arrays and accessing an extra cache line for each lookup. I suspect it’s still worth it, though.
A couple of open questions/comments on the benchmarks themselves:
- Google’s dense_hash_map is suspiciously fast at integer deletion - is it shrinking the hash table at all, re-inserting the remaining items as it should? Or does it have some other trick up its sleeve?
- How robust are the different strategies with respect to poorly distributed keys? E.g. N*2^M spaced integers: 1024, 2048, 3072, 4096, etc.
- How about poor hash functions supplied by the API user? GHashTable attempts to tackle this by calculating the initial array index modulo a prime number, before applying a quadratic modulo on subsequent probes (faster, and required for quadratic probing).
- What’s the per-table overhead? How much memory will, say, some 100.000 tables with <100 items each consume? This is not uncommon in practice - nested hashes are often used to store complex properties, and given suitably large working sets, this can become a significant factor for some programs.
- Are the approaches affected by memory fragmentation? Do they cause it? This is hard to measure; maybe many tables could be grown simultaneously in the same address space.
Nice report indeed, and great to see glib’s hash tables are top ranking…
… but (yes, there’s always a but)… that guy didn’t compare the overwhelming usage which is done of hash tables, that is searching for entries. Various parts of GObject make heavy use of it (GDataList, GParamSpec), so does evolution.
I’d love to see some reports/pointers/ideas on how retrieval could be speeded up (if possible).
Edward: Both insertion and deletion also imply a search operation (to find the element that is being inserted or deleted). But you’re right in that maybe searches should be weighted differently relative to the other operations.
You ask if google’s hash tables actually shrink the table on deletion: the answer is no. But it is shrinked on first insertion after all deletes. Smart isn’t it?
Quoted from http://google-sparsehash.googlecode.com/svn/trunk/doc/implementation.html :
The “too empty” value is not necessary for performance but helps with space use. It’s rare for hashtable implementations to check this value at insert() time — after all, how will inserting cause a hashtable to get too small? However, the sparse_hash_set implementation never resizes on erase(); it’s nice to have an erase() that does not invalidate iterators. Thus, the first insert() after a long string of erase()s could well trigger a hashtable shrink.
Actually, as long as glib uses void* for generic typing, it will suffer performance problems, both in speed and in memory, when the key is simple (e.g. an integer), because glib has to waste the space for a pointer and retrieving the value from the pointer also causes overhead. If performance is a concern, glib is not a good choice.
attractivechaos: If you’re storing e.g. 32-bit integers on a 64-bit platform, that’s true. GHashTable is a generic implementation that allows you to hash complex data structures if you wish; you could always write a faster, specialized implementation for a single data type.
In fact, once you’re rolling your own, you might as well compile it directly into your application and inline function calls as appropriate. I expect that’d make it faster still.
However, most C application developers prefer not to spend their time on that, and the numbers show that for them, GHashTable strikes a very acceptable balance between performance and convenience.