My last paragraph says you're probably better off with a solution that doesn't achieve the same density but avoids the uncorrelated accesses. In other words, stick to some form of local probing and take the hit in density.
Why would TLB misses be small compared to RAM latency? A TLB miss must be serviced by reading more memory, and misses aren't handled in parallel, unlike random reads to RAM. Sure, you could use very large (1G) pages, but that's a pretty specialised setup that's not available on every platform and tends to require a reboot to enable/tweak. Not something we want to rely on in general.
Cuckoo is particularly sensitive to bad hash functions: if a few elements always hash to the same pair of values, you're screwed. That's particularly problematic with the usual interfaces, which don't let the hash table pass a seed to the hash function and just expect a machine word back: we have to map values to hashes, then remix or split those hashes (in theory, that's defensible as long as the intermediate hash is strong and its codomain is at least (hash set size)^2), but there's nothing to remix away if too many values map to the same intermediate hash. If the interface instead has the hash table call the hash function with two different seeds, that means double the time spent hashing, and that overhead can cover for a lot of linear probing.
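To make the split concrete, here's a minimal sketch, assuming a power-of-two table and a strong 64-bit intermediate hash; the splitmix64-style finalizer is one arbitrary choice of remix, and, as above, no remix helps if many keys share the same intermediate hash:

    #include <stdint.h>
    #include <stddef.h>

    /* Splitmix64-style finalizer, used here as the remix step. */
    static uint64_t remix(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    /* Derive both cuckoo bucket indices from one intermediate hash h.
     * Sound only while 2 * log2_size <= 64, i.e. the codomain of h is
     * at least (table size)^2. */
    static void cuckoo_buckets(uint64_t h, unsigned log2_size,
                               size_t *b0, size_t *b1) {
        uint64_t m = remix(h);
        uint64_t mask = ((uint64_t)1 << log2_size) - 1;
        *b0 = (size_t)(m & mask);          /* low bits  */
        *b1 = (size_t)((m >> 32) & mask);  /* high bits */
    }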
Simple deterministic probing techniques just take a hit with a bigger cluster than expected: a performance degradation, but the table still works. You can also find theoretical analyses that lead to similar conclusions if you look at the k-independence needed for each hashing technique.
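For contrast, a bare-bones linear-probing lookup shows why the failure mode is graceful: a bad hash only lengthens the cluster the loop walks, and correctness doesn't depend on the hash at all (EMPTY is an assumed sentinel, and the table is assumed to always keep at least one free slot):

    #include <stdint.h>
    #include <stddef.h>

    #define EMPTY UINT64_MAX  /* assumed sentinel for free slots */

    /* Returns the index of key, or of the empty slot where it would go.
     * A bad hash means a longer walk, never a wrong answer; the loop
     * terminates as long as the table keeps at least one EMPTY slot. */
    static size_t probe(const uint64_t *table, size_t mask,
                        uint64_t key, uint64_t hash) {
        for (size_t i = hash & mask; ; i = (i + 1) & mask) {
            if (table[i] == key || table[i] == EMPTY)
                return i;
        }
    }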
Finally, I don't think gathers are faster than independent memory accesses unless everything is in the same cache line. You don't need SIMD instructions to expose memory-level parallelism; you just need independent dependency chains in your serial computation (until you're bottlenecked on an execution resource that's not duplicated and rarely pipelined, like the TLB miss logic).
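As a sketch of that point: the four loads below form four independent dependency chains, so an out-of-order core can keep all four cache misses in flight at once, no gather instruction required (the table/mask layout is assumed, as in the probing sketch above):

    #include <stdint.h>
    #include <stddef.h>

    /* Four independent loads: no result feeds another load's address,
     * so the core can overlap all four potential cache misses. */
    static void lookup4(const uint64_t *table, size_t mask,
                        const uint64_t idx[4], uint64_t out[4]) {
        out[0] = table[idx[0] & mask];
        out[1] = table[idx[1] & mask];
        out[2] = table[idx[2] & mask];
        out[3] = table[idx[3] & mask];
    }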
My last paragraph says you're probably better off with a solution that doesn't achieve the same density but avoids the uncorrelated accesses. In other words, stick to some form of local probing and take the hit in density.
Thanks, I was misreading.
Why would TLB misses be small compared to RAM latency?
Because for recent CPUs (post-P5 for Intel) the page walks that service a TLB miss use the standard data caching mechanisms, so for a frequently used hash table that reads only a couple of cache lines per lookup, the page tables usually remain in cache: http://electronics.stackexchange.com/a/67985.
So while the TLB miss requires a lookup, this lookup frequently doesn't require hitting RAM. My recollection is that this means a TLB miss usually costs only the relevant cache miss plus ~10 cycles. But this does require certain assumptions about the access pattern, and I've been meaning to retest this on recent hardware to be sure.
Cuckoo is particularly sensitive to bad hash functions: if a few elements always hash to the same pair of values, you're screwed.
Yes, although if you can choose a good hash function this should be rare. And there are variations of cuckoo hashing that are much less susceptible to this: the first increases the number of hash functions (d-ary cuckoo hashing), and the second adds multiple "bins" per bucket, as described by 'cmurphycode' in another comment. Then you can add a failsafe, a "stash" of last resort: https://www.eecs.harvard.edu/~michaelm/postscripts/esa2008fu...
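Here's a rough sketch of the lookup path for that combination (4-slot buckets plus a tiny stash); the layout and names are invented for illustration, and b0/b1 are the two bucket indices derived from independent hashes of key:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define SLOTS 4       /* slots ("bins") per bucket */
    #define STASH_MAX 4   /* tiny stash of last resort */

    struct bucket { uint64_t keys[SLOTS]; };

    struct cuckoo {
        struct bucket *buckets;     /* power-of-two bucket count */
        size_t mask;                /* bucket count - 1 */
        uint64_t stash[STASH_MAX];
        size_t stash_len;
    };

    static bool lookup(const struct cuckoo *t, uint64_t key,
                       size_t b0, size_t b1) {
        for (int i = 0; i < SLOTS; i++)
            if (t->buckets[b0 & t->mask].keys[i] == key)
                return true;
        for (int i = 0; i < SLOTS; i++)
            if (t->buckets[b1 & t->mask].keys[i] == key)
                return true;
        /* Failsafe: scan the stash of keys that wouldn't place. */
        for (size_t i = 0; i < t->stash_len; i++)
            if (t->stash[i] == key)
                return true;
        return false;
    }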
If the interface instead has the hash table call the hash function with two different seeds, that means double the time spent hashing, and that overhead can cover for a lot of linear probing.
If you can choose your own hash function, the hashing cost should be minimal even for a "perfect" hash. And a SIMD approach usually means you can create 2, 4, or 8 hashes with different seeds in the same time it takes to create a single hash: http://xoroshiro.di.unimi.it/xoroshiro128plus.c
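As a sketch of the multi-seed point: the lanes below are independent, so a vectorizing compiler (or plain out-of-order execution) can overlap them; the seeded splitmix64-style mix is an arbitrary stand-in, not a specific published hash:

    #include <stdint.h>

    /* Splitmix64-style finalizer, chosen for illustration. */
    static uint64_t mix(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    /* Four seeded hashes of the same key; the iterations are
     * independent, so they vectorize or pipeline cleanly. */
    static void hash4_seeded(uint64_t key, const uint64_t seeds[4],
                             uint64_t out[4]) {
        for (int i = 0; i < 4; i++)
            out[i] = mix(key ^ seeds[i]);
    }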
Finally, I don't think gathers are faster than independent memory accesses unless everything is in the same cache line.