My last paragraph says you're probably better off with a solution that doesn't achieve the same density but avoids the uncorrelated accesses. In other words, stick to some form of local probing and take the hit in density.
Why would TLB misses be small compared to RAM latency? A TLB miss must be serviced by reading more memory, and misses aren't handled in parallel, unlike random reads to RAM. Sure, you could use very large (1G) pages, but that's a pretty specialised setup that's not available on every platform and tends to require a reboot to enable/tweak. Not something we want to rely on in general.
Cuckoo is particularly sensitive to bad hash functions: if a few elements always hash to the same pair of values, you're screwed. That's particularly problematic with the usual interfaces, which don't let the hash table pass a seed to the hash function and just expect a machine word back: we have to map values to hashes, then remix or split those hashes (in theory, that's defensible as long as the intermediate hash is strong and its codomain is at least (hash set size)^2), but there's nothing to remix away if too many values map to the same intermediate hash. If the interface instead has the hash table call the hash function with two different seeds, that means double the time spent hashing, and that overhead can cover for a lot of linear probing.
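To make the split concrete, here's a minimal sketch, assuming a power-of-two table and a strong 64-bit intermediate hash; the splitmix64-style finalizer is one arbitrary choice of remix, and, as above, no remix helps if many keys share the same intermediate hash:

    #include <stdint.h>
    #include <stddef.h>

    /* Splitmix64-style finalizer, used here as the remix step. */
    static uint64_t remix(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    /* Derive both cuckoo bucket indices from one intermediate hash h.
     * Sound only while 2 * log2_size <= 64, i.e. the codomain of h is
     * at least (table size)^2. */
    static void cuckoo_buckets(uint64_t h, unsigned log2_size,
                               size_t *b0, size_t *b1) {
        uint64_t m = remix(h);
        uint64_t mask = ((uint64_t)1 << log2_size) - 1;
        *b0 = (size_t)(m & mask);          /* low bits  */
        *b1 = (size_t)((m >> 32) & mask);  /* high bits */
    }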
Simple deterministic probing techniques just take a hit with a bigger cluster than expected: a performance degradation, but the table still works. You can also find theoretical analyses that lead to similar conclusions if you look at the k-independence needed for each hashing technique.
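For contrast, a bare-bones linear-probing lookup shows why the failure mode is graceful: a bad hash only lengthens the cluster the loop walks, and correctness doesn't depend on the hash at all (EMPTY is an assumed sentinel, and the table is assumed to always keep at least one free slot):

    #include <stdint.h>
    #include <stddef.h>

    #define EMPTY UINT64_MAX  /* assumed sentinel for free slots */

    /* Returns the index of key, or of the empty slot where it would go.
     * A bad hash means a longer walk, never a wrong answer; the loop
     * terminates as long as the table keeps at least one EMPTY slot. */
    static size_t probe(const uint64_t *table, size_t mask,
                        uint64_t key, uint64_t hash) {
        for (size_t i = hash & mask; ; i = (i + 1) & mask) {
            if (table[i] == key || table[i] == EMPTY)
                return i;
        }
    }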
Finally, I don't think gathers are faster than independent memory accesses unless everything is in the same cache line. You don't need SIMD instructions to expose memory-level parallelism; you just need independent dependency chains in your serial computation (until you're bottlenecked on an execution resource that's not duplicated and rarely pipelined, like the TLB miss logic).
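As a sketch of that point: the four loads below form four independent dependency chains, so an out-of-order core can keep all four cache misses in flight at once, no gather instruction required (the table/mask layout is assumed, as in the probing sketch above):

    #include <stdint.h>
    #include <stddef.h>

    /* Four independent loads: no result feeds another load's address,
     * so the core can overlap all four potential cache misses. */
    static void lookup4(const uint64_t *table, size_t mask,
                        const uint64_t idx[4], uint64_t out[4]) {
        out[0] = table[idx[0] & mask];
        out[1] = table[idx[1] & mask];
        out[2] = table[idx[2] & mask];
        out[3] = table[idx[3] & mask];
    }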
My last paragraph says you're probably better off with a solution that doesn't achieve the same density but avoids the uncorrelated accesses. In other words, stick to some form of local probing and take the hit in density.
Thanks, I was misreading.
Why would TLB misses be small compared to RAM latency?
Because for recent CPUs (post-P5 for Intel) the page walks that service a TLB miss use the standard data caching mechanisms, so for a frequently used hash table that reads only a couple of cache lines per lookup, the page tables usually remain in cache: http://electronics.stackexchange.com/a/67985.
So while the TLB miss requires a lookup, this lookup frequently doesn't require hitting RAM. My recollection is that this means a TLB miss usually costs only the relevant cache miss plus ~10 cycles. But this does require certain assumptions about the access pattern, and I've been meaning to retest this on recent hardware to be sure.
Cuckoo is particularly sensitive to bad hash functions: if a few elements always hash to the same pair of values, you're screwed.
Yes, although if you can choose a good hash function this should be rare. And there are variations of cuckoo hashing that are much less susceptible to this: the first increases the number of hash functions (d-ary cuckoo hashing), and the second adds multiple "bins" per bucket, as described by 'cmurphycode' in another comment. Then you can add a failsafe, a "stash" of last resort: https://www.eecs.harvard.edu/~michaelm/postscripts/esa2008fu...
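Here's a rough sketch of the lookup path for that combination (4-slot buckets plus a tiny stash); the layout and names are invented for illustration, and b0/b1 are the two bucket indices derived from independent hashes of key:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define SLOTS 4       /* slots ("bins") per bucket */
    #define STASH_MAX 4   /* tiny stash of last resort */

    struct bucket { uint64_t keys[SLOTS]; };

    struct cuckoo {
        struct bucket *buckets;     /* power-of-two bucket count */
        size_t mask;                /* bucket count - 1 */
        uint64_t stash[STASH_MAX];
        size_t stash_len;
    };

    static bool lookup(const struct cuckoo *t, uint64_t key,
                       size_t b0, size_t b1) {
        for (int i = 0; i < SLOTS; i++)
            if (t->buckets[b0 & t->mask].keys[i] == key)
                return true;
        for (int i = 0; i < SLOTS; i++)
            if (t->buckets[b1 & t->mask].keys[i] == key)
                return true;
        /* Failsafe: scan the stash of keys that wouldn't place. */
        for (size_t i = 0; i < t->stash_len; i++)
            if (t->stash[i] == key)
                return true;
        return false;
    }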
If the interface instead has the hash table call the hash function with two different seeds, that means double the time spent hashing, and that overhead can cover for a lot of linear probing.
If you can choose your own hash function, the hashing cost should be minimal even for a "perfect" hash. And a SIMD approach usually means you can create 2, 4, or 8 hashes with different seeds in the same time it takes to create a single hash: http://xoroshiro.di.unimi.it/xoroshiro128plus.c
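As a sketch of the multi-seed point: the lanes below are independent, so a vectorizing compiler (or plain out-of-order execution) can overlap them; the seeded splitmix64-style mix is an arbitrary stand-in, not a specific published hash:

    #include <stdint.h>

    /* Splitmix64-style finalizer, chosen for illustration. */
    static uint64_t mix(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    /* Four seeded hashes of the same key; the iterations are
     * independent, so they vectorize or pipeline cleanly. */
    static void hash4_seeded(uint64_t key, const uint64_t seeds[4],
                             uint64_t out[4]) {
        for (int i = 0; i < 4; i++)
            out[i] = mix(key ^ seeds[i]);
    }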
Finally, I don't think gathers are faster than independent memory accesses unless everything is in the same cache line.