Isn’t this just a trie with only two levels? This technique hardly seems obscure to me; it’s taught in most elementary CS courses.
I think it would be fair to say that it's a kind of funny trie-hash-table hybrid. What's neat though is that it manages to achieve better space bounds than either a trie or a hash table would on their own.
I didn't invent the algorithmic idea --- it's basically a simplification of a data structure that was published in ICALP 2003 [1]. As far as I know, the idea of combining hash tables and tries in this kind of way (despite being incredibly simple!) doesn't appear in either the theoretical or experimental literature prior to that.
I'd be interested in any other earlier sources that people might have.
Not quite: the second level is a hash table, which is very un-trie-like. What it does share with a trie is that you don't store the bits that are already implied by the first-level address.
Nope, there will just be 2^8 tables (constant overhead per table) each with 2^24 24-bit keys.
If you start with an array of any N keys, they have 32N bits of information. But once you insert them all into an unordered set like this, you throw away information by losing the order. There are only (2^32 choose N) unique sets of N keys, which is fewer than the (2^32)^N possible ordered arrays for N > 1, so the set carries less than 32N bits of information. This checks out in your example - there is only one set containing all the 32-bit integers, so it should take log2(2^32 choose 2^32) = 0 bits to store a set of that size.
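For concreteness, here's roughly how large that gap is for the 100,000-key example used further down. A throwaway estimate of mine (not from the article) using the log-gamma form of the binomial, so the enormous exact integer never has to be built:

```python
from math import lgamma, log

def log2_binomial(n, k):
    # log2 of "n choose k" via log-gamma, avoiding the huge exact integer.
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

N = 100_000
bits_set = log2_binomial(2**32, N)   # information content of the unordered set
bits_array = 32 * N                  # bits in the ordered array
print(f"set: {bits_set:,.0f} bits, array: {bits_array:,} bits")
# roughly 1.68 million bits vs 3.2 million bits
```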
It's not meant for storing all possible 32-bit integers. If you wanted to do that, it would take zero storage, because the answer to every membership query is just "yes".
What they are doing is using the fact that a hash table can be loaded to 80% or 90% and still work fine, while encoding part of the data (the first byte) implicitly in the hashed address. It's kind of cheating, because the user provides those bits back to the lookup function to get their answer, so they never need to be stored. And saving one byte out of four means you store only 75% of the bits, which is a bigger gain than the loss from the 80-90% load factor.
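Concretely, the layout being described looks something like this. A minimal sketch of the idea as I understand it (the class name, the fixed bucket size, and the per-slot occupancy byte are my own simplifications, not the article's code; a space-tight version would use a bitmap or a reserved sentinel instead of the occupancy array, and would resize buckets):

```python
class SplitHashSet:
    """Membership set for 32-bit keys: the top byte picks one of 256 buckets,
    and each bucket is an open-addressing table holding only 3-byte suffixes."""

    BUCKETS = 256  # 2^8 buckets, indexed by the key's top byte

    def __init__(self, slots_per_bucket=1024):
        self.n = slots_per_bucket
        # 3 bytes per slot: only the low 24 bits of each key are stored.
        self.data = [bytearray(3 * self.n) for _ in range(self.BUCKETS)]
        # One occupancy byte per slot, purely for clarity in this sketch.
        self.used = [bytearray(self.n) for _ in range(self.BUCKETS)]

    @staticmethod
    def _split(key):
        return key >> 24, key & 0xFFFFFF  # (bucket index, 24-bit suffix)

    def _probe(self, b, suffix):
        # Linear probing from a position derived from the suffix
        # (a real implementation would hash the suffix first).
        want = suffix.to_bytes(3, "little")
        i = suffix % self.n
        for _ in range(self.n):
            if not self.used[b][i]:
                return i, False          # hit an empty slot: not present
            if self.data[b][3 * i:3 * i + 3] == want:
                return i, True           # suffix already in this bucket
            i = (i + 1) % self.n
        return None, False               # bucket completely full

    def add(self, key):
        b, suffix = self._split(key)
        i, found = self._probe(b, suffix)
        if found:
            return
        if i is None:
            raise RuntimeError("bucket full; a real version would grow it")
        self.data[b][3 * i:3 * i + 3] = suffix.to_bytes(3, "little")
        self.used[b][i] = 1

    def __contains__(self, key):
        b, suffix = self._split(key)
        return self._probe(b, suffix)[1]
```

The 3-byte slots are exactly where the 25% saving comes from: at the same load factor, each stored entry costs 24 bits instead of 32, because the missing byte is recomputed from the key at lookup time.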
Say you have a normal hash table with a 90% load factor and you need to store 100,000 32-bit words: it would take 32 * 100,000 / 0.9 ≈ 3,555,556 bits.
But if you use this (or some related) trick and only store part of the key, you'd need 24 * 100,000 / 0.9 ≈ 2,666,667 bits, which is even less than the 32 * 100,000 = 3,200,000 bits of direct storage in a simple array.
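Or, as a quick check of those numbers (bits throughout; 0.9 is the load factor assumed above):

```python
N, load = 100_000, 0.9
print(32 * N / load)   # ~3,555,556 bits: normal hash table, full 32-bit keys
print(24 * N / load)   # ~2,666,667 bits: this trick, only 24-bit suffixes stored
print(32 * N)          #  3,200,000 bits: plain array, no load-factor overhead
```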