I'm working on a video / post on how to solve the 1 billion row challenge (https://github.com/gunnarmorling/1brc) and get a competitively fast result while keeping the code readable and maintainable.
So far I'm within spitting distance of the winning entries without using any unsafe code or bit twiddling tricks or custom JVMs or anything like that, and having all the concerns nicely separated and modularized.
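For reference, the naive baseline that the tuned entries compete against looks something like this - a minimal sketch, not my actual entry (the file name and the Station;12.3 line format come from the challenge spec; everything else here is illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Stream;

    public class NaiveBaseline {
        // Per-station running aggregate: min, max, sum, count.
        static final class Stats {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum;
            long count;

            synchronized void add(double v) {
                min = Math.min(min, v);
                max = Math.max(max, v);
                sum += v;
                count++;
            }

            @Override public synchronized String toString() {
                // min/mean/max, one decimal, per the challenge's output format
                return String.format("%.1f/%.1f/%.1f", min, sum / count, max);
            }
        }

        public static void main(String[] args) throws IOException {
            Map<String, Stats> byStation = new ConcurrentHashMap<>();
            try (Stream<String> lines = Files.lines(Path.of("measurements.txt"))) {
                lines.parallel().forEach(line -> {
                    int sep = line.indexOf(';');             // input format: Station;12.3
                    String station = line.substring(0, sep);
                    double value = Double.parseDouble(line.substring(sep + 1));
                    byStation.computeIfAbsent(station, k -> new Stats()).add(value);
                });
            }
            System.out.println(new TreeMap<>(byStation));    // alphabetical output
        }
    }

The competitive versions mostly win by avoiding the per-line String and boxed-double allocations and parsing bytes directly, but the overall shape - parallel scan, per-station aggregates, sorted output - stays the same.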
Counterthoughts: a) These skills fit on a double-sided sheet of paper (e.g. the Claude Code best practices doc), and b) what these skills are has been changing so rapidly that even the best-practices docs fall out of date super quick.
For example, managing the context window has become less of a problem with the increased context windows in newer models, and tools like the auto-resummarization / context-window refresh in Claude Code mean you might be just fine without doing anything yourself.
All this to say: the idea that you're left significantly behind if you aren't training yourself on this feels bogus (I say this as a person who /does/ use these tools daily). It should take any programmer no more than a few hours to learn these skills from scratch with the help of a doc, meaning any employee you hire should be able to pick them up no problem. I'm not sure it makes sense as a hiring filter. Perhaps in the future this will change. But right now these tools are built more like user-friendly appliances - more like a cellphone or a toaster than a technology you have to wrap your head around, like a compiler or a database.
In some contexts, dictionary encoding (which is approximately what you're suggesting) can actually work great - for example, for common values, or for nulls (a common kind of common value). It's just less efficient to try to do it with /every/ block. You have to make it "worth it", which comes down to how frequently the value occurs. Shorter values give you a worse compression ratio on one hand, but on the other hand they're often likelier to show up in the data, which makes up for it, to a point.
There are other similar lightweight encoding schemes, like RLE, delta, and frame-of-reference encoding, each good for a different data distribution.
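Toy sketches of a few of these, to show how simple the core ideas are (the code and names are mine, not from any particular engine; frame-of-reference is close enough to delta that I've left it out):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LightweightEncodings {

        // Dictionary encoding: replace each distinct value with a small integer
        // code. Worth it only when values repeat often enough that the codes
        // plus the dictionary take less space than the raw values.
        static int[] dictionaryEncode(List<String> values, Map<String, Integer> dictOut) {
            int[] codes = new int[values.size()];
            for (int i = 0; i < values.size(); i++) {
                codes[i] = dictOut.computeIfAbsent(values.get(i), k -> dictOut.size());
            }
            return codes;
        }

        // Run-length encoding: collapse runs of equal adjacent values into
        // (value, runLength) pairs. Great for sorted or low-cardinality
        // columns, useless for noisy ones.
        static List<int[]> runLengthEncode(int[] values) {
            List<int[]> runs = new ArrayList<>();
            int i = 0;
            while (i < values.length) {
                int j = i;
                while (j < values.length && values[j] == values[i]) j++;
                runs.add(new int[] { values[i], j - i });
                i = j;
            }
            return runs;
        }

        // Delta encoding: keep the first value, then store consecutive
        // differences. Near-sorted data yields tiny deltas that bit-pack
        // or compress well downstream.
        static int[] deltaEncode(int[] values) {
            int[] out = new int[values.length];
            for (int i = 1; i < values.length; i++) out[i] = values[i] - values[i - 1];
            if (values.length > 0) out[0] = values[0];
            return out;
        }

        public static void main(String[] args) {
            Map<String, Integer> dict = new HashMap<>();
            System.out.println(Arrays.toString(
                    dictionaryEncode(List.of("US", "US", "DE", "US", "FR"), dict))
                    + " dict=" + dict);
            runLengthEncode(new int[] { 7, 7, 7, 3, 3 })
                    .forEach(run -> System.out.println("value=" + run[0] + " len=" + run[1]));
            System.out.println(Arrays.toString(deltaEncode(new int[] { 100, 101, 103, 106 })));
        }
    }

The whole game is matching the scheme to the block's distribution: dictionary for repetitive values, RLE for runs, delta for sorted-ish sequences.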
This is my time to shine - I know the cause of this mistake. Like the article mentions, international trade is specified using the HS (Harmonized System) encoding mechanism.
Now, the level at which product-group data is most frequently and easily available is the 4-digit level, which is quite broad. If you look at code 3002 in the HS classification system (of which there are many versions, but we'll ignore that for now), you'll find a category, succinctly named:
> "Human blood; animal blood prepared for therapeutic, prophylactic or diagnostic uses; antisera, other blood fractions and immunological products, whether or not modified or obtained by means of biotechnological processes; vaccines, toxins, cultures of micro-organisms (excluding yeasts) and similar products; cell cultures, whether or not modified:"
People new to trade data, especially programmers with some hubris, tend to think this is way too long a category name to fit in a title or dropdown, so they chop it at the semicolon and call it good, resulting in "Human Blood" or similar. Better data sources tend to shorten these based on the real-world share of the subcategories, e.g. see here "Serums and vaccines":
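In code, the mistake is roughly this (a hypothetical repro, not pulled from anyone's actual pipeline; the heading text is abridged):

    public class HsLabelChop {
        public static void main(String[] args) {
            // Heading 3002's full text, abridged with "..." here for brevity.
            String heading = "Human blood; animal blood prepared for therapeutic, "
                    + "prophylactic or diagnostic uses; antisera, other blood fractions ...";
            // The naive shortening: chop at the first semicolon and call it good.
            String label = heading.substring(0, heading.indexOf(';'));
            System.out.println(label); // prints: Human blood
        }
    }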
God I loved that lobby and the Art Deco + Mexican combination art style. I found a high-res version of that mural as a wallpaper at some point but am coming up short for the link right now.
It's more that the "negative nancies" became necessary nancies. Back when Amazon sold books, they were a considerable player, but otherwise: big whoop. Now they threaten to dominate logistics AND hosting, and are expanding their grip and stamping out competition in other markets. Google is pretty much synonymous with the web. Meta owns a big chunk of messaging and social media. Computers used to not matter much, but now we're glued to one.
It costs even more to be reckless today.
Re: "whitey on the moon" - I'm not sure the space program would be my first target there but I think it makes a more poetic contrast and forces people to pay attention by targeting a beloved cultural narrative. Cyberpunk - by my reckoning a bit later - has been preaching a very similar message of massive inequality in the presence of incredible technology and wealth disparity and power concentration. And yet that doesn't draw the same ire. I guess in that case it's easier to dismiss the core message because robot limbs and cool neon lights are too much of a distraction.
You might enjoy the excellent Articles of Interest podcast - an episode of it covers this exact phenomenon, and there are many other great episodes on similar subjects in clothing and fashion.
Especially for a first-time-in-all-of-humanity type of mission, launched half a century ago, which yielded brand-new data on faraway objects we'd never had before - and considering it's still going and reporting data - it's arguably a bargain-basement price for such a thing.
This kind of post is what brings me back to this website :-)
I'm the guy with the enthusiastic thread earlier in this post. I'd love to sit down and chat with you for an hour on Zoom and hear all about those times, and then we could post the video on here - I think people would appreciate it.
I have absolutely zero experience in interviewing people, nor do I have a media channel of any kind, but I promise I'd do my best to ask interesting questions. If that sounds interesting, shoot me an email (you can find it in my profile).
Excited to share soon!