Nova really made the perfect product. I remember paying ₹400 or so for the premium version back in 2014. I used it until 2023, when I switched to an iPhone as my daily driver.
Though I do not know the situation at the firm you were interviewing with, if there is an unexpected increase in data volume, or a job fails on certain days, or you need to do some sort of historical data load (>= 6 months at 1 GB of data per day), a solution that runs on a single VM might not scale. But again, interviews are partially about problem solving and partially about checking compliance, at least for IC roles (in my anecdotal experience).
That being said, yeah, I too have done similar stuff, where some data engineering jobs could run on a single VM but some really did need Spark, so the team decision was to fit the smaller square peg into a larger square hole and call it a day. In fact, I had spent time refactoring one particular pivotal job to run as an API deployed on our "macrolith" and integrated with our Airflow, but it was rejected, so I stopped caring about engineering hygiene.
You can parse JSON at several GB/s: https://github.com/simdjson/simdjson
And you could scale that by one or two orders of magnitude with thread-based parallelism on recent AMD Epyc or Intel Xeon CPUs. So parsing alone should not pose a problem (maybe even sub-second for 6 months of data). We would need a more precise problem statement to judge whether horizontal scaling is needed.
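For what it's worth, here is that back-of-envelope arithmetic as a rough sketch. Every constant is an assumption picked for illustration (3 GB/s per core as a stand-in for "several GB/s", a 100x speedup as a stand-in for "two orders of magnitude", 182 days for six months), and it counts parsing only, not disk or network I/O.

```python
# Back-of-envelope estimate. All constants are assumptions, not measurements.
GB = 10**9

daily_volume_bytes = 1 * GB      # 1 GB/day, the figure used in the thread
days = 182                       # roughly six months
per_core_parse_rate = 3 * GB     # assumed stand-in for "several GB/s" per core
parallel_speedup = 100           # assumed two orders of magnitude from threads


def parse_seconds(total_bytes, bytes_per_second):
    """Time to parse total_bytes at a fixed rate, ignoring I/O entirely."""
    return total_bytes / bytes_per_second


total_bytes = daily_volume_bytes * days
single_core_seconds = parse_seconds(total_bytes, per_core_parse_rate)
many_core_seconds = parse_seconds(
    total_bytes, per_core_parse_rate * parallel_speedup
)

print(f"total data:  {total_bytes / GB:.0f} GB")
print(f"single core: {single_core_seconds:.1f} s")
print(f"many cores:  {many_core_seconds:.2f} s")
```

Under those assumptions the parse takes about a minute on one core and under a second with the parallel speedup, so in practice the limit would be how fast you can feed the data in, not the parser.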
As other commenters pointed out, 1 GB/day isn't a problem for storage and retroactive processing until you get to, like, hundreds of years of data. You can chew through a few hundred TB of JSON data in a day, per core + NVMe drive.
Regardless, storage and retroactive processing weren't part of the problem. The problem was explicitly "parse json records as they come in, in a big batch, and increment some integers in a database".
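To make that problem statement concrete, here is a minimal single-process sketch using only the Python standard library. The record fields (`name`, `count`), the `counters` table, and the `ingest_batch` helper are all hypothetical, since the actual interview schema isn't given here. It also assumes an SQLite version with upsert support (3.24 or newer).

```python
import json
import sqlite3
from collections import Counter


def ingest_batch(conn, lines):
    """Parse newline-delimited JSON records and add their counts to the DB."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the batch
        record = json.loads(line)
        # "name" and "count" are hypothetical field names.
        counts[record["name"]] += record.get("count", 1)

    # One transaction per batch: insert new counters, increment existing ones.
    with conn:
        conn.executemany(
            "INSERT INTO counters (name, total) VALUES (?, ?) "
            "ON CONFLICT(name) DO UPDATE SET total = total + excluded.total",
            list(counts.items()),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counters (name TEXT PRIMARY KEY, total INTEGER NOT NULL)"
)

ingest_batch(conn, [
    '{"name": "clicks", "count": 2}',
    '{"name": "views"}',
    '{"name": "clicks", "count": 3}',
])
ingest_batch(conn, ['{"name": "views", "count": 4}'])

totals = dict(conn.execute("SELECT name, total FROM counters"))
print(totals)
```

Summing in memory first and writing once per batch keeps the database work proportional to the number of distinct counters, not the number of records.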
I'm not going to figure out what the upper limit is on a single bare-metal machine, but you can be damn sure it's a metric fuck-ton higher than 1 GB/day. You can do a lot with 10 TB of memory and 256 cores.
If we are talking about cloud VMs: sure, their CPU performance is atrocious and I/O can be horrible. This won't scale to infinity.
But if there's the option to run this on a fairly modest dedicated machine, I'd be comfortable that any reasonable solution for pure ingest could scale to five orders of magnitude more data, and still about four orders of magnitude if we need to look at historical data. Of course you could scale well beyond that, but at that point it would be actual work.
IDK, it's been pretty solid (but it does mess up), which is where I come in. It has helped me work with Databricks (reading from and writing to it) and train a model with it for some of our customers, though it's not in prod.
Would the Hacker News community allow something like this, or be interested in doing it? Say I (or perhaps the OP) were to create this post every month: would that go against the terms, or would it still be allowed?
I think it can be allowed, but I still just want to confirm whether the community really wants this.
I saw an aspect of vulnerability on Hacker News I hadn't seen before, which made things feel real, at least to me.
I myself am a DVD enthusiast (insofar as I have a copy of the TDK trilogy and the Raimi trilogy, plus a few other classic movies/shows and songs from the 00s). There are a few shows I enjoyed as a teen that I no longer have any way to watch legally in my country, so for me the main motivator is the ability to never lose those movies, even with streaming platforms around. (However, I do not have a functional DVD player anymore, which sucks.)
So I think let's not shame people for what they do on their own time, which really affects none of us.
Umm, yes? The metro, even if it's not a big deal in the States, has changed public transport in a small but quiet way. Add to that moving freight and people over large distances, plus the bullet train that brought luxury, speed and efficiency to trains. All of these are quietly disruptive transformations that I think we all take for granted.