anshumankmr · 2026-01-20T11:22:51 1768908171

Don't update? Could run for ages I feel.

anshumankmr · 2026-01-20T11:22:04 1768908124

Nova really made the perfect product. I remember paying ₹400 or something for the premium version back in 2014. I used it until 2023, when I switched to an iPhone as my daily driver.

anshumankmr · 2026-01-19T09:26:18 1768814778

Though I do not know the situation at the firm you were interviewing at, if there is some unexpected increase in data volume, or say a job fails on certain days, or you need to do some sort of historical data load (>= 6 months of 1 gig of data per day), the solution of running it on a single VM might not scale. But again, interviews are partially about problem solving and partially about checking compliance, at least for IC roles (in my anecdotal experience).

That being said, yeah, I too have done some similar stuff where some data engineering jobs could be run on a single VM but some jobs really did need Spark, so the team decision was to fit the smaller square peg into a larger square hole and call it a day. In fact, I had spent time refactoring one particular pivotal job to run as an API deployed on our "macrolith" and integrated with our Airflow, but it was rejected, so I stopped caring about engineering hygiene.

johndough · 2026-01-19T13:14:10 1768828450

    (>= 6 months of 1 gig of data per day)

You can parse JSON at several GB/s: https://github.com/simdjson/simdjson And you could scale that by one or two orders of magnitude with thread-based parallelism on recent AMD Epyc or Intel Xeon CPUs. So parsing alone should not pose a problem (maybe even sub-second for 6 months of data). We would need a more precise problem statement to judge whether horizontal scaling is needed.
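
A back-of-envelope sketch of that estimate. Every number here is an assumption rather than a measurement: 3 GB/s stands in for "several GB/s" per thread, 64 threads stands in for a recent server CPU, and scaling is treated as perfectly linear, which real workloads rarely achieve.

```python
# Rough parse-time estimate; all inputs are assumed values, not benchmarks.
GB = 10**9


def parse_seconds(total_bytes, bytes_per_sec_per_thread, threads=1):
    """Seconds to parse total_bytes, assuming perfectly linear thread scaling."""
    return total_bytes / (bytes_per_sec_per_thread * threads)


total = 180 * GB  # roughly 6 months at 1 GB/day
rate = 3 * GB     # assumed single-thread parse rate

single = parse_seconds(total, rate)
parallel = parse_seconds(total, rate, threads=64)

print(single)    # 60.0
print(parallel)  # 0.9375
```

Under these assumptions the parse step alone is about a minute on one thread and around a second with 64. Reading the data off disk and writing results are not included.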

anshumankmr · 2026-01-20T05:17:16 1768886236

> https://github.com/simdjson/simdjson

Was not aware of this. It does not seem to be available natively in Python, but it looks cool. Will try it out in future.

jesse__ · 2026-01-19T17:43:26 1768844606

As other commenters pointed out, 1 GB/day isn't a problem for storage and retroactive processing until you get to, like, hundreds of years of data. You can chew through a few hundred TB of JSON data in a day, per core + NVMe drive.

Regardless, storage and retroactive processing weren't part of the problem. The problem was explicitly "parse JSON records as they come in, in a big batch, and increment some integers in a database".

I'm not going to figure out what the upper limit is on a single bare-metal machine, but you can be damn sure it's a metric fuck-ton higher than 1 GB/day. You can do a lot with 10 TB of memory and 256 cores.
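
For a sense of how small that problem statement is, here is a minimal single-process sketch using only the Python standard library. The table name `counters`, the record field `event`, and the helper `ingest_batch` are all hypothetical; the actual schema from the interview is not known.

```python
import json
import sqlite3
from collections import Counter


def ingest_batch(conn, lines, key="event"):
    """Parse one batch of JSON lines and add per-key counts to the database."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record[key]] += 1

    # One transaction per batch: create missing rows, then increment.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO counters(name, n) VALUES (?, 0)",
            [(name,) for name in counts],
        )
        conn.executemany(
            "UPDATE counters SET n = n + ? WHERE name = ?",
            [(n, name) for name, n in counts.items()],
        )
    return counts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, n INTEGER NOT NULL)")

batch = ['{"event": "click"}', '{"event": "view"}', '{"event": "click"}']
ingest_batch(conn, batch)
ingest_batch(conn, batch)  # a second batch increments the same rows

result = dict(conn.execute("SELECT name, n FROM counters ORDER BY name"))
print(result)  # {'click': 4, 'view': 2}
```

Counting in memory first and writing once per batch keeps the database work proportional to the number of distinct keys, not the number of records.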

wongarsu · 2026-01-19T09:55:14 1768816514

If we are talking about cloud VMs: sure, their CPU performance is atrocious and I/O can be horrible. This won't scale to infinity.

But if there's the option to run this on a fairly modest dedicated machine, I'd be comfortable that any reasonable solution for pure ingest could scale to five orders of magnitude more data, and still about four orders of magnitude if we need to look at historical data. Of course you could scale well beyond that, but at that point it would be actual work.
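
To put those orders of magnitude in concrete terms, a quick conversion from daily volume to the sustained rate a machine would have to keep up with. The helper name is made up for illustration, and the only inputs are the 1 GB/day baseline and the scale factors mentioned above.

```python
SECONDS_PER_DAY = 86_400


def sustained_gb_per_s(gb_per_day):
    """Average ingest rate in GB/s needed to keep up with a daily volume."""
    return gb_per_day / SECONDS_PER_DAY


five_orders = sustained_gb_per_s(1 * 10**5)  # 100 TB/day
four_orders = sustained_gb_per_s(1 * 10**4)  # 10 TB/day

print(round(five_orders, 2))  # 1.16
print(round(four_orders, 2))  # 0.12
```

So five orders of magnitude above 1 GB/day averages out to a little over 1 GB/s of sustained ingest.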

ahoka · 2026-01-19T09:45:52 1768815952

“6 months of 1 gig of data per day”

Then you would need an enormous 2TB storage device. /s

anshumankmr · 2026-01-16T08:53:37 1768553617

IDK, it's been pretty solid (but it does mess up), which is where I come in. But it has helped me work with Databricks (reading from and writing to it) and train a model using it for some of our customers, though it's NOT in prod.

anshumankmr · 2026-01-16T05:30:23 1768541423

Could have it be an automated monthly thing, like the "Who is hiring?" and "Who wants to be hired?" posts.

Imustaskforhelp · 2026-01-16T09:36:05 1768556165

Yes! I think this can work!

Would the Hacker News community allow something like this, or be interested in doing it? Say, if I (or perhaps the OP) were to create this post every month, would that go against the terms, or would it still be allowed?

I think it can be allowed, but I still want to confirm that the community really wants this.

I saw an aspect of vulnerability on Hacker News that I hadn't seen before, which made things feel real, at least to me.

publicdebates · 2026-01-16T17:29:59 1768584599

I don't think an automated thing would work.

But I do think this thread is far too big to keep up with.

My plan was to post a similar but more focused thread in a month, and go from there.

anshumankmr · 2026-01-14T22:49:06 1768430946

https://www.anshumankumar.dev/

anshumankmr · 2026-01-08T03:53:20 1767844400

I present to you https://openai.com/sam-and-jony/

anshumankmr · 2026-01-05T04:43:01 1767588181

Alternatively, rubber tracks are also great, if you have one nearby.

chakintosh · 2026-01-05T09:53:38 1767606818

Or just treadmills. I find them gentler on my joints than concrete because they're slightly cushioned.

anshumankmr · 2026-01-02T06:36:33 1767335793

I myself am a DVD enthusiast (insofar as I have a copy of the TDK trilogy and the Raimi trilogy, plus a few other classic movies/shows and songs from the 00s). There are a few shows that I enjoyed as a teen that I no longer have any way to legally watch in my country, so for me the main motivator is the ability to never lose those movies, regardless of what the streaming platforms carry. (However, I do not have a functional DVD player anymore, which sucks.)

So I think let's not shame people for what they do on their own time that really affects none of us.

anshumankmr · 2025-12-31T07:10:01 1767165001

Umm yes? The metro, even if it's not a big deal in the States, has changed public transport in a small but quiet way. Add to that moving freight, moving people over large distances, and the bullet train that brought luxury, speed and efficiency to trains. All of these are quietly disruptive transformations that I think we all take for granted.