I've been checking large (tens to hundreds of MB) tarballs into a single git repo that I use for managing a website archive for a few years now, and it can be made to work, but it's very painful.
I think there are three main issues:
1. Since it's a distributed VCS, everyone must have a full copy of the entire repo. That means anyone cloning the repo, or pulling any significant run of commits, ends up downloading vast amounts of binary data. If you can copy the .git dir directly to the other machine instead of using git's normal cloning mechanism then it's not as bad, but you're still fundamentally copying everything:
$ du -sh .git
55G .git
2. git doesn't "know" that something is a binary (although it seems to in some circumstances), so some common operations try to search binaries, or otherwise operate on them, as if they were text; a .gitattributes sketch for declaring this explicitly follows the list. (I just ran git log -S on that repo and git ran out of memory and crashed, on a machine with 64GB of RAM.)
3. The cure for this (git lfs) is worse than the disease. LFS is so bad/strange that I stopped using it and went back to putting the tarballs in git.
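As an aside on point 2, git can at least be told explicitly which paths are binary via .gitattributes, which stops the textual diff and merge machinery from touching them. A minimal sketch, assuming the archive's tarballs match these patterns (the file name in the last command is hypothetical):

$ echo '*.tar.gz binary' >> .gitattributes   # "binary" is the built-in macro for -diff -merge -text
$ echo '*.tgz binary' >> .gitattributes
$ git add .gitattributes
$ git diff HEAD -- site-2021.tar.gz          # if this file changed, the diff now says "Binary files differ"

I'm not sure this would have saved the git log -S run above, which appears to read blob contents regardless, but it does keep plain diffs and merges from treating tarballs as text.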
Source control for large data.
Currently our biggest repository is 17 TB.
Would love for you to try it out. It's open source, so you can self-host as well.
Why would someone check binaries into a repo? The only time I came across checked-in binaries in a repo was because that particular dev could not be bothered to learn nuget / MAVEN. (The dev that approved that PR did not understand that either.)
Because it’s way easier if you don’t require every level designer to spend 5 hours recompiling everything before they can get to work in the morning, because it’s way easier to just check in that weird DLL than to provide weird instructions to retrieve it, because onboarding is much simpler if all the tools are in the project, …
Hmm, I do not get it... "The binaries are checked into the repo so that the designer would not spend 5 hours recompiling" vs. "the binaries come from a nuget site so that the designer would not spend 5 hours recompiling".
In both cases the designer does not recompile, but in the second case there are no checked-in binaries in the repo... I still think nuget / MAVEN would be more appropriate for this task...
Everything is in P4: you check out the project to work on it, and you have everything. You update, you have everything up to date. All the tools are there, so any part of the pipeline can rely on anything that's checked in. You need an older version, you just check that out and off you go. And you have a single repository to maintain.
VCS + Nuget: half the things are in the VCS; you check out the project and then you have to hunt down a bunch of packages from a separate thing (or five). When you update the repo you have to update those packages too, hopefully without forgetting any of the ones you use; scripts run on a prayer that you have fetched the right things, or they crash; version sync is a crapshoot; and hope you're not working on multiple projects at the same time that need different versions of a utility, either. Now you need 15 layers of syncing and version management on top of each project to replicate half of what just checking everything into P4 gives you for free.
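To make the contrast concrete, here is roughly what the two daily workflows look like; the depot path, changelist number and solution name below are made up:

# Perforce: one sync gives you code, binaries and tools at one consistent revision
$ p4 sync //depot/project/...@12345

# git + NuGet: two systems have to agree before anything runs
$ git pull
$ nuget restore Project.sln    # or: dotnet restore

The complaint above is essentially about that second step: an extra moving part per machine, per project.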
> VCS + Nuget: half the things are in the VCS; you check out the project and then you have to hunt down a bunch of packages from a separate thing
Oh, and there are things like x509/proxy/whatever errors when you're on a corporate machine that has ZScaler or some such, so you have to use an internal Artifactory or similar, but that doesn't have the version you need, or you need permissions to access it, and so on.
I have no idea what environment / team you worked on, but nuget is pretty much rock solid. There are no scripts running on a prayer that everything is fetched. Version sync is not a crapshoot, because nuget versions are updated during merges, and with proper merge procedures (PR build + tests) nuget versions are always correct on the main branch.
One does not forget which nugets are used: VS projects do that bookkeeping for you. You update the VS project with the new nugets your task requires, and this bookkeeping carries over when you merge your PR.
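For illustration, the bookkeeping in question is just an entry in the project file; a rough sketch using the dotnet CLI, where the package name and project path are invented:

$ dotnet add MyApp/MyApp.csproj package Serilog --version 3.1.1
$ git diff MyApp/MyApp.csproj    # diff abbreviated
+    <PackageReference Include="Serilog" Version="3.1.1" />
$ dotnet restore                 # anyone who pulls the merged PR restores exactly that version

Because the reference lives in a text file, it merges and gets reviewed like any other change in the PR.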
I have seen this model work with no issues in large codebases: VS solutions with upwards of 500,000 lines of code and 20-30 engineers.
But if you have to do this via Visual Studio, it's no good for the people that don't use Visual Studio.
Also, where does nuget get this stuff from? It doesn't build this stuff for you, presumably, and so the binaries must come from somewhere. So, you just got latest from version control to get the info for nuget - and now nuget has to use that info to download that stuff?
And that presumably means that somebody had to commit the info for nuget, and then separately upload the stuff somewhere that nuget can find it. But wait a minute - why not put that stuff in the version control you're using already? Now you don't need nuget at all.
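For reference, the two-step flow being described looks roughly like this; the feed URL, API key and package name are placeholders:

# producer: build the package and upload it to a feed somewhere
$ dotnet pack MyTool/MyTool.csproj -c Release
$ dotnet nuget push MyTool/bin/Release/MyTool.1.2.3.nupkg --source https://nuget.example.com/v3/index.json --api-key <key>

# consumer: the repo only carries the reference; restore downloads the binary from the feed
$ git pull
$ dotnet restore

Whether maintaining that feed is better or worse than just committing the binary is the disagreement in this thread.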
Because it's (part of) a website that hosts the tarballs, and we want to keep the whole site under version control. Not saying it's a good reason, but it is a reason.