I Want Decentralized Version Control for Structured Data (jonas-schuermann.name)
160 points by goranmoomin on April 12, 2020 | 63 comments


> ...there are great decentralized version control systems like Pijul/Git/Fossil and many more that check every requirement that I have, but they are built to work with textual data and are therefore unsuited to be a database backend of a graphical application.

> I basically want a DVCS that doesn't operate on text files, but on a proper data model like relational algebra or algebraic datatypes.

Prolog + git.

Storing a text file of Prolog rules defining your DB is not as wild as it might sound. It's IMO much nicer to work with than SQL.

- - - -

TerminusDB (in Prolog)

> TerminusDB started life as a quality control system and only later morphed into a full database because we couldn't find a storage layer that was fast enough to support quality controlled transactions at scale.

> Terminus uses the W3C's OWL language to define the schema of its databases. This gives Terminus a uniquely powerful and expressive way of defining rules for the shape and structure of the data that it stores.

> OWL is a very rich language based on first order logic and set theory.

https://terminusdb.com/docs/schema

- - - -

CQL (in Java, I believe)

> The open-source Categorical Query Language (CQL) and integrated development environment (IDE) performs data-related tasks — such as querying, combining, migrating, and evolving databases — using category theory,

https://www.categoricaldata.net/


Similarly, if you are merely destructuring databases into text formats, most SQL database engines are pretty good about round-tripping their "dump" formats. For modestly sized SQLite databases, for instance, it is entirely possible to store the SQL dumps in git and rebuild the database from them whenever you need to query or work with the data. (Whether it is reasonable to do it that way is another question.)
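For example, here is a rough sketch of that round trip using Python's built-in sqlite3 module (file names invented; the sqlite3 shell's .dump command does the same job):

    import sqlite3

    def dump_to_sql(db_path, sql_path):
        """Serialize a SQLite database to a plain-text SQL dump for git."""
        with sqlite3.connect(db_path) as conn, open(sql_path, "w") as f:
            for statement in conn.iterdump():
                f.write(statement + "\n")

    def rebuild_from_sql(sql_path, db_path):
        """Recreate a database from the committed dump."""
        with sqlite3.connect(db_path) as conn, open(sql_path) as f:
            conn.executescript(f.read())

    dump_to_sql("app.db", "app.sql")           # commit app.sql alongside your code
    rebuild_from_sql("app.sql", "scratch.db")  # rebuild when you need to query it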


TerminusDB is about to launch a set of features that are all about... (drum roll) Decentralized Version Control for Structured Data!


We needed to solve a similar problem: version-controlling and synchronizing .json files from different machines (annotations for ML models).

Writing a custom git merge driver was quite painless: a command-line script (written in Python) with task-specific logic for merging data from these .json files. Load the files, parse them, decide how to combine them, detect unresolvable conflicts, etc.

It seems one may need custom logic to merge structured data; there is no single best solution. This could make creating a generic tool harder.

git is not a bad base technology for this. I'm not sure what else we are missing (e.g. better diffs for structured data?), because .json is still text; it is just merges that are unreliable if you treat .json as text. There are also caveats - e.g. you can't install a custom merge driver on GitHub, so the "merge" button becomes dangerous. But overall this approach works fine for .json.
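For reference, a minimal sketch of what such a driver can look like. The merge policy here (merge top-level keys, conflict if both sides changed the same key) is invented for illustration; a real driver needs the task-specific logic described above, and this assumes the top level of each file is a JSON object:

    #!/usr/bin/env python3
    # Registered in git config as:  driver = json-merge.py %O %A %B
    # %O = common ancestor, %A = ours (the result is written back here), %B = theirs
    import json, sys

    def three_way(base, ours, theirs):
        """Merge top-level keys; conflict if both sides changed a key differently."""
        merged, conflicts = dict(ours), []
        for key in set(base) | set(ours) | set(theirs):
            b, o, t = base.get(key), ours.get(key), theirs.get(key)
            if o == t or t == b:
                continue                        # keep our side
            if o == b:                          # only their side changed it
                if key in theirs:
                    merged[key] = t
                else:
                    merged.pop(key, None)
            else:
                conflicts.append(key)           # both sides changed it
        return merged, conflicts

    base_p, ours_p, theirs_p = sys.argv[1:4]
    base, ours, theirs = (json.load(open(p)) for p in (base_p, ours_p, theirs_p))
    merged, conflicts = three_way(base, ours, theirs)
    if conflicts:
        sys.exit(1)                             # non-zero tells git the merge failed
    json.dump(merged, open(ours_p, "w"), indent=2, sort_keys=True)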


Have you looked into DVC[1] for versioning the data and pipelines that generate them? I have set up a few versioned dataset repositories with it now and quite like it, especially the ability to simply `dvc import` the versioned data into projects and checkout different versions for testing with various models.

It operates on data at the same level as git but with features needed for large datasets and is totally language and framework agnostic like git.

[1]: https://dvc.org/


We looked into it, but it seems to be solving a different problem - how to handle large data. Does it solve merging of structured data?

E.g. a json file is changed on 2 machines, and you need to merge the changes. Sometimes you can merge (e.g. 2 different entries in an array where people are adding annotations); sometimes you need to raise an error - e.g. changes to a single record, but in different fields - depending on the problem, you may disallow that to keep the record consistent.
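Concretely, here is a toy sketch of the kind of policy I mean (the record shape and the "id" key are invented): annotations are a list of records keyed by id; records added or edited on only one side merge cleanly, while a record edited on both sides is rejected:

    def merge_records(base, ours, theirs):
        """Three-way merge of lists of {'id': ...} records."""
        index = lambda recs: {r["id"]: r for r in recs}
        b, o, t = index(base), index(ours), index(theirs)
        merged, conflicts = {}, []
        for rid in set(b) | set(o) | set(t):
            ours_r, theirs_r = o.get(rid), t.get(rid)
            if ours_r == theirs_r:
                result = ours_r                 # identical on both sides
            elif theirs_r == b.get(rid):
                result = ours_r                 # changed only on our side
            elif ours_r == b.get(rid):
                result = theirs_r               # changed only on their side
            else:
                conflicts.append(rid)           # edited on both sides: disallow
                continue
            if result is not None:              # None means deleted
                merged[rid] = result
        return list(merged.values()), conflicts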


I wanted to write a git diff for files like KiCad or even Word. I didn’t know custom git merges were a thing. Do you have a link for how to get started?


Both custom diff and merge drivers are described at a high level in gitattributes(5)¹. They're pretty useful even in really basic ways, such as adding a textconv with "jq -S ." or "xmllint --pretty 2" to pretty-print JSON or XML before calculating diffs.

Plus, if you've already dipped into those docs to see the diff options, be sure to check the funcname attribute too. It allows you to add custom diff(1)-style `--show-function-line` options. For example, you can use an ugly regex such as `^\\[\\(.*\\)\\]$` to guess section names in .ini file diffs. Or the wordRegex option to make CSV files break on fields with `git diff --word-diff`. Or... well, thousands of other things. There are tonnes of things you can do to improve diffs and merges for textual data, in addition to the things you may want to do with binary blobs.

1. https://git-scm.com/docs/gitattributes
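For instance, the jq trick wires up roughly like this (the driver name "pretty-json" is arbitrary):

    # .gitattributes
    *.json  diff=pretty-json

    # .git/config (or ~/.gitconfig)
    [diff "pretty-json"]
        textconv = jq -S .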


The API is quite simple: you need to implement a script which takes 3 arguments, writes the result of the merge to a file, and exits with a non-zero status code in case of a merge conflict. Quote from https://git-scm.com/docs/gitattributes#_defining_a_custom_me...:

To define a custom merge driver filfre, add a section to your $GIT_DIR/config file (or $HOME/.gitconfig file) like this

  [merge "filfre"]
    name = feel-free merge driver
    driver = filfre %O %A %B %L %P
    recursive = binary
The merge.*.name variable gives the driver a human-readable name.

The merge.*.driver variable’s value is used to construct a command to run to merge ancestor’s version (%O), current version (%A) and the other branches' version (%B). These three tokens are replaced with the names of temporary files that hold the contents of these versions when the command line is built. Additionally, %L will be replaced with the conflict marker size (see below).

The merge driver is expected to leave the result of the merge in the file named with %A by overwriting it, and exit with zero status if it managed to merge them cleanly, or non-zero if there were conflicts.
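The remaining piece, not shown in that excerpt, is telling git which paths use the driver via a .gitattributes entry, e.g.:

    *.json  merge=filfre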


For Word documents I had some luck storing the unzipped contents of the file (since a DOCX is mostly XML files in a ZIP container). My approach was automating the zip/unzip process (and some cleanup steps) pre-commit and post-checkout. https://github.com/WorldMaker/musdex

That said, specifically for Word files your best bet might be to launch Word's own compare-files GUI as a merge engine, but I had several reasons at the time to explore a "container destructuring tool" for source control.
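The destructuring core is small; roughly this (a sketch of the idea rather than musdex itself, with invented paths and the cleanup steps omitted):

    import pathlib, zipfile

    def explode(docx_path, out_dir):
        """Pre-commit: unzip the DOCX so git tracks its XML parts as text."""
        with zipfile.ZipFile(docx_path) as z:
            z.extractall(out_dir)

    def implode(src_dir, docx_path):
        """Post-checkout: zip the tracked parts back into a working DOCX."""
        src = pathlib.Path(src_dir)
        with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as z:
            for p in sorted(src.rglob("*")):
                if p.is_file():
                    z.write(p, p.relative_to(src))

    explode("report.docx", "report.docx.d")
    implode("report.docx.d", "report.docx")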


https://www.dolthub.com/ ?

Never used it myself, but it's one of the many "git for data" projects I've seen go past. I'm interested that none of them meet the author's needs.


For something billed as decentralized, Dolt is really not good at getting the message across. I can't find anything in the documentation about how to get the collaboration/repository features working without DoltHub, which seems to be their centralized service.

One of the neat things with Git and others is that it doesn't really care what type of endpoint you're fetching data from: you can `git clone` from a Unix directory, an NFS-shared directory, over ssh, via Keybase, and so on. Dolt doesn't seem to support anything other than DoltHub (if it does, they really need to update the documentation).


Where did you look? I clicked the "GitHub" link at the top, which is the standard way to bypass marketing fluff and get to an actual program. The README has instructions for installing from source, verifying the installation, configuring, creating an empty local repo, adding data to it, etc. It barely mentions DoltHub.


I think you misunderstood my message. I'm not asking how to install Dolt; I'm asking where the whole "decentralized" feature set is described and implemented in Dolt. I've scoured the docs, both the marketing page and the GitHub repository, and the collaboration features seem to be locked to DoltHub, which is very different from, for example, git.

So I create a local repo with dolt, and now I want someone else to collaborate on this dataset with me, without using DoltHub. How do I do this?

The article in the submission has "decentralized" as the first requirement, so I assume dolt is mentioned here because it's somehow decentralized. But I cannot for the life of me find where that's mentioned.


> So I create a local repo with dolt, and now I want someone else to collaborate on this dataset with me, without using DoltHub. How do I do this?

The best I could find is utils/remotesrv in the repository [0], which seems to enable collaboration, but it also seems a bit basic and it's not clear if https is supported.

[0] https://github.com/liquidata-inc/dolt/tree/master/go/utils/r...


Seems to confirm my suspicion that dolt is not made with decentralization in mind and that its collaboration is mostly tied to DoltHub and a specific protocol that currently only works over http. Thanks for digging that out!


Sounds like that provides exactly what OP was looking for (unless he just wants to mess with the problem himself).

The Dolt project seems poorly named to me. I know that someone always says that, and usually someone else points out that 'git' is a pejorative in British slang. Whatever.

They want you to curl something to install it. I browse w/ JS disabled so all I can get is:

    sudo curl -L https://github.com/l
Thus ends my interest in this Dolt. YMMV


> Sounds like that provides exactly what OP was looking for

Please correct me if I'm wrong, but it seems dolt clearly isn't what OP is looking for. How is dolt decentralized at all? See this comment thread where we try to figure that out: https://news.ycombinator.com/item?id=22848684


I'm surprised nobody has mentioned datalad/git-annex. It might be relevant. It stores references to binary data in a git repository, and depending on the backend (bup, for example) it benefits from differential compression. It is highly peer-to-peer, and has git-annex-sync[1], which will synchronize branches with its remotes.

Datalad is based on git-annex and is already being used for sharing large scientific datasets[2].

[1]: https://git-annex.branchable.com/sync/

[2]: https://www.datalad.org/


I think git-annex is pretty much the right answer if you are already a Linux-savvy person who expects a mature, battle-tested tool that integrates well with your existing Linux workflow. It sits on top of git in a sensible, discernible way.

https://www.datalad.org/for/git-users

People's non-text data is going to be some form of indiscernible binary blob to whatever system is managing it. Think about what a diff looks like for a JPEG, an Excel file, or an audio file.

http://docs.datalad.org/en/latest/metadata.html

What about git-annex vs git-lfs?

1. """LFS Test Server is an example server that implements the Git LFS API. It is intended to be used for testing the Git LFS client and is not in a production ready state."""

2. git-lfs seems to want to store your files in the GitHub/Microsoft cloud; it's not really ready to be deployed inside an existing system's workflow. If you just need to check a tickbox and assume Atlassian/Microsoft has you covered, it might be a good choice.

3. git-annex has a ton of integrations with multiple backends:

http://www.chiark.greenend.org.uk/doc/git-annex/html/special...

4. Poke around the git-annex website. I think you'll see a tool that's being used in a variety of workflows successfully for a decade plus. That's not true of really any of the other tools others have noted in this thread.


I've worked with both over the last few years. Git LFS is really not p2p like git-annex is, and it is less flexible, but it is maybe more stable, and git works with it out of the box.

You really have to pay for specialized LFS hosting, and you can't remove hosted LFS objects. And while it is possible to download only some LFS objects, it is not easy or intuitive. Once a repo gets beyond a certain size, it can be really impractical.

Git annex lets you fairly easily get only the objects you want. And it is strongly p2p. You can just set up another repo as a remote, and git-annex-copy whatever files you want directly over ssh. Or you can use Google Drive or Dropbox or any number of other hosting services that git-annex knows about, and stash your files there. Sibling repos will be updated about where they can find the objects. And it is really easy to push your binaries to as many backends as you want.

Finally, you can find massive multi-terabyte repos of medical image data[1]. There's no way that LFS could handle repos like these. It is really easy to fork one of these data repos, change it, put your changed files into your own backend, and then reshare it.

[1]: http://datasets.datalad.org/


Also of note (from my understanding) is that a git-lfs local repo is double the size of the underlying data because it doesn't hardlink.


You may want to have a look at qri (pronounced "query") - https://qri.io (disclosure: I work there).

Free & open-source tools for dataset versioning built on IPFS (the distributed web). Qri datasets contain commit histories, immutable hashes for each version (commit), schema info, metadata, readmes, & transform scripts, all of which ride together with the data (or, body).

Latest versions of our CLI tools support SQL & version diffing. We also have an electron app, Qri Desktop (https://qri.io/desktop)


PgUp PgDown Home and End keys don't work for me on your site. (I'm using Firefox 75.)


Thanks for letting us know. I opened an issue here which we will address as soon as possible: https://github.com/qri-io/website/issues/202


Cheers!


Is what they want just the same thing as a Conflict-free Replicated Data Type (CRDT)?

https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...

The only thing on their list of requirements that might not be covered is the last item, "collaborative". (If replication is a solved problem, it removes some obstacles to that, even though it doesn't give you collaboration for free.)
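For a taste of why that helps, here is the simplest CRDT, a grow-only set, whose merge is just set union, so replicas can exchange state in any order and still converge:

    class GSet:
        """Grow-only set CRDT: merge = union, so merges commute and are idempotent."""
        def __init__(self, items=()):
            self.items = set(items)

        def add(self, x):
            self.items.add(x)

        def merge(self, other):
            return GSet(self.items | other.items)

    a, b = GSet({"foo"}), GSet({"bar"})
    a.add("baz")
    assert a.merge(b).items == b.merge(a).items == {"foo", "bar", "baz"}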


https://irmin.org/ seems like it may fit the bill:

> A distributed database built on the same principles as Git

I haven't used it myself, but it has been around for years and seems pretty mature.


>A recent paper suggested a new mathematical point of view on version control. I first found out about it from pijul, a new version control system (VCS) that is loosely inspired by that paper.

https://jneem.github.io/merging/


I think this is quite relevant to the problem.

Whatever your data is, you need a way to merge it, but the format of allowable patches and the algorithms used must surely depend on the data structure.

Imagine first a set of strings. This may be merged more easily than the list of lines a dvcs usually deals with but you still get conflicts (I’m not sure what the merge of “add foo” and “add foo; remove foo” should be. It could reasonably be “add foo; remove foo” or a conflict of maybe adding foo)
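One possible policy (not the only one), sketched as code: treat each side as a patch of add/remove ops and surface exactly that ambiguity as a conflict:

    def merge_patches(ours, theirs):
        """Merge two patches over a set; ops are ('add', x) or ('remove', x).
        If both sides touched the same element with different op sequences,
        report a conflict rather than silently picking a winner."""
        def by_element(ops):
            grouped = {}
            for op, x in ops:
                grouped.setdefault(x, []).append(op)
            return grouped
        o, t = by_element(ours), by_element(theirs)
        merged, conflicts = [], []
        for x in set(o) | set(t):
            if x in o and x in t and o[x] != t[x]:
                conflicts.append(x)              # e.g. ['add'] vs ['add', 'remove']
            else:
                merged += [(op, x) for op in (o.get(x) or t.get(x))]
        return merged, conflicts

    merged, conflicts = merge_patches([("add", "foo")],
                                      [("add", "foo"), ("remove", "foo")])
    assert conflicts == ["foo"]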

Now imagine some kind of recursive structure (say a unix filesystem structure, except the files are all empty). It seems hard to come up with a model for patches that will work well with things like removing directories or moving subtrees. Are these different patches?

- Move the contents of a directory to another, newly created directory, then delete the first directory.

- Move a directory to a new place.

How would these merge with creating a new file under the directory before things are moved?

For a set of relations, you want merging to preserve properties of primary and foreign keys. I’m not sure what merging that would look like.

These all sound like interesting hard problems. I’m not so convinced that their solution would be particularly useful.


> Imagine first a set of strings. This may be merged more easily than the list of lines a dvcs usually deals with but you still get conflicts (I’m not sure what the merge of “add foo” and “add foo; remove foo” should be. It could reasonably be “add foo; remove foo” or a conflict of maybe adding foo)

My imagination is pretty confused by the application of a binary operator with only one argument. Would you please clarify? I think you'll find that when you explicitly think about the unstated arguments the problem makes sense, but I could be misunderstanding your point.


The point, as per the article mentioned in the parent comment, is to talk about merging patches rather than snapshots. This allows you to force certain nice properties of merges like associativity.

Sometimes you need information about deleted things in your patch to be able to merge correctly, though I don't have an example off the top of my head.



That's exactly what we have developed at bohr.app (https://bohr.app).

As far as I can tell, our solution checks all the requirements mentioned in the post. We have a decentralized data storage system that supports delta syncing, sends/receives data over encrypted P2P communications between multiple devices/users without a "central" data storage, and leverages some concepts taken from blockchain technology to ensure data integrity and immutability.


A few projects I'm involved with that might be worth sharing here:

- Redwood (https://github.com/brynbellomy/redwood), a realtime, p2p database. Data is structured in state trees that evolve over time. Merge algorithms are configurable at any point in a given state tree, but the default is a CRDT algorithm called "sync9".

- Braid (https://github.com/braid-work and https://braid.news), a draft IETF spec that specifies certain extensions to HTTP that make it much easier to build systems like Redwood. The Braid spec is under active development on Github, and we welcome input from anyone interested in the idea.

- Axon (http://axon.science and https://github.com/AxonNetwork), marketed as a platform for making it easier to collaborate on scientific research, but under the hood it's basically just some extensions to git that allow you to push commits over a peer-to-peer network bound together by a DHT.

I would also highly recommend getting involved in the Internet Archive's DWeb meetups and conferences, where you'll find hundreds of people interested in solving exactly these kinds of problems.


I went to a meetup on TerminusDB. It seemed like a cool project and quite mature. https://terminusdb.com/


I'm using "topic-based versioning"[0] and it solves most of the issues presented here. The principles are quite similar to what you would find in a database, with a central data repository and a write-ahead log.

Each "table" is organized in ordered, write-only topics (or folders) containing immutable messages (or files). Each operation is also adding a message into a special topic we use as log. Each participant have to remember the message id ("cursor") at which it was on the last sync and fetch new messages on the log first. Then, it's only a matter of applying the new ops until the desired point-in-time.

- decentralized, private, efficient: you own a complete copy of all versioned datasets, and you can store deltas between versions since all messages are ordered in topics. Each sync is done like a "git pull --rebase".

- reliable: If you follow the principles, conflicts are not possible.

- collaborative: Here's the slightly difficult requirement. You can choose to defer message id ("cursor") allocation to another distributed service (zookeeper, for example) and accept having to be online to register new write operations on a topic; or force the system to be linearizable.

[0]: I wrote an article describing this design pattern here: https://medium.com/bcggamma/topic-based-versioning-architect... Any feedback welcome!
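For illustration, a tiny sketch of the sync loop in Python (not our actual implementation; the names and op format are invented):

    class Topic:
        """Ordered, append-only list of immutable messages."""
        def __init__(self):
            self.messages = []

        def append(self, msg):
            self.messages.append(msg)
            return len(self.messages) - 1        # message id

    class Participant:
        def __init__(self, log):
            self.log, self.cursor, self.state = log, 0, {}

        def sync(self):
            """Apply every op written to the log since the last sync."""
            for kind, key, value in self.log.messages[self.cursor:]:
                if kind == "put":
                    self.state[key] = value
                elif kind == "delete":
                    self.state.pop(key, None)
            self.cursor = len(self.log.messages)

    log = Topic()
    alice, bob = Participant(log), Participant(log)
    log.append(("put", "color", "blue"))
    alice.sync(); bob.sync()
    assert alice.state == bob.state == {"color": "blue"}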


Any immutable database? Datomic is one example.

On a related note: the idea of a database that loses data is perplexing; why would anyone want that? Why isn't data retention the default, with limiting that retention an opt-in choice? When was the last time you rewrote your git repo to delete some old commits that you no longer need?


A proper "DVCSD" would help a lot with software packaging, configuration packaging and deployment, immutable infrastructure.

It could allow fine-grained control and vetting of code and configuration.

(and overcome the security disasters called containers and "configuration management")

It's sad that few people understand this.


I'd like to see a system like this that treats the type system (aka schema) for the data as just more data and puts that under DVCS too.

Then I'd like to add a robust security/permissions model and build an OS around it.


You might be interested in this: https://www.categoricaldata.net/. Categorical Query Language (CQL).


I’ll take a look, thanks!


I'm sure you can find examples of that... on thedailywtf.com sadly.


Wasn't this IBM Lotus Notes?


Why use the past tense? It is still being sold and used.


I'm a little concerned that the author is basing their approach on Pijul, given that its data loss issues are exactly what you wouldn't want for the "reliable" part of their priorities, but hopefully that's just an issue with Pijul's implementation of the ideas. It seems plausible that the major pitfalls are in the code rather than in the theory itself.


It's super unhelpful to write a post like this while showing very little evidence that you Googled to see whether the thing you want already exists. What about https://docs.dat.foundation/ or everything else in these comments?


I like Dat, but right now it fails on the author's reliability criterion. And it's not clear[1] whether multi-writer is still a work in progress, which would mean a failure on "collaborative", too.

1. Looks like it's still not finalized, but it might be—and the uncertainty about the answer is a failure in and of itself.


Many projects[1][2][3][4] in the dat ecosystem use kappa-core[5] for multi-user applications on top of hypercores (the low-level append-only log used by dat). kappa-core is designed around the kappa architecture where the logs serve as the historical record and primary source of truth (so you get version control) and materialized views ingest the logs to answer application-specific queries.

Some nice properties of the kappa-core implementation of this architecture are that:

* works fully offline by default and many of the apps (especially mapeo) are designed for very offline use

* device to device replication is first-class

* you can completely rebuild the materialized views from the log data whenever your schema changes (very nice way of handling database migrations)

* there's a lot of flexibility in how you design the materialized views and an ecosystem of views on npm you can use instead of writing your own crdts

* works in the browser

There is also some progress in the ecosystem for sparse mode where content downloads from other feeds are driven by application-specific queries.

There is a kappa-core workshop[6] that covers some of the introductory topics.

[1]: https://cabal.chat/

[2]: https://mapeo.world/

[3]: https://cobox.cloud/

[4]: https://arso.xyz/

[5]: https://github.com/kappa-db/kappa-core

[6]: https://kappa-db.github.io/workshop/build/01.html
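kappa-core itself is JavaScript, but the heart of the pattern fits in a few lines of (hypothetical) Python: the append-only log is the source of truth, and a materialized view is just a fold over it that you can throw away and rebuild whenever the schema changes:

    # Hedged sketch of the kappa pattern, not the kappa-core API.
    log = []                                     # append-only source of truth

    def append(entry):
        log.append(entry)

    def build_view(log, reducer, initial):
        """Rebuild a materialized view by replaying the whole log."""
        state = initial
        for entry in log:
            state = reducer(state, entry)
        return state

    # Example view: latest value per key (easy to change and rebuild later).
    latest = lambda state, e: {**state, e["key"]: e["value"]}

    append({"key": "title", "value": "draft"})
    append({"key": "title", "value": "final"})
    assert build_view(log, latest, {}) == {"title": "final"}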


Dat multi-writer already exists, here's a demo of it: https://dat-shopping-list.glitch.me/


Unfortunately that demo was built on a protocol prototype which was deprecated.

Multiwriter is still being worked on, but it's not in the current release schedule. The upcoming release focuses on performance, scaling, reliability, and "mounts" (effectively symlinks across drives). Mounts can be used to create a kind of multiwriter, as the mounts stay in the control of the author, but we don't yet have "multiple authors of a shared folder."

In user-space, I've been able to create unioned folders like Plan 9 did, which is a _serviceable_ multiwriter scheme. A more sophisticated eventually-consistent approach would use a vector clock in file metadata to track revisions, but it would need some approach to tombstones, which I don't have a solution for yet. It's solvable; it'll just take time and performance tuning.


I see, thank you


You can take a look at this talk: https://www.youtube.com/watch?v=DEcwa68f-jY. It describes how to build a dapp with sqlite + CRDT


What about Noms?

"The versioned, forkable, syncable database"

https://github.com/attic-labs/noms


Unfortunately development has stalled out

https://github.com/attic-labs/noms/blob/master/README.md#sta...

> Nobody is working on this right now. You shouldn't rely on it unless you're willing to take over development yourself.


FWIW, we (replicache.dev) have begun working on it again. We're part of the original team that built Noms.

Unclear what the roadmap for Noms will be yet, which is why we've not updated the README.


Nice catch, thanks!


I also want a widely adopted purely functional distributed data structure in the Okasaki style. Persistence is wonderful.


Doesn’t this map nicely onto functional data structures, where each new “version” is just a new top level reference?


The hard part is merging different changes.


How would one merge updates from all those offline clients?


git-lfs+torrents (over SSH if you need authentication?)


IDGI, just store the structured data in Git.



