This is one of my favorite pieces of software engineering because it took something difficult and made ease of understanding a main criterion for success. The PhD thesis has a lot more info about this if anyone is curious; it is approachable and easy to read:
Weirdly it's also kinda worse-is-better: Raft is non-deterministic and there is no upper bound on how long an election cycle can take. IIRC:
- it assumes no hysteresis in network latencies, and if there is hysteresis, elections can deterministically run forever (see the timeout sketch after this list).
- this fact, combined with the use of Raft in production, has caused real, large-scale network outages.
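For anyone who hasn't implemented it, here's roughly where the nondeterminism lives. A minimal Go sketch of Raft's randomized election timeout, assuming a hypothetical heartbeat channel and startElection hook (the 150-300ms range is just the example range from the paper):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout draws a fresh random timeout each time. Randomness usually
// breaks ties between candidates, but only probabilistically: split votes can
// recur, so there is no deterministic bound on total election time.
func electionTimeout() time.Duration {
	const lo, hi = 150 * time.Millisecond, 300 * time.Millisecond
	return lo + time.Duration(rand.Int63n(int64(hi-lo)))
}

// followerLoop is a hypothetical follower: each AppendEntries heartbeat
// re-arms the timer; if none arrives in time, it becomes a candidate.
func followerLoop(heartbeat <-chan struct{}, startElection func()) {
	for {
		select {
		case <-heartbeat:
			// leader is alive; loop around and re-arm with a new random timeout
		case <-time.After(electionTimeout()):
			startElection() // votes may split, and the cycle repeats
		}
	}
}

func main() {
	hb := make(chan struct{})
	go followerLoop(hb, func() { fmt.Println("timed out; starting election") })
	time.Sleep(time.Second) // no heartbeats arrive, so a few elections fire
}
```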
Paxos is of course a beast and hard to understand. There is an alternative, VSR (Viewstamped Replication, developed around the same time as Paxos), which is easy to understand and does not have the issues caused by election nondeterminism in Raft.
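For contrast, a minimal sketch of why VSR sidesteps the randomness: the primary for a view is a pure round-robin function of the view number over the configuration, so a view change never needs random timeouts to break ties (names here are illustrative):

```go
package main

import "fmt"

// primaryFor is deterministic: every replica that knows the view number
// agrees on who the primary is, with no election and no randomness.
func primaryFor(view uint64, config []string) string {
	return config[view%uint64(len(config))]
}

func main() {
	config := []string{"a", "b", "c"}
	for view := uint64(0); view < 5; view++ {
		fmt.Printf("view %d -> primary %s\n", view, primaryFor(view, config))
	}
}
```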
> - this fact, combined with the use of Raft in production, has caused real, large-scale network outages.
While this has surely happened, I am not so confident about the reasons for it. If you've got links with details, I'd love to read them.
> which is easy to understand
I've implemented core bits of Raft twice now and have looked at VSR a couple of times, and VSR wasn't easier for me to understand. I'm sure I could implement VSR and would like to some day, but comparing the papers alone, I personally felt Raft was better presented (i.e. easier to understand).
Also keep in mind that nobody ships consensus implementations exactly in line with the original paper. There are dozens or hundreds of papers on variations and extensions of Raft/Paxos and every actual implementation is going to implement some selection of these extensions/variations. You have to look at each implementation carefully to know how it diverges from the original paper.
> A known limitation of the base Raft protocol is that partial/asymmetric network partitions can cause a loss of liveness [27, 32]. For instance, if a leader can no longer make progress because it cannot receive messages from the other nodes, it continues to send AE heartbeats to followers, preventing them from timing out and from electing a new leader who can make progress.

(Howard, Abraham et al.)
Me: note this can also occur without a complete outage, if the latency back to the shit leader differs from the latency out of the shit leader.
> nobody ships consensus implementations exactly in line with the original paper. There are dozens or hundreds of papers on variations
As the paper above explains, once you add extensions you may have broken Raft's correctness proofs. More to the original point, you're now in a state where it's no longer "simple"... I would go so far as to say that if you have to consider the extensions, which are spread across several papers and sometimes aren't papers at all, you're in "deceptively simple" land.
As a pedagogical tool, Raft is valuable because it can be a launching ground for conversations like these... But maybe we shouldn't use it in prod when there are better, straightforward options? I get the feeling that the hard sell on simplicity nerd-sniped devs into writing it, someone r/verysmart put it into prod, social proof brought more people along, and now here we are.
> A known limitation of the base Raft protocol is that partial/asymmetric network partitions can cause a loss of liveness [27, 32]. For instance, if a leader can no longer make progress because it cannot receive messages from the other nodes, it continues to send AE heartbeats to followers, preventing them from timing out and from electing a new leader who can make progress.
Real-world Raft implementations make the leader step down if it hasn't heard from a quorum for a while. That's not part of vanilla Raft, though.
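Roughly what that looks like, as a minimal sketch with made-up names (lastAck, tick, stepDown); etcd's raft library ships a version of this idea as its CheckQuorum option:

```go
package main

import (
	"fmt"
	"time"
)

// leader tracks when each follower last acknowledged an AppendEntries.
type leader struct {
	lastAck map[string]time.Time
	timeout time.Duration // election timeout
}

// quorumAlive reports whether a majority of the cluster (the leader plus
// its followers) has been heard from within one election timeout.
func (l *leader) quorumAlive(now time.Time) bool {
	alive := 1 // count the leader itself
	for _, t := range l.lastAck {
		if now.Sub(t) < l.timeout {
			alive++
		}
	}
	return alive > (len(l.lastAck)+1)/2
}

// tick runs periodically. A leader that can send but not receive steps
// down instead of pinning followers with heartbeats forever, which frees
// a reachable majority to elect a leader that can make progress.
func (l *leader) tick(now time.Time, stepDown func()) {
	if !l.quorumAlive(now) {
		stepDown()
	}
}

func main() {
	now := time.Now()
	l := &leader{
		lastAck: map[string]time.Time{
			"b": now.Add(-time.Second), // stale: no ack for a full second
			"c": now.Add(-time.Second),
		},
		timeout: 150 * time.Millisecond,
	}
	l.tick(now, func() { fmt.Println("lost quorum; stepping down") })
}
```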
I think the CALM theorem and this whole line of research are so interesting, and the work is still carried on by the CRDT people. But I would love to see more of it; I feel like it doesn't get as much attention as it deserves.
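For anyone who hasn't bumped into it: CALM roughly says that monotone programs stay consistent without any coordination, and CRDTs are the most visible embodiment. A minimal grow-only counter sketch (names illustrative):

```go
package main

import "fmt"

// GCounter is a grow-only counter CRDT: replica ID -> local increment count.
type GCounter map[string]uint64

func (c GCounter) Inc(replica string) { c[replica]++ }

// Merge takes the element-wise max. It is commutative, associative and
// idempotent (a monotone join), so replicas can gossip states in any order
// and still converge, with no consensus round anywhere.
func (c GCounter) Merge(other GCounter) {
	for id, n := range other {
		if n > c[id] {
			c[id] = n
		}
	}
}

func (c GCounter) Value() (sum uint64) {
	for _, n := range c {
		sum += n
	}
	return
}

func main() {
	a, b := GCounter{}, GCounter{}
	a.Inc("a")
	a.Inc("a")
	b.Inc("b")
	a.Merge(b) // deliver b's state to a
	b.Merge(a) // and a's state to b
	fmt.Println(a.Value(), b.Value()) // both print 3
}
```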
> this fact, combined with the use of Raft in production, has caused real, large-scale network outages.
Paxos as well. I remember a full-cloud GCP outage that had something to do with Paxos, and I can't find the data on it, but I thought there was also a nasty bug in ZooKeeper's Paxos-style implementation (Zab, strictly speaking).
That isn't to say any of these are perfect or bug-free; they're made by humans and we're going to make mistakes. But my experience implementing both was that I ended up with a working Raft implementation, while Paxos baked my brain until I gave up.
I think everyone uses Raft _because_ it was possible for a working dev to implement, so there are a number of implementations, and it's easier to understand which phase the application is in.
My ZooKeeper outages are always due to simple things, like all the workers having the same base image and an increased write rate filling up the disks of all of them at the same time.
I've made a longer comment in another thread, but have you investigated the hashgraph algorithm? Gossip-about-gossip and virtual voting combine to give leaderless consensus, fair ordering, and aBFT. It's very performant, with over 10k TPS on-chain. I'm learning about DLT from the perspective of hashgraph, which is why I don't understand why it doesn't get love: it seems to have all of the good and none of the bad.
Hashgraph seems like snake oil, and it's marketed in similar ways. I couldn't find any good paper with enough detail to build a production-ready implementation.
https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pd...
I think this was core to Raft’s success, and I strive to create systems like this with understandability as a first goal.