This is one of my favorite pieces of software engineering because it took something difficult and made ease of understanding a main criterion for success. The PhD thesis has a lot more info about this if anyone is curious; it is approachable and easy to read:
Weirdly it's also kinda worse-is-better: Raft is non-deterministic and there is no upper bound on how long an election cycle can take. IIRC:
- it assumes no hysteresis in network latencies, and if there is hysteresis, elections can deterministically run forever (see the timeout sketch after this list).
- this fact, combined with the use of Raft in production, has caused real, large-scale network outages.
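For anyone who hasn't implemented it, here's roughly where the nondeterminism lives. A minimal Go sketch of Raft's randomized election timeout, assuming a hypothetical heartbeat channel and startElection hook (the 150-300ms range is just the example range from the paper):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout draws a fresh random timeout each time. Randomness usually
// breaks ties between candidates, but only probabilistically: split votes can
// recur, so there is no deterministic bound on total election time.
func electionTimeout() time.Duration {
	const lo, hi = 150 * time.Millisecond, 300 * time.Millisecond
	return lo + time.Duration(rand.Int63n(int64(hi-lo)))
}

// followerLoop is a hypothetical follower: each AppendEntries heartbeat
// re-arms the timer; if none arrives in time, it becomes a candidate.
func followerLoop(heartbeat <-chan struct{}, startElection func()) {
	for {
		select {
		case <-heartbeat:
			// leader is alive; loop around and re-arm with a new random timeout
		case <-time.After(electionTimeout()):
			startElection() // votes may split, and the cycle repeats
		}
	}
}

func main() {
	hb := make(chan struct{})
	go followerLoop(hb, func() { fmt.Println("timed out; starting election") })
	time.Sleep(time.Second) // no heartbeats arrive, so a few elections fire
}
```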
Paxos is of course a beast and hard to understand. There is an alternative, VSR (Viewstamped Replication, developed around the same time as Paxos), which is easy to understand and does not have the issues caused by election nondeterminism in Raft.
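For contrast, a minimal sketch of why VSR sidesteps the randomness: the primary for a view is a pure round-robin function of the view number over the configuration, so a view change never needs random timeouts to break ties (names here are illustrative):

```go
package main

import "fmt"

// primaryFor is deterministic: every replica that knows the view number
// agrees on who the primary is, with no election and no randomness.
func primaryFor(view uint64, config []string) string {
	return config[view%uint64(len(config))]
}

func main() {
	config := []string{"a", "b", "c"}
	for view := uint64(0); view < 5; view++ {
		fmt.Printf("view %d -> primary %s\n", view, primaryFor(view, config))
	}
}
```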
> - this fact, combined with the use of Raft in production, has caused real, large-scale network outages.
While this has surely happened, I am not so confident about the reasons for it. If you've got links with details, I'd love to read them.
> which is easy to understand
I've implemented core bits of Raft twice now and have looked at VSR a couple of times, and VSR wasn't easier for me to understand. I'm sure I could implement VSR and would like to some day, but comparing the papers alone, I personally felt Raft was better presented (i.e. easier to understand).
Also keep in mind that nobody ships consensus implementations exactly in line with the original paper. There are dozens or hundreds of papers on variations and extensions of Raft/Paxos and every actual implementation is going to implement some selection of these extensions/variations. You have to look at each implementation carefully to know how it diverges from the original paper.
> A known limitation of the base Raft protocol is that partial/asymmetric network partitions can cause a loss of liveness [27, 32]. For instance, if a leader can no longer make progress because it cannot receive messages from the other nodes, it continues to send AE heartbeats to followers, preventing them from timing out and from electing a new leader who can make progress.

(Howard, Abraham et al.)
Me: note this can also occur without a complete outage, if the latency back to the shit leader differs from the latency out of the shit leader.
> nobody ships consensus implementations exactly in line with the original paper. There are dozens or hundreds of papers on variations
As the paper above explains, once you add extensions you may have broken Raft's correctness proofs. More to the original point, you're now in a state where it's no longer "simple"... I would go so far as to say that if you have to consider the extensions, which are spread across several papers and sometimes aren't papers at all, you're in "deceptively simple" land.
As a pedagogical tool, Raft is valuable because it can be a launching ground for conversations like these... But maybe we shouldn't use it in prod when there are better, straightforward options? I get the feeling that the hard sell on simplicity nerd-sniped devs into writing it, someone r/verysmart put it into prod, social proof brought more people along, and now here we are.
> A known limitation of the base Raft protocol is that partial/asymmetric network partitions can cause a loss of liveness [27, 32]. For instance, if a leader can no longer make progress because it cannot receive messages from the other nodes, it continues to send AE heartbeats to followers, preventing them from timing out and from electing a new leader who can make progress.
Real-world Raft implementations make the leader step down if it hasn't heard from a quorum for a while. That's not part of vanilla Raft, though.
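Roughly what that looks like, as a minimal sketch with made-up names (lastAck, tick, stepDown); etcd's raft library ships a version of this idea as its CheckQuorum option:

```go
package main

import (
	"fmt"
	"time"
)

// leader tracks when each follower last acknowledged an AppendEntries.
type leader struct {
	lastAck map[string]time.Time
	timeout time.Duration // election timeout
}

// quorumAlive reports whether a majority of the cluster (the leader plus
// its followers) has been heard from within one election timeout.
func (l *leader) quorumAlive(now time.Time) bool {
	alive := 1 // count the leader itself
	for _, t := range l.lastAck {
		if now.Sub(t) < l.timeout {
			alive++
		}
	}
	return alive > (len(l.lastAck)+1)/2
}

// tick runs periodically. A leader that can send but not receive steps
// down instead of pinning followers with heartbeats forever, which frees
// a reachable majority to elect a leader that can make progress.
func (l *leader) tick(now time.Time, stepDown func()) {
	if !l.quorumAlive(now) {
		stepDown()
	}
}

func main() {
	now := time.Now()
	l := &leader{
		lastAck: map[string]time.Time{
			"b": now.Add(-time.Second), // stale: no ack for a full second
			"c": now.Add(-time.Second),
		},
		timeout: 150 * time.Millisecond,
	}
	l.tick(now, func() { fmt.Println("lost quorum; stepping down") })
}
```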
I think the CALM theorem and this whole line of research are so interesting, and the work is still carried on by the CRDT people. But I would love to see more of it; I feel like it doesn't get as much attention as it deserves.
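For anyone who hasn't bumped into it: CALM roughly says that monotone programs stay consistent without any coordination, and CRDTs are the most visible embodiment. A minimal grow-only counter sketch (names illustrative):

```go
package main

import "fmt"

// GCounter is a grow-only counter CRDT: replica ID -> local increment count.
type GCounter map[string]uint64

func (c GCounter) Inc(replica string) { c[replica]++ }

// Merge takes the element-wise max. It is commutative, associative and
// idempotent (a monotone join), so replicas can gossip states in any order
// and still converge, with no consensus round anywhere.
func (c GCounter) Merge(other GCounter) {
	for id, n := range other {
		if n > c[id] {
			c[id] = n
		}
	}
}

func (c GCounter) Value() (sum uint64) {
	for _, n := range c {
		sum += n
	}
	return
}

func main() {
	a, b := GCounter{}, GCounter{}
	a.Inc("a")
	a.Inc("a")
	b.Inc("b")
	a.Merge(b) // deliver b's state to a
	b.Merge(a) // and a's state to b
	fmt.Println(a.Value(), b.Value()) // both print 3
}
```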
> this fact, combined with the use of Raft in production, has caused real, large-scale network outages.
Paxos as well. I remember a full-cloud GCP outage that had something to do with Paxos, and I can't find the data on it, but I thought there was also a nasty bug in ZooKeeper's Paxos-style implementation (Zab, strictly speaking).
That isn't to say any of these are perfect or bug-free; they're made by humans and we're going to make mistakes. But my experience implementing both was that I ended up with a working Raft implementation, while Paxos baked my brain until I gave up.
I think everyone uses Raft _because_ it was possible for a working dev to implement, so there are a number of implementations, and it's easier to understand which phase the application is in.
My ZooKeeper outages are always due to simple things, like all the workers having the same base image and an increased write rate filling up the disks of all of them at the same time.
I've made a longer comment in another thread, but have you investigated the hashgraph algorithm? Gossip-about-gossip and virtual voting combine to give leaderless consensus, fair ordering, and aBFT. It's very performant, with over 10k TPS on-chain. I'm learning about DLT from the perspective of hashgraph, which is why I don't understand why it doesn't get love: it seems to have all of the good and none of the bad.
Hashgraph seems like snake oil, and it's marketed in similar ways. I couldn't find any good paper with enough detail to build a production-ready implementation.
https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pd...
I think this was core to Raft’s success, and I strive to create systems like this with understandability as a first goal.