I think Leslie Lamport asserted that Paxos is minimal, and that "all other consensus algorithms are just Paxos with more steps". I'm inclined to believe him.
I've implemented Paxos but I can't get through "Raft for dummies" style blog posts.
Regarding Raft [1]:
> The consensus problem is divided into three sub-problems: Leader election, Replication and Safety.
What is leader election? It's a distributed system coming to consensus on a fact (i.e. who the leader is.) Then once you have the leader, you do additional steps. The entirety of Paxos is a distributed system coming to consensus on a fact.
When I read these posts, I see things like "timeout" and "heartbeat", and I think: timeout according to whom? I read "once the leader has been elected", um, hang on, according to whom? Has node 1 finally agreed on the leader, just while node 3 has given up and started another election? I don't doubt that Raft is correct, but the writing about it seems simple only because it glosses over details.
Paxos, on the other hand, seems timeless. (And the writing about it doesn't trigger my "distributed system fallacies" reaction)
I tend to agree that many explanations of Raft don't get into the useful details and handwave some of the hard problems. But the original paper does a good job of covering them and is pretty accessible to read IMO.
> I read "once the leader has been elected", um, hang on, according to whom? Has node 1 finally agreed on the leader, just while node 3 has given up and started another election?
The simple response I think to "according to whom" is "the majority of voting nodes". When the leader assumes its role, it sends heartbeats which are then accepted by the other nodes in the cluster. Even if (in your example) node 3 starts a new election, it will only succeed if it can get a majority of votes. If node 2 has already acknowledged a leader, it won't vote for node 3 in the same term.
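To make the "one vote per term, majority wins" rule concrete, here is a minimal sketch of how a node might decide whether to grant a vote. The names (`Node`, `handle_request_vote`, `wins_election`) are made up for illustration, and the sketch deliberately leaves out the log up-to-date check that real Raft also applies before granting a vote.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical, stripped-down Raft voter state (no log, no timers)."""
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None

    def handle_request_vote(self, candidate_id: int, term: int) -> bool:
        # A candidate from an older term is always rejected.
        if term < self.current_term:
            return False
        # A newer term resets the vote: votes are tracked per term.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # At most one vote per term; repeating the same grant is harmless.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def wins_election(votes: int, cluster_size: int) -> bool:
    # A candidate needs a strict majority of the whole cluster.
    return votes > cluster_size // 2
```

In a three-node cluster, once node 2 has voted for node 1 in term 1, node 3 cannot collect node 2's vote in that term. It has to start a new election with a higher term to try again.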
There are some implicit concessions there around eventual consistency, but I don't think that's novel to Raft compared to other distributed consensus protocols.
> The simple response I think to "according to whom" is "the majority of voting nodes".
Reminds me of this one time we had a Raft cluster arguing over who was the leader for 20 minutes in production. Raft leader election is non-deterministic, while Paxos is deterministic. It can 'randomly' get into a situation it cannot resolve for quite a long time.
> Reminds me of this one time we had a Raft cluster arguing over who was the leader for 20 minutes in production
That's certainly an interesting failure mode. Do you recall the details around root cause? I could imagine ephemeral network partitions (flapping interfaces? peering loss?) causing something like this for sure.
In my own experience, I've been running services that use Raft under the hood for the last ~10 years in production and haven't seen this happen myself. Though I do absolutely remember having misconfigured election timeouts causing very painful latency issues in failover scenarios.
Ah, interesting. That sort of split voting is indeed very bad luck, potentially a config-specific issue, or just a cluster that's seeing a catastrophic partition failure between every node.
In canonical Raft, assuming no partition failures, this could only happen if every node's election timeout triggered at roughly the same time and they all became candidates simultaneously. For this state to persist (assuming short election timeouts and short heartbeat intervals), you have to get _really_ unlucky.
In terms of probabilistic likelihood though, this is about as likely as the live-lock issue in Paxos in which multiple proposals with differing proposal IDs are made at the same time. You'd see a similar delay in consensus in that scenario as well. Obviously MultiPaxos handles this with a separate leadership algorithm which makes that outcome much less likely, but the same types of strategies common in those systems to mitigate contention issues can be used in Raft as well (randomized backoffs, for example).
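As a rough illustration of why randomization helps, here is a small sketch of randomized election timeouts. The 150-300 ms range is only an example interval, and `election_timeout_ms`, `first_candidate`, and `min_gap_ms` are hypothetical names, not part of any Raft library. The idea is that a split vote is only a risk when the two earliest timeouts fire too close together for the first candidate's vote requests to arrive.

```python
import random
from typing import Dict, Optional


def election_timeout_ms(rng: random.Random, low: float = 150.0, high: float = 300.0) -> float:
    # Each node independently picks its own timeout from the interval.
    return rng.uniform(low, high)


def first_candidate(timeouts: Dict[int, float], min_gap_ms: float) -> Optional[int]:
    """Return the node whose timer fires first, or None when the runner-up
    fires within min_gap_ms of it (assumed here to mean a split-vote risk)."""
    ordered = sorted(timeouts.items(), key=lambda item: item[1])
    (first_id, first_t), (_, second_t) = ordered[0], ordered[1]
    return first_id if second_t - first_t >= min_gap_ms else None


rng = random.Random()
timeouts = {node_id: election_timeout_ms(rng) for node_id in (1, 2, 3)}
winner = first_candidate(timeouts, min_gap_ms=10.0)
```

Even when `winner` comes back as `None`, every node draws a fresh timeout for the next round, so repeated collisions get less likely each time. Very short or very narrow timeout ranges make collisions much more probable.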
Yeah, IIRC, we updated the configuration some. I don't remember what specifically, but now that you mention short timeouts, I vaguely remember that coming up as a problem.
100% agree. I haven't read the Raft paper in years, but I remember thinking there's just too much stuff in there. That stuff is important, but if you want people to understand what's happening, they need to internalize the fundamental idea of being able to block other writers by bumping a number. That is all covered in the single-decree Paxos section of "The Part-Time Parliament".
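For anyone who hasn't seen it written down, "blocking other writers by bumping a number" looks roughly like this on the acceptor side of single-decree Paxos. This is a minimal in-memory sketch with made-up names (`Acceptor`, `prepare`, `accept`), with no networking or persistence, not a full implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class Acceptor:
    """Hypothetical single-decree Paxos acceptor, kept in memory only."""
    promised: int = -1                          # highest proposal number promised
    accepted: Optional[Tuple[int, Any]] = None  # (number, value) last accepted

    def prepare(self, n: int) -> Tuple[bool, Optional[Tuple[int, Any]]]:
        # Phase 1: a higher number "bumps" the promise and shuts out
        # every proposer still using a lower number.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: Any) -> bool:
        # Phase 2: only proposals at or above the promised number get in.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False
```

A proposer that prepared with number 1 finds its `accept(1, ...)` rejected as soon as someone else has prepared with number 2. The previously accepted value returned by `prepare` is what forces a later proposer to carry an already-chosen value forward.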
Paxos is nice, sure, but Hedera does DLT with aBFT and is much more efficient, as well as being faster and ensuring fairness. It's leaderless, and achieves incredible TPS (10k+ in practice, 100k+ in theory).
[1] https://www.brianstorti.com/raft/