It's a classic topic of study in the field of distributed systems. The overarchi...

It's a classic topic of study in the field of distributed systems.

The overarching goal is consensus between agents (e.g servers) over the state of the system.

For example, suppose that you want to achieve atomic consistency (i.e at any point in time the committed/persistent state of the system is homogeneous across all live nodes) in a private cluster replicating a database. The incoming requests are routed randomly to any node in your cluster. You have a synchronization mechanism (a consensus algorithm - like any-Paxos, Raft, Viewstamped Replication etc.) that lets you get this cluster of machines to agree on the state of the database. This is useful if you want fault-tolerance. The synchronization part is the consensus. The agents have to agree on the state-transition.

Now, observe two things:

1. This is a private cluster you own. In other words, you can make all the machines run the same software, trust each other, and assume they act on a best effort basis.

2. The failure model is "fail-stop". If a machine encounters some kind of failure, it won't exhibit arbitrary behavior but rather crash immediately. And you do not consider the case where a machine is taken over and tries to mess with the rest of the cluster in order to violate the desired safety guarantee (atomic consistency).

Obviously, these works if you are a privately own company and can spend resources hardening the parts of your network that are susceptible to be taken over. In a private setting, these assumptions are sufficient even though they are weak and optimistic. In practice, this is not so much an issue.

This falls apart however when you are in an adversarial setting and when your cluster is an open and public network that anyone can join or leave at any point in time. In that model, malicious actors can also show-up and exhibit/fake any kind of failure they want (e.g re-order messages, lie on their internal state, attempt to spoof other agents etc.). In that case, we say the fault-tolerance model is Byzantine. We go one level above in difficulty from the comfortable fail-stop assumption.

That's the gist of what Byzantine fault-tolerance is about. It defines the ability of a distributed system to resist a subset of its network trying to break consensus by any means possible as long as a sufficient majority remains honest.

For years, the major result on Byzantine Fault-Tolerance (BFT) had been a system called PBFT which although innovative and first of its kind, suffered several limitations. One of them was that any node in the network needed to have a complete list of the other nodes in the network, another was high-overhead, and communication complexity.

The innovation of Bitcoin cannot be explained without providing full-context of many different topics (defining asynchronicity, FLP result etc.) but the crux of it is that it offered a way to circumvent these limitations by massaging some of the requirements and cleverly aligning incentives such that you get a system where: 1/ safety is preserved even in an adversarial setting 2/ anyone can join/leave 3/ you get some degree of asynchronicity, 4/ many other things.

I realize this cannot be a satisfactory or exhaustive exposition of all the innovation that Bitcoin introduces but that would turn this comment into x10 its current length.