> What if the order of the two calls to `query` actually matters, but how could the language know?
The language can't "know the order" of these calls, since they are not ordered. No information is passed from one call to the other, so neither is in the other's past light cone.
If you want to impose some order, you can introduce a data dependency between the calls, e.g. by returning some sort of value from the "INSERT" call and incorporating it into the "SELECT" call (a sketch follows the list). Examples include:
- Some sort of 'response' from the database, e.g. indicating success/failure
- GHC's implementation of IO as passing around a unit value for the "RealWorld"
- Lamport clocks
- A hash of the previous state (git, block chains, etc.)
- The 'connection' value itself (most useful in conjunction with linear types, or equivalent, to prevent "stale" connections being re-used)
- Continuation-passing style (passing around the continuation/stacktrace)
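As a minimal sketch of the data-dependency idea (every name below is made up; GHC's "RealWorld" token from the list above works in the same spirit), each call can return a fresh token that the next call must consume, so the "SELECT" cannot be evaluated before the "INSERT":

    data Connection = Connection            -- placeholder connection type
    newtype DbToken = DbToken Integer       -- opaque "state of the world" token

    -- Hypothetical query function: the result is paired with a fresh token.
    query :: Connection -> DbToken -> String -> ([String], DbToken)
    query _ (DbToken n) sql = (["<rows for: " ++ sql ++ ">"], DbToken (n + 1))

    example :: Connection -> DbToken -> [String]
    example conn t0 =
      let (_, t1)   = query conn t0 "INSERT INTO t1 VALUES ('abc')"
          (rows, _) = query conn t1 "SELECT * FROM t1"  -- uses t1, so the INSERT is in its past
      in rows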
> languages that don't strictly define it make this type of thing very error prone
On the contrary, attempting to define a total order on spatially-separated events like these is very error prone. Imposing such Newtonian assumptions on real-world systems, from CPU cores to geographically distributed systems, leads to all sorts of inconsistencies and problems.
This is another example of opt-ins being better than defaults. It's more useful and clear to have no implicit order of calculations imposed by default, so that everything is automatically concurrent/distributed. If we want to impose some ordering, we can do so using the above mechanisms.
Attempting to go the other way (trying to run serial programs with concurrent semantics) is awkward and error-prone. See: multithreading.
Note that you haven't specified the database semantics either.
Perhaps the connection points to a 'snapshot' of the contents, like in Datomic; in which case doing an "INSERT" will not affect a "SELECT". In this case, a "SELECT" will only see the results of an "INSERT" if we query against an updated connection (i.e. if we introduce a data dependency!).
Perhaps performing multiple queries against the same connection causes the database history to branch into "multiple worlds": each correct on its own, but mutually contradictory. That's how distributed systems tend to work, with various consensus algorithms trying to merge these different histories into some eventually-consistent whole.
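To make the first of those alternatives concrete, here is a toy sketch of snapshot semantics (all names are made up): a connection is an immutable snapshot, and an "INSERT" yields a new snapshot rather than mutating the old one.

    newtype Snapshot = Snapshot [String]

    insertRow :: Snapshot -> String -> Snapshot
    insertRow (Snapshot rows) row = Snapshot (row : rows)

    selectAll :: Snapshot -> [String]
    selectAll (Snapshot rows) = rows

    demo :: ([String], [String])
    demo =
      let s0 = Snapshot []
          s1 = insertRow s0 "abc"
      in (selectAll s0, selectAll s1)  -- ([], ["abc"]): only the updated snapshot sees the INSERT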
PS: There is a well-defined answer in this example; since the "INSERT" query is dead code, it should never be evaluated ;)
PPS: Even in the "normal" case of executing these queries like statements, top to bottom, against a "normal" SQL database, the semantics are under-defined. For example, if 'query' is asynchronous, the second query may race against the first (e.g. taking a faster path to a remote database and getting executed first). This can be prevented by making 'query' synchronous; however, that's just another way of saying we need a response from the database (i.e. a data dependency!).
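A rough sketch of that PPS, using the `async` package and stub stand-ins for the hypothetical database API (the `Connection` type and `query` function below are placeholders):

    import Control.Concurrent.Async (async)

    data Connection = Connection
    query :: Connection -> String -> IO [String]
    query _ sql = pure ["<rows for: " ++ sql ++ ">"]  -- stub "database"

    -- Fire-and-forget asynchronous calls are free to race each other...
    racy :: Connection -> IO ()
    racy conn = do
      _ <- async (query conn "INSERT INTO t1 VALUES ('abc')")
      _ <- async (query conn "SELECT * FROM t1")
      pure ()

    -- ...whereas waiting for the INSERT's response (a data dependency) orders them.
    ordered :: Connection -> IO [String]
    ordered conn = do
      _resp <- query conn "INSERT INTO t1 VALUES ('abc')"  -- blocks until a response arrives
      query conn "SELECT * FROM t1"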
In most programming languages, the order of the two statements would be well defined, and neither would be dead code.
Trying to make all statements implicitly concurrent unless they have explicit dependencies is a terrible way to complicate your life. The fact that the code can sometimes safely be executed in other orders (by the compiler, or by the CPU) is supposed to remain an invisible optimization.
It is obvious to everyone that distributed code, eventual consistency, and other similar non-totally-ordered examples are much harder to get right than procedural code. Even simple print-based debugging/logging becomes excessively complex if you get rid of the local total ordering.
Even most network-based computing is done over TCP (or, more recently, QUIC) exactly because of how enormously useful having a total order is in practice (even if it's just an illusion/abstraction).
> PS: There is a well-defined answer in this example; since the "INSERT" query is dead code, it should never be evaluated ;)
It's only dead code if you assume Haskell's bizarre lazy execution model. In virtually every other language, unless the compiler can prove that query() has no side effects, the INSERT will be executed.
> In most programming languages, the order of the two statements would be well defined, and neither would be dead code.
There are no statements in the above examples, just expressions (composed of other expressions). Evaluation order of expressions is not well defined in "most popular languages", e.g. consider this expression:
    printSecondArgument(
      query(connection, "INSERT INTO t1 VALUES ('abc')"),
      query(connection, "SELECT * FROM t1")
    )
> Trying to make all statements implicitly concurrent unless they have explicit dependencies is a terrible way to complicate your life
I agree, that's one reason why I dislike statements (see my list in a parent comment)
> It is obvious to everyone that distributed code, eventual consistency, and other similar non-totally-ordered examples are much harder to get right than procedural code.
This is a category error. All code is distributed; the world is partially-ordered. "Procedural code" (i.e. serial/sequential execution) is a strategy for dealing with that. It's particularly easy (give each step a single dependency; its "predecessor"), but also maximally inefficient. That's often acceptable when running on a single machine, and sometimes acceptable for globally-distributed systems too (e.g. that's what a blockchain is).
Forcing it by default leads to all sorts of complication (e.g. multithreading, "thread-safety", etc.). Making it opt-in gives us the option of concurrency, even if we write almost everything in some "Serial monad" (more likely, a continuation-passing transformer).
In most popular languages, the order of evaluation of both statements and expressions is specified. For your example, the insert query call is guaranteed to happen before the select query call in Java, C#, JS, Go, Python, Ruby, Rust, Common Lisp, SML. It is indeed unspecified in C, C++, Haskell, Scheme, OCaml.
While C and C++ are extremely commonly used, I would still say that the majority of popular languages fully define evaluation order. Even more so since the designers of most of these languages considered this a fix for a flaw in C. Rust is particularly interesting here, as it initially did not specify the order, but then reversed that decision after more real-world experience.
> "Procedural code" (i.e. serial/sequential execution) is a strategy for dealing with that. It's particularly easy (give each step a single dependency; its "predecessor"), but also maximally inefficient.
> Forcing it by default leads to all sorts of complication (e.g. multithreading, "thread-safety", etc.). Making it opt-in gives us the option of concurrency, even if we write almost everything in some "Serial monad" (more likey, a continuation-passing transformer)
You yourself admit that serial code is a strategy for dealing with the complexity of the world - it doesn't complicate anything, it greatly simplifies things.
Threads and other similar constructs are normally opt-in and used either to model concurrency that is relevant to your business domain, or to try to achieve parallelism as an optimization. They are almost universally seen as a kind of necessary evil - and yet you seem to advocate for introducing the sort of problems threads bring into every sequential program.
Thinking of your program as a graph of data dependencies is an extremely difficult way to program, especially in the presence of any kind of side effects. I don't think I've ever seen anyone argue that it's actually something to strive for.
Even the most complete and influential formal model of concurrent programming, Sir Tony Hoare's CSP, is aimed at making the order of operations as easy to follow as possible, with explicit ordering dependencies kept to a minimum (only when sending/receiving messages).
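For instance, a rough CSP-flavoured sketch (using `Chan` and `forkIO` from base; the messages and "processes" are made up) where the only ordering between the two threads is the send/receive itself:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan

    main :: IO ()
    main = do
      ch <- newChan
      _ <- forkIO (writeChan ch "insert done")  -- "process" A announces its step
      msg <- readChan ch                        -- "process" B blocks until A has sent
      putStrLn ("got: " ++ msg)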
> That's often acceptable when running on a single machine, and sometimes acceptable for globally-distributed systems too (e.g. that's what a blockchain is).
It's not just blockchain: TCP and QUIC, ACID compliance and transactions, at-least-once message delivery in pub-subs: all of these are designed to provide an in-order abstraction on top of the underlying concurrency of the real world. And they are extremely popular because of just how much easier it is to be able to rely on this abstraction.
> You yourself admit that serial code is a strategy for dealing with the complexity of the world - it doesn't complicate anything, it greatly simplifies things.
Serial code greatly simplifies serial problems. If you want serial semantics, go for it. Parent comments mentioned Haskell, which lets us opt in to serial semantics via things like `ST`. Also, whilst we could write a whole program this way, we're not forced to; we can make different decisions on a per-expression level. Having rich types also makes this more pleasant, since it prevents us using sequential code in a non-sequential way.
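A minimal sketch of that opt-in (the function is made up, but `ST` and `STRef` are standard): the monadic binds sequence the writes, and `runST` keeps the mutation local and pure.

    import Control.Monad.ST
    import Data.STRef

    countUp :: Int
    countUp = runST $ do
      ref <- newSTRef (0 :: Int)
      writeSTRef ref 1         -- this write happens before...
      modifySTRef ref (+ 1)    -- ...this one, by construction
      readSTRef ref            -- always 2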
However, there is a problem with serial code: it complicates concurrency. Again: if you don't want concurrency then serial semantics are fine, and you can ignore the rest of what I'm saying.
> Threads and other similar constructs are normally opt-in and used either to model concurrency that is relevant to your business domain, or to try to achieve parallelism as an optimization. They are almost universally seen as a kind of necessary evil
Such models are certainly evil, but not at all necessary. They're an artefact of trying to build concurrent semantics on top of serial semantics, rather than the other way around.
> and yet you seem to advocate for introducing the sort of problems threads bring into every sequential program.
Not at all. If you want sequential semantics, then write sequential programs. I'm advocating that concurrent programs not be written in terms of sequential semantics.
Going back to the Haskell example, if we have a serial program (e.g. using `ST`) and we want to split it into a few concurrent pieces, we can remove some of the data dependencies (either explicit, or implicit via `do`, etc.) to get independent tasks that can be run concurrently (we can also type-check that we've done it safely, if we like). That's easier than trying to run the serial code concurrently (say, by supplying an algebraic effect handler which doesn't obey some type class laws) then crossing our fingers. The latter is essentially what multithreading, and other unsafe uses of shared-mutable-state are doing.
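A sketch of that refactoring (the task names are made up; `concurrently` is from the `async` package): once we notice the second step never uses the first step's result, the do-block dependency can be dropped and the two pieces run concurrently.

    import Control.Concurrent.Async (concurrently)

    taskA, taskB :: IO Int
    taskA = pure 1
    taskB = pure 2

    serialVersion :: IO (Int, Int)
    serialVersion = do
      a <- taskA
      b <- taskB               -- never uses 'a', so this sequencing is artificial
      pure (a, b)

    concurrentVersion :: IO (Int, Int)
    concurrentVersion = concurrently taskA taskB  -- same results, no imposed order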
See also https://en.wikipedia.org/wiki/Relativistic_programming