> assuming the resulting shell script is as inscrutably as binary executable
It's quite the opposite: pnut generates shell code that stays close to the original C code to make it easy to audit. A useful way to see pnut is as a tool that rewrites C code to POSIX shell without significantly changing its structure.
This means that even if GCC is required for the initial compilation of pnut (GCC compiles pnut, then pnut compiles itself and we get the pnut-sh.sh script), the script can be "sanitized" from trusting trust attacks by simply comparing the script to the C code and making sure GCC hasn't introduced any malicious code.
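To illustrate the kind of structure-preserving mapping involved (a made-up sketch, not actual pnut output), a small C loop translates to shell with essentially the same shape:

```shell
# Illustrative sketch only -- not actual pnut output. The C loop
#
#   int i = 0;
#   while (i < 3) { putchar('a' + i); i = i + 1; }
#
# maps to POSIX shell that mirrors it almost line for line:
i=0
while [ $((i < 3)) -ne 0 ]; do
  printf "\\$(printf '%03o' $((97 + i)))"  # putchar('a' + i)
  : $((i += 1))
done
printf '\n'
```

Because the control flow and variable names carry over directly, comparing the shell output against the C source is a line-by-line exercise rather than reverse engineering.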
As you point out, it moves the trust from the binary to the shell executable, but the shell is already a key piece of any build process and requires a minimum level of trust. The technique of bootstrapping on multiple shells and comparing the outputs is known as Diverse Double-Compiling[0], and we think POSIX shell is particularly suited for this use case since it has so many implementations from different, and likely independent, sources.
The age and stability of the POSIX shell standard also play in our favor. Old shell binaries should be able to bootstrap Pnut, and those binaries may be less likely to be compromised since the trusting trust attack was less well known at the time, akin to low-background steel[1], which was produced before nuclear bombs contaminated the atmosphere.
It seems ShellCheck errs on the side of caution when checking arithmetic expansions, and some of its recommendations are not relevant in the context they are given. For example, on `cat.sh`, one of the lines marked in red is:
In examples/compiled/cat.sh line 7:
: $((_$__ALLOC = $2)) # Track object size
^-- SC1102 (error): Shells disambiguate $(( differently or not at all. For $(command substitution), add space after $( . For $((arithmetics)), fix parsing errors.
^-----------------^ SC2046 (warning): Quote this to prevent word splitting.
^--------------^ SC2205 (warning): (..) is a subshell. Did you mean [ .. ], a test expression?
^-- SC2283 (error): Remove spaces around = to assign (or use [ ] to compare, or quote '=' if literal).
^-- SC2086 (info): Double quote to prevent globbing and word splitting.
It seems to be parsing the arithmetic expansion as a command substitution, which then causes the analyzer to produce errors that aren't relevant. ShellCheck's own documentation[0] mentions this in the exceptions section, and the code is generated such that quoting and word splitting are not an issue (variables never contain whitespace or special characters).
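For reference, the flagged construct is standard POSIX arithmetic, and a minimal standalone version behaves as expected:

```shell
# The pattern ShellCheck misreads: ':' is the no-op builtin, and the
# arithmetic expansion is evaluated purely for its assignment side effect.
x=0
: $((x = 5))   # spaces around '=' are valid inside $(( ))
echo "$x"      # prints 5
```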
It also warns about `let` being undefined in POSIX shell, but `let` is defined as a function in the generated script, so it's a false positive triggered by the use of the name `let` specifically.
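As a simplified sketch (pnut's actual helper may well differ), a `let` function can be defined in plain POSIX shell to emulate local variables by saving values on a numbered stack and restoring them later:

```shell
# Hypothetical save/restore stack for emulating local variables.
# Values are assumed to be numbers (no whitespace), as in pnut-generated code.
__SP=0
let() {   # let NAME VALUE: save NAME's current value, then assign VALUE
  __SP=$((__SP + 1))
  eval "__save_$__SP=\$$1; __name_$__SP=$1; $1=\$2"
}
endlet() {  # restore the most recently saved variable
  eval "__n=\$__name_$__SP"
  eval "$__n=\$__save_$__SP"
  __SP=$((__SP - 1))
}

x=1
let x 99    # x is now 99
endlet      # x is back to 1
```

Since `let` here is an ordinary function rather than the bash builtin, ShellCheck's "undefined in POSIX sh" warning doesn't apply to the generated script.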
If you think there are other issues or ways to improve Pnut's compatibility with Shellcheck, please let us know!
From our experience, ksh is generally faster, and dash sits between ksh and bash. One reason is that dash stores variables in a very small hash table with only 39 entries[0], so variable lookup quickly degrades to linear time as the number of variables grows. But even with that, dash is still surprisingly fast -- when compiling `pnut.c` with `pnut.sh`, dash comes in second place:
For me `dash` compiles in just a few seconds. If you link to a 1-line problem (here, #define VTABSIZE 39), then why not boost that to 79 or 113, say, re-compile the shell and re-run your benchmark? Might lead to a change in upstream that could benefit everyone.
There was an effort to (re)start ksh93 development, but AT&T halted it. The bugfixes from that failed effort were folded back into Korn's last release.
It doesn't support NULs as you pointed out, but it's interesting to see similarities between your implementation and the one generated by Pnut.
Because we use `read -r`, we haven't tested reading binary files. Fortunately, the shell's `printf` builtin can emit all 256 byte values, so Pnut can at least output binary files. This makes it possible for Pnut to have an x86 backend for use in reproducible builds.
Regarding the use of `read`, one constraint we set ourselves when writing Pnut is to not use any external utilities, including those specified by the POSIX standard (other than `read` and `printf`). This maximizes the portability of the code generated by Pnut and is enough for the reproducible-build use case.
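For example, `printf` alone is enough to emit arbitrary byte values, which is what makes binary output possible without external tools (the helper name below is just for illustration):

```shell
# Emit the byte whose decimal value is $1, using only the printf builtin:
# first format the value as a 3-digit octal number, then use it as an
# octal escape in the outer printf's format string.
emit_byte() {
  printf "\\$(printf '%03o' "$1")"
}
emit_byte 72    # 'H'
emit_byte 105   # 'i'
echo
```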
We're still looking for ways to integrate existing shell code with C. One way this can be done is through the `#include_shell` directive, which includes existing shell code in the generated shell script. This makes it possible to call the necessary utilities to read raw bytes without having Pnut itself depend on less portable utilities.
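As a hedged sketch (the names are illustrative, and `od`/`tr` are exactly the kind of external utilities the core runtime avoids), a helper pulled in via `#include_shell` to read raw bytes could look like:

```shell
# Print stdin as decimal byte values, one per line, using od.
# od -An drops the address column, -v disables duplicate suppression,
# -tu1 prints each byte as an unsigned decimal.
read_bytes() {
  od -An -v -tu1 | tr -s ' ' '\n' | grep .
}
printf 'AB' | read_bytes   # prints 65 then 66
```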
Sorry, but since the very goal of base64 is to encode "uncomfortable" bytes, saying that your example doesn't work with uncomfortable bytes is like providing a fibonacci demo that only works with arguments less than 3, or a clock that only shows correct time twice a day.
In the context of what it seems to be primarily attempting to achieve, assisting in the bootstrapping of more complex environments directly or indirectly dependent on C, I found the base64 example (and more so the SHA-256 example in the same directory) quite interesting, and evidence of the sophistication of pnut notwithstanding the limitations. And as was pointed out, it wouldn't be difficult to hack in the ability to read binary data: just swap in a replacement for the getchar routine, such as I've done with od. In fact, that ease is one of the most fascinating aspects of this project--they've built a conceptually powerful execution model for the shell that can be directly targeted when compiling C code, as opposed to indirection through an intermediate VM (e.g. a P-code interpreter in shell). It has its limitations, but those can be addressed. Given the constraints, the foundation is substantial and powerful even from a utilitarian perspective.
When people discuss Turing completeness and related concepts one of the unstated caveats is that neither the concept itself, nor most solutions or environments, meaningfully address the problem of I/O with the external environment. pnut is kind of exceptional in this regard, even with the limitations.
That's correct! Unlike Bash and other modern shells, the POSIX standard doesn't include arrays or any other data structures. The way we found around this limitation is to use arithmetic expansion and indexed shell variables (which start with `_`, as you noted) to get random memory access.
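Concretely, since parameter expansion happens before the arithmetic is evaluated, `_$addr` inside `$(( ))` names the numbered variable for that address:

```shell
# Numbered shell variables as memory cells: $addr expands first, so the
# expression below is evaluated as $((_5 = 42)).
addr=5
: $((_$addr = 42))   # store 42 at "address" 5
echo $((_$addr))     # load: prints 42
```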
Since I experimented with something similar in the past to mimic multidimensional arrays: depending on the implementation this can absolutely _kill_ performance. IIRC, Dash does a linear lookup of variable names, so when you create tons of variables each lookup starts taking longer and longer.
We haven't found this to be an issue for Pnut. One of the metrics we use for performance is how much time it takes to bootstrap Pnut, and dash takes around a minute, which is about the time taken by bash. This is with Pnut allocating around 150KB of memory when compiling itself, showing that dash can still be useful even when hundreds of KBs are allocated.
One thing we did notice is that subshells can be a bottleneck when the environment is large, and so we avoided subshells as much as possible in the runtime library. Did you observe the same in your testing?
> We haven't found this to be an issue for Pnut. One of the metrics we use for performance is how much time it takes to bootstrap Pnut, and dash takes around a minute, which is about the time taken by bash. This is with Pnut allocating around 150KB of memory when compiling itself, showing that dash can still be useful even when hundreds of KBs are allocated.
Interesting. When you say "even when hundreds of KBs are allocated", do you mean this is allocating variables with large values, or tons of small variables? My case was the latter, and with that I saw a noticeable slowdown on Dash.
Simplest repro case:
$ cat many_vars_bench.sh
#!/bin/sh
_side=500
i=0
while [ "${i}" -lt "${_side}" ]; do
j=0
while [ "${j}" -lt "${_side}" ]; do
eval "matrix_${i}_${j}=$((i+j))" || exit 1
: $(( j+=1 ))
done
i=$((i+1))
done
$ time bash many_vars_bench.sh
5.60user 0.12system 0:05.78elapsed 99%CPU (0avgtext+0avgdata 57636maxresident)k
0inputs+0outputs (0major+13020minor)pagefaults 0swaps
$ time dash many_vars_bench.sh
40.75user 0.14system 0:41.22elapsed 99%CPU (0avgtext+0avgdata 19972maxresident)k
0inputs+0outputs (0major+4951minor)pagefaults 0swaps
Dash was ~8 times slower. Increase the side of the square "matrix" for a proportionally bigger slowdown (this one uses 250003 variables).
> One thing we did notice is that subshells can be a bottleneck when the environment is large, and so we avoided subshells as much as possible in the runtime library. Did you observe the same in your testing?
Yes, launching a new process is generally expensive and so is spawning a subshell. If the shell is something like Bash (with a lot of startup/environment setup cost) then you'll feel this more than something like Dash, where the whole point was to make the shell small and snappy for init scripts: https://wiki.ubuntu.com/DashAsBinSh#Why_was_this_change_made...
In my limited testing, Bash generally came out on top for single-process performance, while Dash came out on top for scripts with more use of subshells.
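The usual trick for sidestepping that cost is to return results through a variable instead of capturing a subshell (a generic illustration, not pnut's exact convention):

```shell
# slow: every call forks a subshell      r=$(square 7)
square() { echo $(($1 * $1)); }

# fast: no fork; the result is left in a global for the caller to read
square_v() { __res=$(($1 * $1)); }

square_v 7
echo "$__res"   # prints 49
```

The fork itself is cheap in isolation, but when the shell copies a large environment of variables for each subshell, avoiding `$(...)` in hot paths pays off quickly.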
Because all shell variables in code generated by pnut are numbers, variables never contain whitespace or special characters and don't need to be quoted. We considered quoting all variable expansions, as this is generally seen as best practice in shell programming, but felt it hurt readability and decided against it.
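A small illustration of why quoting is unnecessary when every value is numeric:

```shell
# A numeric value is a single word: it contains no IFS characters and no
# glob metacharacters, so unquoted expansion can't split or glob.
n=42
set -- $n            # field-splits $n; a number stays one field
echo $#              # prints 1
[ $n -lt 100 ] && echo "comparison works unquoted"
```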
If you think there are other issues, please let me know!
Page 10 of the SLE24 presentation has a tombstone diagram showing the compilation steps to go from pnut's C code to a GCC binary: https://github.com/udem-dlteam/pnut/blob/main/doc/presentati...