This rant misplaces its frustration. This is not a problem with unix filesystems; it is a problem with Bourne Shell scripts and with UNIX argument-parsing semantics.
Bourne shell is notorious for its problematic quoting, of filesystem data and of data from any other source. Every example in which he described a problem with a filename parameter could just as well be a problem with a non-filename parameter. The correct solution is to not program complicated scripts in Bourne Shell, and instead use a language which does not implement variable access by interpolating strings and then re-tokenizing and re-evaluating them. Examples of satisfactory languages include Perl, Python, and Ruby.
Regarding UNIX arguments and the dash, it is an unfortunate aspect of the flat argc/argv/envp calling convention for unix programs. Some other operating systems provide more structure in their calling conventions, explicitly separating different types of parameters from one another. That structure is both a strength and a weakness, as it results in a uniform yet inflexible systems interface. One of the greatest strengths of UNIX is that its calling convention is so flexible. The semantics used today are quite different from the semantics used 40 years ago -- yet execve() remains unchanged. I would encourage anyone interested to do a bit of historical digging here and see how those more rigid system APIs fared over time.
Anyway, the solution to his initial question of using `ls` is the -- argument, which signifies that option parsing should be disabled for the remainder of argv: `ls -- *`
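The same idea carries over to the "satisfactory" languages; a minimal Python sketch (the filenames here are made up for illustration):

```python
import subprocess

# Filenames gathered from anywhere, including ones that look like options.
files = ["--help", "-rf", "normal.txt"]

# Each list element becomes exactly one argv entry; "--" tells ls that
# everything after it is a filename, never an option.
subprocess.run(["ls", "--", *files])
```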
The correct answer to his dotfile/glob question is: "glob() and the Bourne shell do not have the semantics you're after. Do not use them, use readdir()."
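A sketch of the readdir()-style approach in Python, where os.scandir() is a thin wrapper over readdir():

```python
import os

# Every entry is returned verbatim (minus "." and ".."), dotfiles
# included, with no pattern expansion and no re-tokenization anywhere.
def dotfiles(path="."):
    return [e.name for e in os.scandir(path) if e.name.startswith(".")]

print(dotfiles())
```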
The correct answer to his find -print question is: Yes, -print's use of the newline as a separator was a mistake (filenames may themselves contain newlines), and it is a mistake repeated continually throughout the land of shell scripting and the accompanying standard UNIX utilities. As he notes, it is why -print0 was introduced. Making -print0 standard is far easier than reworking filesystem semantics (and reworking userland in this manner is a more complete solution, as it addresses data integrity issues from non-filesystem inputs as well). If you want reliable, correct programs, do not write them in shell.
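Consuming -print0 output is straightforward in a real language; a minimal Python sketch:

```python
import subprocess

# NUL is the one byte (besides "/") that can never appear in a filename,
# so splitting -print0 output on b"\0" is unambiguous even for names
# containing newlines, tabs, or leading dashes.
out = subprocess.run(["find", ".", "-type", "f", "-print0"],
                     capture_output=True).stdout
paths = [p for p in out.split(b"\0") if p]
for p in paths:
    print(p.decode("utf-8", "surrogateescape"))
```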
No, even "satisfactory" languages can suffer problems. For example a couple of days ago I discovered a nice exploit in the qemu-img program, and using any language to parse the output wouldn't help you:
Your link above shows an author who claims JSON output, yet the output is clearly invalid JSON (the toplevel is not a [] or {}, the quoting is improper, etc.). It appears that instead of using JSON serialization, the author merely printed key/value pairs separated by the string ": ". The problems with this approach are obvious.
This is why using a proper serialization format is important.
If the author had done this correctly and used a proper JSON library to produce this output, the following, completely safe result would have occurred:
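Something along these lines -- a Python sketch, with field names that are illustrative rather than qemu-img's actual schema:

```python
import json

info = {
    "filename": "image.qcow2\nbacking file: /etc/passwd",  # hostile name
    "format": "qcow2",
    "virtual-size": 1073741824,
}

# json.dumps escapes the embedded newline as \n, so the hostile filename
# cannot forge an extra key/value line for a naive consumer to trust.
print(json.dumps(info, indent=2))
```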
The author probably would have been best served by YAML, which is more easily readable -- and which, like JSON, provides mechanisms to properly represent arbitrary data.
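For example, with PyYAML (assuming it is available), a hostile filename is quoted and round-trips intact:

```python
import yaml  # assumes the PyYAML package is installed

info = {"filename": "evil\nname.qcow2", "format": "qcow2"}

# safe_dump emits the newline as an escape inside a quoted scalar,
# so the value survives serialization unchanged.
text = yaml.safe_dump(info)
print(text)
assert yaml.safe_load(text) == info
```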
In any event, the discussion is severely confused. Ad-hoc buggy formats cannot be compared with well-formed JSON or YAML. This has nothing at all to do with the language.
You should probably read the link more closely. I'm advocating using JSON so that programs are able to safely parse the output of 'qemu-img'. At the moment there are many programs that parse the (current text) output, and they almost all have security holes as a result.
Yes it does - qemu-img is written in C. The two programs we found exploitable were written in Python and C. They are written in "satisfactory" languages. Bash is not involved. Yet both suffer exploits because of \n (and other) characters in filenames.
The issue you refer to is in a poorly formed, ad-hoc serialization format. It has nothing to do with representation of variables at runtime. It has nothing to do with the language.
It is a programming error, not an inherent flaw in the language.
That's incorrect. As was already pointed out, this issue has nothing to do with reading data from the filesystem or manipulating variables internal to the program, and everything to do with poor choices made when using printf.
In other words, those files aren't causing the QEMU program internals to re-interpolate one variable as two values. They're merely messing up a poorly written data exchange format.
Other languages, such as those I listed above, simply do not have this issue. The C program did not misinterpret a variable as two separate values because it contained spaces. That is the nature of the danger with shell -- any unquoted reference to a variable in Bourne involves string interpolation and re-tokenization. This simply does not happen in C.
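To make the contrast concrete, a small Python sketch -- the hostile name stays a single argv entry unless you deliberately route it through a shell:

```python
import subprocess

name = "two words; rm -rf ~"  # a hostile "filename"

# Dangerous: the string is handed to /bin/sh, which re-tokenizes it,
# splits on the spaces, and executes the command after the semicolon.
# subprocess.run("ls " + name, shell=True)

# Safe: one list element is one argv entry; no interpolation, no
# re-tokenization, and "--" blocks option injection for good measure.
subprocess.run(["ls", "--", name])
```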
> The correct solution is to not program complicated scripts in Bourne Shell, and instead use a language which does not implement variable access by interpolating strings and then re-tokenizing and re-evaluating them.
The author also mentions a problem with the filesystem: what is the filesystem encoding? Do you treat filenames as blobs, or encoded strings? What do you do if you think the filesystem stores UTF-8 but there's a filename which has a byte sequence which is invalid UTF-8?
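Python 3's answer, for what it's worth, is PEP 383: treat names as bytes at the edges and smuggle undecodable bytes through as lone surrogates. A minimal sketch:

```python
import os

raw = b"caf\xe9.txt"  # Latin-1 bytes: not valid UTF-8

# With "surrogateescape", undecodable bytes become lone surrogates, so
# the name survives a decode/encode round trip even on a UTF-8 system.
name = os.fsdecode(raw)
assert os.fsencode(name) == raw
```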
I was thinking about this some more. You are right, but both you and I missed the point.
I think the author is saying that the problem is that the "unix filesystem" is actually a filesystem that doesn't match the rest of unix, where 'unix' includes the sh/csh/bash shells and command-line arguments which start with '-'.
If the filesystem wasn't as broad in what it accepted ... and the author is trying to convince us that POSIX allows that ... then it would be a unix filesystem which was a better match to unix.
Yes, Bourne Shell's variable access scheme is a bit ghetto, but to me the problem is that the shell is doing globbing at all. Why not have the shell pass "*" through to the program, and have the program itself perform globbing? Then filenames would have no impact on how the command-line is parsed.
Because that's how MS-DOS used to work, and it was dumb. It means every program has to do its own globbing (and, often, programs simply didn't). In any case, bash does get this right: `ls *` will pass the correct filenames to the ls program no matter what the filenames contain. Also, quotes around variable expansions can cope with any characters.
So what? If the primary API used by command-line applications to open files does the globbing, then programs will have to go out of their way to not glob. And you'll get the added benefit that globs will only be applied to arguments that are actually meant to specify filenames. There would be none of this escaping "*" when you pass it to "find."
> In any case, bash does get this right: `ls *` will pass the correct filenames to the ls program no matter what the filenames contain.
That doesn't solve the problem; your filename could be called "--help."
bash isn't interpreting '--help' at all; it is just passed on to the program being executed, and most GNU CLI programs conventionally interpret '--help' as a special option.
If your filename is indeed --help, the convention is to use '--' as the separator between your command line options and filenames. Anything after -- is not interpreted as a command-line option.
Another way would be to use a more qualified filename form ('./--help').
Because, as the author points out, different users may want different globbing behavior. Globbing is not performed identically between shells.
If the author so wished, he might trivially create his own shell and allow * to match dotfiles, with absolutely no disruption to the rest of his system. Or one could write a shell which uses a regex instead of a glob. Or the SQL LIKE query syntax. The possibilities are endless. Anyone is free to do this.
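For example, a toy glob with the semantics he wants -- '*' matching dotfiles -- takes a few lines of Python (myglob is a hypothetical helper, not anything standard):

```python
import os
import re

# A glob with different semantics: "*" matches dotfiles too. The pattern
# is translated to a regex and tested against raw readdir()-style names.
def myglob(pattern, path="."):
    rx = re.compile(re.escape(pattern).replace(r"\*", ".*") + r"\Z")
    return [name for name in os.listdir(path) if rx.match(name)]

print(myglob("*"))       # includes .profile, .bashrc, and friends
print(myglob("*.txt"))
```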
The fact is, the current globbing behavior in unix shells strikes a good balance between pedantic correctness and "what I really want." The author's frustration is due to his attempting to use a command line interface as a structured programming language.
A better, and universal, solution to the problem of filenames starting with - is to prefix all relative paths with ./. A path like ./-blah will never be misinterpreted as a command line option, regardless of the tool, and doesn't depend on the -- convention which is only inconsistently present.
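A trivial sketch of the idea (deoptionize is a hypothetical name):

```python
# Any relative path gets a ./ prefix, so a name like "-rf" can never be
# parsed as an option by the receiving tool.
def deoptionize(path):
    return path if path.startswith("/") else "./" + path

assert deoptionize("-rf") == "./-rf"
assert deoptionize("/etc/passwd") == "/etc/passwd"
```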