I got bit by that as well. I think testing common arguments is useful, but I wonder if it would make more sense to limit each question to a single argument. That would eliminate the need to handle permutations, and it would also keep the questions simpler.
I think to do this properly you need to implement the argument parsing the command is doing; e.g., use getopt(3C). After all, "ls -la -h --" is valid too.
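To make that concrete, here is a rough sketch of comparing answers by their parsed options rather than as raw strings. It uses Python's `getopt.gnu_getopt` as a stand-in for getopt(3C). The `normalize` helper and its three-flag option table for `ls` are made up for illustration; a real checker would need a table per command.

```python
import getopt
import shlex

# Hypothetical, deliberately tiny option table for `ls` (an assumption,
# not the full set of flags the real command accepts).
SHORT_OPTS = "alh"
LONG_OPTS = ["all", "human-readable"]
ALIASES = {"--all": "-a", "--human-readable": "-h"}

def normalize(command):
    """Return (program, set of flags, operands) for a command line."""
    argv = shlex.split(command)
    program, args = argv[0], argv[1:]
    # gnu_getopt accepts bundled flags (-la), intermixed operands, and "--".
    opts, operands = getopt.gnu_getopt(args, SHORT_OPTS, LONG_OPTS)
    flags = frozenset(ALIASES.get(name, name) for name, _ in opts)
    return program, flags, tuple(operands)

# Different spellings of the same invocation compare equal.
same = normalize("ls -la -h --") == normalize("ls --all -hl")
```

With this, `ls -la -h --`, `ls -alh` and `ls --all -hl` all normalize to the same value, so permutations stop mattering. It still only covers flags that take no argument.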
Clearly the correct way to implement this is to run each command in a short-lived container and check the output matches what's expected. For things like the pager questions, it'll need to implement a full vte and vte comparison, so that should be fun.
For bonus points, one of the options for emulating Linux in the browser could be used to do that more securely and without the need for a service... though the load time might increase by several minutes or hours.
> Clearly the correct way to implement this is to run each command in a short-lived container and check the output matches what's expected.
I'm not sure it is clearly correct. Some commands don't produce any output, `mkdir` and `cd` for example (the first two in the animation). Furthermore, they'd have to blacklist `echo` and other ways of producing the expected output that go against the spirit of the exercise... until an answer is, say, `echo $?` - at which point you'd need a preceding command that exits with a suitable status to keep the "just execute it" approach workable.
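A quick sketch of both problems, assuming a POSIX `sh` is on the PATH. A temporary directory stands in for the container here, and the `run` helper is just something I made up for the demo.

```python
import os
import subprocess
import tempfile

def run(script, cwd):
    """Run a shell snippet with `sh -c`; return (stdout, exit status)."""
    done = subprocess.run(["sh", "-c", script], cwd=cwd,
                          capture_output=True, text=True)
    return done.stdout, done.returncode

with tempfile.TemporaryDirectory() as sandbox:
    # mkdir succeeds silently, so there is no output to compare against;
    # the checker has to inspect the filesystem instead.
    out, status = run("mkdir demo", sandbox)
    created = os.path.isdir(os.path.join(sandbox, "demo"))

    # What `echo $?` prints depends entirely on the command run before it.
    after_false, _ = run("false; echo $?", sandbox)
    after_true, _ = run("true; echo $?", sandbox)
```

So an output-only comparison says nothing about `mkdir`, and the "right" output of `echo $?` is undefined unless the question also pins down what ran before it.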