Making Wrong Code Look Wrong (joelonsoftware.com)
125 points by ColinWright on Aug 22, 2011 | 86 comments


Folks, please do not fall into the simple error of arguing whether Haskell’s typing beats Hungarian notation. This would be bikeshedding. The safe/unsafe strings are given as an example; the larger point he is making is about using coding conventions to highlight potential semantic errors.

Every program is going to have some kind of semantic problem that cannot easily be fixed with compile-time analysis. Unsafe strings in web applications are merely one example for one class of programs written in one class of languages. If you grasp what he’s saying about coding conventions, ignore the example.

So to his actual point: Do you agree with my contention that there are always going to be some semantically wrong bits of code that cannot be automatically detected by the compiler, or that it is awkward to do so?

If so, do you think coding conventions help more than they hinder? In other words, do they increase code safety without imposing too much of a burden on readability?


> Do you agree with my contention that there are always going to be some semantically wrong bits of code that cannot be automatically detected by the compiler, or that it is awkward to do so?

It seems to me that answering this question is more or less exactly "arguing whether Haskell’s typing beats Hungarian notation," although of course Haskell type systems aren’t the only possible alternative. For example, any number of templating systems in languages like Python use late-bound method calls and run-time type checks, instead of static typing, to avoid cross-site scripting errors.

The problem, to me, is that Hungarian notation (and other coding conventions) only helps you detect semantic errors in cases where the errors can be detected by shallow, local analysis — exactly the kinds of errors that can be detected automatically by a program, either at compile-time or run-time.

So, what do we do about the deeper errors? Well, I don't know. I don't think abandoning coding conventions is a solution — I still want the code I read to be written in a predictable style, so that the variation in it is semantically meaningful — but I don't think coding conventions can do a better job than software of detecting errors.

However, if I were trying to write a GUI on a 1-MIPS machine with 1 MiB of RAM in the late 1980s, it might work better to do a bunch of that checking by hand (with Hungarian) than to try to invent a better programming language that does the checking for me automatically.


OK, reasonable point, I had only mentioned that example in another reply because that was the one he used. Do I agree that there will always be semantically wrong bits of code that can't be detected by a compiler (whether as part of a type system or otherwise)? I'm not sure I do. OK, sure, it's probably possible to come up with some contrived examples that fit the case, but generally? No, I think that most problems are more tractable if looked at as a language design issue (be that literal language design, some form of internal DSL, or other) than as a convention issue.

And surely bringing up something like Haskell is germane? The article implies that this is the best/only/right way of dealing with issues like this, when that definitely isn't true for languages with a different/stronger type system, as one case.

Of course, this is pure opinion and supposition on my part - I suspect a proof either way is NP-BloodyImpossible. I do feel like coding conventions such as those he suggests make code harder to read, even if only a little bit. And that's ignoring IDE support as well - I feel the same if I'm in Emacs. If there are enough things in scope that I can't hold in my head the relevant [meta] information about them for that usage, that's a problem for me, probably of quite a different kind.


> If there are enough things in scope that I can't hold in my head the relevant [meta] information about them for that usage, that's a problem for me, probably of quite a different kind.

This is a brilliant observation, thanks. Reminds me of Yegge’s observations about large codebases being like dirt, and the more dirt you have, the more machinery you need for pushing dirt around, and the more people, and the more systems for managing people and dirt-moving machines, ...

And the better way forward might simply be to have less dirt.

http://steve-yegge.blogspot.com/2007/12/codes-worst-enemy.ht...


> Do I agree that there will always be semantically wrong bits of code that can't be detected by a compiler (whether as part of a type system or otherwise)? I'm not sure I do.

Here's a very simple counterexample that I don't believe it's possible for a compiler to detect: writing a - b when you should have written b - a. More generally, that applies to any non-commutative function which takes two arguments of the same type.

This is also a good example of where a coding convention could help: if those variables had been named more descriptively it would have been much easier to see which way around they should go.


I'm sure I'm just missing something, but I'm struggling to come up with a reasonable case where you couldn't just have a and b be different types and provide an operator for typeof a - typeof b but not typeof b - typeof a (or vice-versa).

Of course, that would upset Joel on account of not knowing what - did without knowing the types.
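A minimal Haskell sketch of that one-way-operator idea (all names here are invented for illustration): give the two operands distinct newtypes and define the subtraction in only one direction, so writing the arguments the wrong way round is a compile error rather than a silent bug.

```haskell
newtype Total = Total Int
newtype Spent = Spent Int

-- Subtraction is only defined in one direction.
remaining :: Total -> Spent -> Int
remaining (Total t) (Spent s) = t - s

main :: IO ()
main = print (remaining (Total 10) (Spent 3))  -- prints 7
-- remaining (Spent 3) (Total 10) does not typecheck
```

The cost, as noted above, is that plain `-` no longer works on these values without unwrapping.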


Depends what you mean by "reasonable" I'm guessing. Manually encoding a semantic analysis would have a very high overhead, and you might be able to do it via dependent types instead (not sure though, they tend to hurt my brain when I try to actually understand them), but I don't think they're considered practical (yet).


It's maybe not very realistic, but you'd have trouble with this approach if you were trying to code the distance function `(a, b) \mapsto |a - b|` in a language that didn't have an absolute value.


Just because you can, doesn't mean you should. :-)

Forcing a and b to be different types could cause a different problem - maybe a worse one, depending on how many subtractions your code requires.


This is also an error which cannot, as far as I know, be detected by any coding convention. I'd suggest that if you are trying to prevent such errors with a coding convention rather than tests, that's a process smell.

In practice, I find it best to avoid non-commutative functions with two arguments of the same type. The obvious choice (for a DB I'm writing):

    addRecord :: ByteString -> ByteString -> IO Bool
    addRecord key value = do ...
But this is safer:

    addRecord :: DBKey -> DBValue -> IO Bool
Now `addRecord value key` won't compile.
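For completeness, a hedged sketch of the newtype declarations this relies on (the store itself is stubbed out; only the signatures matter here):

```haskell
import Data.ByteString.Char8 (ByteString, pack)

newtype DBKey   = DBKey   ByteString
newtype DBValue = DBValue ByteString

-- Stub: a real implementation would write k/v to the store.
addRecord :: DBKey -> DBValue -> IO Bool
addRecord (DBKey _k) (DBValue _v) = return True

main :: IO ()
main = addRecord (DBKey (pack "user:1")) (DBValue (pack "Alice")) >>= print
-- addRecord (DBValue ...) (DBKey ...) does not typecheck
```

Since these are newtypes, the wrappers carry no runtime cost; only the compiler sees them.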


> This is also an error which cannot, as far as I know, be detected by any coding convention. I'd suggest that if you are trying to prevent such errors with a coding convention rather than tests, that's a process smell.

Coding conventions don't detect errors, they just help to make them more obvious. For instance, if the variables had been named "cake" and "slice", it probably would have been a bit more obvious which side of the '-' sign each one should be on.

In any case, it's not an either-or choice between coding conventions and tests. A sensible developer will have both.

> In practice, I find it best to avoid non-commutative functions with two arguments of the same type.

I guess there's a silent "..where possible" attached to that? :-) There are many times where it's not possible, or at least not sensible; and having too many type conversions in your code can be just as bad for readability as having too few.


It is arguably a coding convention to say "make new types and appropriate conversions to help ferret out errors, even when they are just aliases to existing types". I see what you are doing here as fundamentally the same trick that the OP article does, albeit with some nice static analysis thrown in to help.


> Every program is going to have some kind of semantic problem that cannot easily be fixed with compile-time analysis. Unsafe strings in web applications are merely one example for one class of programs written in one class of languages. If you grasp what he’s saying about coding conventions, ignore the example.

I disagree. If the example has a better solution (and in another comment here I argue it does, though it's not at compile time), it's well worth looking at whether that solution could be transferred to other problem domains that could be used as examples instead.

The deeper issue is that naming conventions are a kind of automatism, and humans shouldn't be bothered with that, they are simply not very good with this type of automatism. For repetitive patterns, automated solutions should be used.

Advocating these repetitive type prefixes reminds me of the classical Design Patterns -- they might be the best option if you're stuck with one set of tools, but IMHO it's important to take a broader view and see if there are maybe other tools that can alleviate the need of such patterns. And indeed, in the case of the classical Design Patterns, there are. More powerful languages render some of them obsolete, or turn them into trivial 3 line solutions that you can hardly call "patterns" anymore.


> The deeper issue is that naming conventions are a kind of automatism, and humans shouldn't be bothered with that, they are simply not very good with this type of automatism. For repetitive patterns, automated solutions should be used.

I agree, and this is what I am trying to get at with my comment: The deeper issue of whether naming conventions are a good way to make code better is much more interesting and important than the question of how to sanitize strings in a web app.


I think what the Haskell people are doing is mostly criticizing the way things are. They are saying: instead of making do with a fudge that won't be followed or applied consistently, why not mechanize as much as possible into the tools?

When there is no choice you must make do, but amongst those with choice, why not look for better tools? Haskell monads have already been mentioned. I await easy-to-use dependent types and/or statically verified contracts. But till then, many powerful but relatively simple concepts can already be used. These give static leverage over many semantic issues. Giving the act of building a program a flavour of the elegance in Euclid's Elements.

As noted, monads are useful and available in any language with parametric polymorphism or dynamic types*. Haskell has Generalized Algebraic Datatypes. Pattern-matching views are available in Haskell, OCaml, F#, and Scala; predicate dispatch is mostly there in Clojure. People should be doing as much as possible to reduce errors and to mechanize what they can.

*I'm aware monads become really powerful with higher kinds and that you give up some guarantees when you drop purity and even more when you go dynamic but it still remains a powerful and useful mechanism due to the body of theory backing it.


> Giving the act of building a program a flavour of the elegance in Euclid's Elements.

This is the most gorgeous description I have ever seen of this approach to programming.


Theoretically, in a strongly and statically typed language like Haskell, shouldn't it be possible to fix the safe/unsafe string problem by having a separate string type for unsanitized input?

Not that I've tried this approach myself, but I'd be curious to hear from anyone who has.


Yes, I have implemented something like this and I still do it from time to time. For example, I don’t always use plain strings for things like SSNs/SINs and phone numbers; sometimes I create a brand new class.

You can do this in any language today by boxing strings. In an “untyped” language like Ruby, you’ll get runtime errors if you try to use an UnsafeString where you expect a String, and in a statically typed language you’ll get a compiler error.

Some languages make this easier than others, but the main problem is not whether you can use types or boxes to solve the problem. The main problem is that if you are using libraries/frameworks that don’t support your types or boxes, you still have to be careful to place boxing and unboxing code wherever your code interacts with the library.

Taking a web framework as an example, if its template system expects PUS (plain unsafe strings) and it provides post parameters as PUS, it’s up to you to box things up so that your code never passes PUS straight through the system. I agree that once you box up a PUS it won’t get inserted into a template, but the pain point is that library authors don’t implement anything like this so you are going to have to deal with the rather broad interface between your code and almost every library that does something with a string.

Which is not to say that types are or are not better than coding conventions, but rather to say that in theory types are an airtight solution in any language, but in practice there is still broad scope for human error thanks to libraries.


Yes, that's possible in Haskell. And it's actually one of the canonical examples.

And even in C you could wrap your (char*)s up in different structs. The typing overhead would be higher than in Haskell, though.

(There's even a language extension to allow you to overload string literals. But you probably don't want that, because

    s = TaintedString "My string literal"
is easier to read than:

    s :: TaintedString
    s = "My string literal"
The overloading makes more sense for implementation details, such as using a different String implementation (e.g. ByteString instead of a linked list of Chars). You actually want to be able to see the semantic differences.)


Re. second one, wouldn't it also be possible to write

    s = "My string literal" :: TaintedString
?

It's not shorter than the constructed version, but it should work; you sometimes need it to get the "right" kind of number out of a literal when the type inference gets confused (or, more commonly, in the REPL).


Yes, that's possible with the right compiler extensions. For this example it's essentially the same thing as giving the type on its own line, so I chose the more common syntax.
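(For anyone curious, the extension referred to above is OverloadedStrings. A minimal sketch of wiring it up, with TaintedString as before: with the extension on, a string literal can take on any type that has an IsString instance.)

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.String (IsString (..))

newtype TaintedString = TaintedString String
  deriving Show

-- Every string literal in scope can now silently become tainted,
-- which is exactly the readability trade-off discussed above.
instance IsString TaintedString where
  fromString = TaintedString

s :: TaintedString
s = "My string literal"   -- the literal is wrapped implicitly

main :: IO ()
main = print s  -- prints TaintedString "My string literal"
```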


> Not that I've tried this approach myself, but I'd be curious to hear from anyone who has.

Tom Moertel (http://blog.moertel.com/articles/2006/10/18/a-type-based-sol...) has.


The right set of coding conventions can improve readability and writability. Take the conversion functions in the article: the cognitive load of XfromY is quite low. You don't have to consult the documentation to find out what the function is called or how it works. You don't have to think about the name when writing it, or about how it should behave. In a setting without conventions that function could have a dozen names, another dozen argument orders, and any number of error-handling styles. There would be no way you would call it without consulting the docs.

As for encoding semantics into the type system: it would be great and would help eliminate bugs, but it wouldn't do a single thing to reduce the mental overhead outlined above. Basically the two approaches solve the same problem by different paths: one makes doing the right thing easy, the other makes doing the wrong thing hard. Full semantic analysis probably isn't possible in any reasonable way, but catching the low-hanging fruit should be.


> If so, do you think coding conventions help more than they hinder? In other words, do they increase code safety without imposing too much of a burden on readability?

I think there's a relationship between the number of people working on a codebase and how effective/burdensome coding conventions are. The more people on a project, the more friction there will be (at first) getting everyone to adhere to conventions. And this assumes there are code reviews in place.

I think this curve happens starting much lower than you'd think, too. Good hackers will be picky about conventions as soon as a codebase reaches ~4 hackers.

Thoughts?


> I think this curve happens starting much lower than you'd think, too. Good hackers will be picky about conventions as soon as a codebase reaches ~4 hackers.

I don't think so. I wouldn't have any problems working with GNU Style or Apps Hungarian if it was an interesting project.


I would much rather work on a project with just about any convention, no matter how disagreeable, than a project with four different pet conventions that vary from file to file.


Yep, me too.


I'd have to take some time to think. I'm not convinced that coding conventions can solve any semantic problems that a good type system couldn't also fix.


Following up: I'm almost certain that a rigorously followed coding convention is entirely equivalent to a sufficiently powerful type system. The coding conventions are statically fixed at compile time, and are therefore decidable at compile time.

One possible advantage that conventions have is that they can be easier to understand, and they can be transparently and painlessly broken when they don't make sense.

In other words, they're equivalent to a weak (ie, can have holes punched in it), informal type system.


My combinatorial explosion detector is going off.

Encoding a 'u' into variable names for "unsafe" strings and omitting it for ones that have been encoded safely for HTML bodies is nice. If this is the only problem to solve, and you can universally apply this technique, then it solves the "NUL-terminated string of the 2010s" and I may use it. But…

What about encoded safely for URLs?

What about encoded safely for use as a Unix file name?

What about encoded safely for SQL? (yes, don't do that, I know).

How about encoded safely for 7 bit ASCII only applications?

How about the output of base64? It isn't unsafe for HTML, but it is for URLs.

What about that bit of HTML that the user entered using their WYSIWYG field that needs to be sanitized but not encoded? 'u' seems right, but Encode(uVar) isn't the proper handling.

How about that library that doesn't know the convention, so its function calls all look like they return strings encoded safely for HTML, since they don't start with 'u'?

What about the library that used 'u' for UTF-8?

(Oh, and…

  dosomething()
  cleanup()
… is broken. Requiring a caller of dosomething() to remember to cleanup() is about as polite as leaving set bear traps in your living room furniture and reminding guests to check under the cushions before they sit. Programmers are used to that sort of abuse (remember to fclose() what you successfully fopen()), but the future should really provide us with better constructs, and it not being 1980 anymore, we are the future.)


> the future should really provide us with better constructs, and it not being 1980 anymore, we are the future

And of course, the 80s already provided better constructs[0][1]

[0] http://www.lispworks.com/documentation/HyperSpec/Body/s_unwi...

[1] http://www.gnu.org/software/smalltalk/manual-base/gst-base.h...


Yes, exactly, the meaning of 'safe string' depends on its context.

I personally believe that all data that is sensitive to escaping issues should be escaped on output (or storage / sending to the database / whatever) by default, unless you explicitly opt it out.

In our framework, we maintain 'already escaped' strings as a separate class, forcing the developer to acknowledge this fact. The HTML output layer escapes all strings unless they are instances of this safe class. Similarly for the SQL layer, all code gets quoted unless it was explicitly marked as 'safe'.


With eco templates, I output strings with <%=. To output unescaped strings such as templates, I must write <%-. This works for me.


Do you treat SQL queries as strings in your code, or do you build something like an abstract syntax tree?


We build them using a syntax-builder framework, but it is possible to pass in raw SQL strings in exceptional circumstances. Obviously, the developer needs to be totally aware of the risks of this. Not ideal, but needed to solve a couple of problems.


    urlTheLink
    filename
    sqlSomeQuery
    asciiText
    b64DataBlob
    ...
You're allowed to use prefixes longer than just one letter.


Except now the combinatorial explosion gets even worse. How do you prefix a b64 blob which has been SQL-escaped, urlencoded and htmlencoded? `urlhtmlsqlb64String`?


It seems like 'combinatorial explosion' might not be the right phrase here. There is a combinatorial explosion of possible prefixes, but, since you're not going to search the possible-prefix space, it doesn't really matter. (For example, there is a combinatorial explosion of unprefixed variable names, and nobody's bothered by that!) The question is whether you can decode a given prefix, and, even in your example, that's easy. (Prefix ambiguities could easily be resolved with underscores, like url_html_sql_b64_string.)

I'm not arguing for this convention (I don't much like it myself); but I don't think your argument is a very strong point against it.

EDIT: Oops, sorry, I just noticed that it wasn't you who brought up the explosion in the first place.


uh, why would you have one of those except to crap on the idea? have you ever had a sql and html escaped string?


No, but I have had b64 urlencoded htmlencoded strings.


Mmmm, but there's more than one way to skin a cat. To take just the point about safe and unsafe strings - well, you could do that with naming conventions, sure. But then you've got a peer review issue. It helps, but it's not great. So the next thing (in a language which lets you use types in this way) is to create a type of UnsafeString, or SafeString etc. And your language doesn't even have to be that great to let you do this - just having a simple type system. And now you change your write method so it doesn't take a String anymore, it takes a SafeString, and only a SafeString. And you make it so that constructing a SafeString can only be done by Encoding at some point.

Now you have something which can still look clean (how clean may depend on your language, type inference, etc.) without bringing in messy pseudo-Hungarian-ness, which stops me doing the most important thing with my code - reading it easily.

Obviously it's only a silly and trivial example, but I'm not sure this article has dated that well in some regards (or perhaps only really applies to certain types of languages - Visual Basic, for example, pretty much forces you to do something like Hungarian to maintain any semblance of sanity long term).
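A minimal sketch of that SafeString idea (all names assumed; in a real module you would export SafeString and htmlEncode but not the SafeString data constructor, so encoding is the only way to obtain one):

```haskell
newtype SafeString = SafeString String

-- Toy escaper, for the sketch only.
htmlEncode :: String -> SafeString
htmlEncode = SafeString . concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]

-- The sink only accepts encoded strings.
write :: SafeString -> IO ()
write (SafeString s) = putStrLn s

main :: IO ()
main = write (htmlEncode "<script>")  -- prints &lt;script&gt;
-- write "<script>" would be rejected by the compiler
```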


The downside of that approach though is that you end up wrapping the bulk of the API functions you're using to make them conformant. You've got a huge amount of extra work creating and maintaining these components, and a major training workload for any new hires so they know why they can't use the standard options they're used to and what the internal replacements are.

Personally, I'd rather go Hungarian.

Edit - well, yes, you may well be able to do this sort of thing nice and easily in Haskell, but I'm not sure that going with Haskell to avoid the training and maintenance overhead of complicating your Java (or whatever) is a net reduction in workload and hiring difficulty...


This is a Java/Python/etc problem, not a fundamental one. In Haskell, the wrapping process is merely an application of liftM:

    newtype UserGenerated a = UserGenerated a
    instance Monad UserGenerated where
        ... (monadic boilerplate skipped) ...

    apiFunction :: String -> String

    funcOfUnsafe :: UserGenerated String -> UserGenerated String
    funcOfUnsafe = liftM apiFunction


Does it really have to be a monad? Isn't a functor what we're really looking for?


It depends; if all you want to do is lift normal functions to the domain of unsafe operation (where an unsafe input implies unsafe output), then sure:

    newtype Unsafe a = Unsafe a
    
    instance Functor Unsafe where
        fmap f (Unsafe k) = Unsafe . f $ k

    addTwo :: Int -> Int
    addTwo = (+2)

    unsafeAddTwo :: Unsafe Int -> Unsafe Int
    unsafeAddTwo = fmap addTwo
But really, I'm not sure this is the right approach. Even values generated inside my program need to be quoted for inclusion on an HTML page. What you want to avoid is double-quoting, so what you need is simpler:

    data Content = Quoted String | Unquoted String

    output :: [Content] -> Content
    output = Quoted . concatMap f
       where f (Unquoted x) = quote x
             f (Quoted   x) = x
Now the type system ensures that output [output xs] == output xs, i.e. quoting is idempotent, which is what you really want to ensure. Tainted data, I think, is a separate concern. And the solution, in that case, doesn't involve a functor; it involves making sure your library tags everything as Unsafe and that your data validation functions remove that annotation:

    type Params a = Map String (Unsafe a) -- keys may also be unsafe, YMMV

    readHtmlForm :: Request -> Params String
    validateField :: Validatable a => Unsafe a -> a

    main = print . output . (:[]) . Unquoted . validateField . get "foo" . readHtmlForm =<< fakeHttpRequest


That comment illustrates the problem with Haskell: yes, it's nice and logical, but for some reason it is also very hard.


This comment illustrates the problem with comments about problems with Haskell. It doesn't quite understand what it's complaining about.

Functor vs. Monad in this case is a way of talking about how exactly this construct should work --- it's not a meaningless distinction. As another response to my comment mentioned, both a functor and a monad are applicable. The question is, in rough terms, whether two instances of this data type (I'm using non-Haskell-y terms for clarity) can influence each other. So it's not that I was complaining that the parent was wrong; I was complaining that the semantics he was imposing on his quoted strings were too restrictive. Another poster instead mentioned ways in which mine were too loose. So my post was part of a constructive debate on what exactly we want quoted strings to do. That Haskell provides a vocabulary for communicating precisely and tersely is not its fault; it is one of its strengths.


Any monad is also a functor, so they're both right!
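(A quick check of this. In GHC since 7.10, Functor is even a superclass of Monad, and liftM coincides with fmap:)

```haskell
import Control.Monad (liftM)

main :: IO ()
main = print (fmap (+ 1) (Just 41), liftM (+ 1) (Just 41))
-- prints (Just 42,Just 42)
```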


Actually, he probably wants an applicative functor, which sits between functors and monads.

Or perhaps an arrow or a co-monad are better? ;o)


> [...] but for some reason also very hard.

Yes. But that's a feature, not a bug.


It's not hard, it's different.


The way I envisioned using it, yes. The use of a Monad over a Functor is merely because I'm still very much a Haskell newbie.


As siblings point out, how much work you have to do depends on your language and the available facilities. In Java, this might be a huge pain - Haskell, etc. less so. It's more likely that this is done as part of the framework or libraries generally - how many people are, to continue this example, writing their own web framework? In .NET for example, ASP.NET MVC includes an HtmlString class for similar purposes. Other languages can approach it in different ways. You may find as well that even with something like .NET, not the most expressive of type systems, you could still make life fairly easy by providing appropriate implicit type conversions in one direction.


In most languages that support this sort of thing, can't you use SafeString in places where String is expected, as long as you declare SafeString as a sort of String?


I wouldn't want SafeString is-a String, I would want a conversion from SafeString to String that removes whatever encoding or escaping was applied. Otherwise you end up re-encoding values that unbeknownst to you don't need it, which is why there are thousands of terrible PHP bulletin boards out there which won't let you use quotes without mangling them.


In Java, String is a final class.


I'd just like to mention that Perl, which is often underrated, has a "tainted" mode that checks at compile time that you sanitized all inputs, and refuses to run if that isn't the case:

http://perldoc.perl.org/perlsec.html#SECURITY-MECHANISMS-AND...


I used to think that was pretty cool, until I realized it was probably developed well after Haskell. :) It can be of help if you're passing a lot of input to system() etc; it's not really general enough to help with web programming.

I gave up on perl's taint mode when I discovered this bug http://bugs.debian.org/411786 , in which perl randomly sets the taint flag due to a utf8 bug.


According to Wikipedia, Haskell 1.0 was defined in 1990 (http://en.wikipedia.org/wiki/Haskell_%28programming_language...), but Perl has had taint mode since v3 in 1989 (http://en.wikipedia.org/wiki/Taint_checking#History).

Anyway, I'm not sure why it would be less cool even if it were inspired by another language (which I'm sure it was; Perl, like English, elevates borrowing to an art form).


Well, it's a bug. It's not supposed to happen, and AFAIK it occurs only for some (one?) specific version...


> a "tainted" mode that checks at compile time that you sanitized all inputs

Considering how dynamic perl is, and that you can mix tainted and non-tainted values in a single collection (for instance), I don't see how a perl program could be statically analyzed for taintedness misuses.


The program dies at runtime if the runtime detects a misuse of tainted data.

The reality is that nobody uses taint mode, though, for whatever reason. If you look at my comment up the page, the problem that people have is not managing the safety of data, it's making sure that they present the right "view" of that data to the right component. HTML needs to be escaped, but not if it's already been escaped, and so on.


> The reality is that nobody uses taint mode, though, for whatever reason.

Uh? I do use it; it's extremely effective. I'm probably not the only one :)


> The program dies at runtime if the runtime detects a misuse of tainted data.

Right, so it's not at compile time. Thank you.


Your point about language mattering is spot on. If your language has the right features (even just `typedef`), I don't think you need any flavor of Hungarian. In the right context, though, it can be useful.

For example, I do a lot of Excel+VBA programming. In VBA, the native data type is `Variant`. And if you pass something from Excel to a VBA function, that's what you get. But you also know that legal Excel values are just a subset of what VBA allows in a `Variant`. So I find it very useful to stick a clue to myself in the variable name, so say, `f(parm)` becomes `f(xParm)`.

I have libraries that work with the `x` subset of `Variant` (and subsets of that - i.e. `xs` denotes a simple value that can go in a single Excel cell). The prefixes provide useful clues that the language can't reasonably help me with.


I agree with all of this except the operator-overloading rant. Unless, of course, we extend it to forbidding overloading of any function name for more than one type.

I have done a bunch of mathematical programming in java, and you end up writing ".add()" a lot. Maybe .add is on a finite field. Maybe it is some matrices. Maybe complex numbers. I don't see what I am saving over operator overloading.

Of course, if I used .addFiniteField, .addMatrix, .addComplex, then I can see the (possible) benefit, of knowing how "heavy" the operation is at a glance. However, I don't think it is worth the cost, given I lose genericness.
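For contrast, a Haskell sketch of what the overloading buys here: one generic + whose meaning depends on the type, which is roughly the genericness the .addMatrix style gives up. (Z5, integers mod 5, is an invented example.)

```haskell
newtype Z5 = Z5 Int deriving (Eq, Show)

-- Only the arithmetic needed for the sketch; a serious instance
-- would think harder about abs/signum for a finite field.
instance Num Z5 where
  Z5 a + Z5 b   = Z5 ((a + b) `mod` 5)
  Z5 a * Z5 b   = Z5 ((a * b) `mod` 5)
  negate (Z5 a) = Z5 (negate a `mod` 5)
  abs           = id
  signum (Z5 0) = Z5 0
  signum _      = Z5 1
  fromInteger n = Z5 (fromInteger n `mod` 5)

main :: IO ()
main = print (Z5 3 + Z5 4)  -- prints Z5 2
```

Generic code written against Num then works for Z5, matrices, complex numbers, and so on, without a family of addX methods.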


It sounds like you would use operator overloading for exactly what it was meant for (and good at)!

I think the usual complaint is that people use operator overloading for non-mathematical operations. For example, a guy doing a webapp might overload the "+" operator for Group and User as a clever way to add a User to a Group. This turns out to be a terrible idea.

In fact, my eyes glaze over as I read Scala code for this very reason, because everyone and their mother defines a method with some random array of symbols because it makes sense to them and they prefer shorthand. It makes the code totally unreadable.


That's the perennial complaint and worry, but in my experience, people use operator overloading in C++ for mainly two things: function objects by overloading operator(), and iterators by overloading the dereferencing operators and increment operators.


At any given place, people use at most 30% of C++ features -- the bad thing is that these 30% subsets usually overlap only a little. People use operator overloading in C++ for all sorts of crazy reasons, like I/O, for instance -- for even better (or maybe worse) examples, see boost; boost::lambda will be a good starting point.

Good abstractions are good, but operator overloading combined with templates and implicit casting is not one of them.


shrug

People always talk about what can happen, but I've never seen instances of operator overloading in C++ that I objected to in a fundamental way.


iostream, for one. It's a real pain to deal simultaneously with this fancy bit-shift I/O syntax on the one hand, and i18n on the other -- fancy syntax is way inferior to format strings when you want to support multiple languages.

Also, please, take a look at the boost::lambda or other boost libraries -- they have large DSLs written on top of operator overloading which look nice when you look at them, but are terribly complicated to use and ridiculously hard to debug, thanks to the completely unintelligible messages that the compilers produce.

The most probable reason you've never seen it is that most sane, experienced people know the danger and avoid it. When more than one person is working on a project, it's far more important for the code to be easy to follow and debug than to look fancy.


I've made heavy use of Spirit, see: http://news.ycombinator.com/item?id=2912729

You are correct in that Spirit is difficult to debug, but that has nothing to do with operator overloading. That has to do with its extensive use of template metaprogramming and C++'s lack of concepts (http://en.wikipedia.org/wiki/Concepts_(C%2B%2B)). If they switched the whole library over to using named member functions, the debugging problems would still be there. And concepts would go a long way to preventing the compiler problems.

I've looked at boost::lambda, and decided it was mostly a novelty. How far towards a real lambda could you get without lambda support in the compiler? That far, it turns out, which was not far enough. Remembering the rules needed to use it in non-trivial situations was more effort than writing a one-off function object. While I agree boost::lambda is more trouble than it's worth, I haven't seen or heard of terrible things happening because people used it.

I also think it's worth noting that many Boost libraries are purposefully on the edge of what is possible in C++. It's like a refereed sandbox for experimenting.

I've also had no problems with << and >> as the stream read and write operators. But then, I also have no experience with internationalization. I don't see any obvious problems, though -- what are they?

It's possible I've seen about the same amount of C++ as you (perhaps in some different areas) and come to a different conclusion.


While I agree boost::lambda is more trouble than it's worth, I haven't seen or heard of terrible things happening because people used it.

And this is precisely my point: nothing bad could happen because people haven't used it in serious projects, even though it is possible -- it's just more trouble than it's worth. This is actually the whole point of argument against operator overloading: it is troublesome for both implementors and users, and the value provided is minor - most people use operator overloading for math-like types (bignums, complex numbers, matrices) anyway. On the other hand, it requires (sometimes considerably) more work on the designers' and implementors' side, and more caution and experience on the programmer's side.

That said, I'm not against operator overloading myself. However, I'm heartily against operator overloading in C++, because it interacts really, really badly with other C++ features (especially the implicit casting rules (which are bad anyway) and manual memory management). Please see [1] for more information; it's explained better and more fully there than I could manage here.

Regarding bit-shifting I/O, I've never figured out why they thought it was a good idea at all. It isn't any more terse. If it's combined with output formatting, it's much harder to figure out what's actually going to be written than with printf-like functions, because the formatting and expressions are mixed together. The only advantage I can think of is that it can be a little faster than parsing the format string every time. But then, you don't need to parse it every time. If the language design allowed it, they could have introduced something like CL's define-compiler-macro, but I think the utter mess created by the presence of such a tool in C++ would be overwhelming -- it's bad enough as it is, with template "metaprogramming" instead of real metaprogramming facilities.

The internationalization issues I mentioned make this completely useless: the usual way to support multiple interface languages is to use something like gettext for strings in code (see e.g. Qt Linguist for C++ implementation). Basically, for every string that's supposed to be user visible, you use a special translating function, i.e. instead of

printf("Hello, %s, it's %s!\n", name, day_of_week);

you write

printf(tr("Hello, %s, it's %s!\n"), name, day_of_week);

This simple approach does not map at all to the bit-shift I/O -- I'm fairly sure that experienced C++ hackers could create something along the lines of

cout << (translate << "Hello, " << name << ", it's " << day_of_week << "!\n");

but this is obviously useless, because not all (not even many) natural languages have the syntactic structure of English. A better option would be:

cout << (translate("Hello, %{1}, it's %{2}!\n") << name << day_of_week);

but again, it's clearly only printf with unnecessarily convoluted syntax.

[1] -- http://yosefk.com/c++fqa/operator.html


Yes, that aspect of operator overloading is nasty. I think one thing that makes operator overloading worse is that there are few reasons to use 'add()' rather than 'addUser()', but once you decide to use operator overloading, there are limited options.

C++ libraries which do abuse operator overloading (see the boost::spirit parser library as the best example) have this problem. For example, Spirit wants to define the standard '*' and '+' parsing operators. However, 'a*' and 'a+' aren't valid C++, while '+a' (unary plus) and '*a' (dereference) are, so those are used.


Spirit doesn't bother me, although it's a glaring exception I did not think of to my above response. Its use of operator overloading doesn't bother me because that's the whole point of the library; you know what you're getting into ahead of time. It let me write EBNF code directly in C++: https://github.com/scotts/cellgen/blob/master/src/cellgen_gr...

The Spirit code is unlikely to be mixed in with other code. Or, put another way, I was never bothered by your complaint, because I always read it as EBNF notation first.

My real problem with Spirit was performance. It took three minutes to compile a 7,000-line C++ program - more with optimizations turned on. The binary with debugging information turned on was 24 MB. And the enormous amount of copying it performed at runtime meant parsing 300-line files took a few seconds. (Luckily the runtime performance of my compiler was never important.) The new version may have fixed these problems.


OCaml and especially Haskell have more sane policies towards operators.

You can make up your own operators. That's not always readable, but at least if you see some funny symbols, you know that you have to look it up.

With Haskell's typeclasses (and I guess OCaml's functors should allow the same) you can have functions of the same name that work differently for different types. E.g. there's a Show typeclass, whose instances have to provide a `show :: a -> String' function. The Num typeclass includes operators such as `(+) :: a -> a -> a', which means that + always takes two arguments of the same type and returns that same type. So you wouldn't be able to overload + to add a user to a group.

You might come across somebody writing

    (+!) :: User -> Group -> IO ()
for adding a user to a group (in an un-idiomatic way). But at least there's a funny symbol to alert you, instead of a re-used normal one.


Also known as "How to put a second type system into your language". Personally, I prefer to have the language's type system handle it.


Nonono, the real solution for the HTML escaping problem is to use a template system which defaults to escaping all variables. Using a naming convention is much less robust.

If you need to pass in something that doesn't need to be escaped, you switch off the escaping explicitly (`<% var name ESCAPE=NONE %>`), and/or you put HTML strings into a separate type which doesn't get escaped upon substitution.

For example the template engine would return a `HTMLString` instead of a `String` object, so that it never accidentally encodes twice (as could happen in the case of includes).


Taking this even further, an HTML templating system that involves supplying syntactic HTML as strings is fundamentally broken. Ideally you'd never provide a string that ends up as syntactic HTML. You'd build up a data structure that represents your output and a serializer would do the rest, escaping everything correctly since there would be a type-based distinction between HTML structure and element content. Genshi is an example of a templating system that works on data rather than syntactic string snippets.


Life isn't always that simple.

For example, sometimes you get a piece of HTML that you have to include verbatim for political reasons (ad banners come to mind that need to be included character by character because of the Terms of Service - see AdSense for example).

In that case you really need an option for direct, unescaped HTML inclusion.

Or you might need to highlight a syntax that's hard to parse, when the only available highlighter spits out HTML pieces instead of parse trees.

My point is that a template system should be able to deal with such realities, and it doesn't make it "fundamentally broken".


Also, while you can build a company on the motto "people feeling uncomfortable with an HTML generation library have no business adding a list tag somewhere", there are certainly cases where it makes sense for them to do so.


Folks, you are all missing the context of the article. He is writing in VBScript - a now-obsolete, stripped-down version of VB. It was created as an alternative to JavaScript, but was much more limited in its OO capabilities. VBScript was kind-of-OO in that it allowed you to consume objects, but it didn't initially allow you to define your own classes.

So you are working in a language which doesn't allow you to define custom types. You can work with built-in types like strings and integers, but if you want to add "metadata" to these basic values, you can't do it in the type system. Hungarian notation is a perfectly legitimate solution given those constraints.

The distinction between "Systems" and "Apps" Hungarian also makes sense in this context. Systems Hungarian encodes the name of the built-in types. This made sense in BCPL (where it originated), but is cargo cult in VBScript, where this is already expressed in the type system. Apps Hungarian allows you to express "custom types" which are not expressible in the type system of VBScript and therefore makes sense. But it is cargo-cult programming in any language which allows you to create custom types or objects - which is basically any other modern language.

(The point about exceptions being "invisible gotos" also makes sense in the context of VBScript, where exception handling was done through statements like "On Error GoTo ..." and "On Error Resume Next".)


How in the world are people still making the front page with years-old Joel posts? If you want to read this stuff, go visit his blog and read it.


For those curious, the way one would do this in a language with a better type system (or objects) would be some sort of opaque object/type which you can't peer into, and which provides a function/method to extract the String within and which also sanitizes it along the way.

So for Java, you might have a class with the String stored as a private variable and the obvious getter/setter methods. In Haskell, you'd have a newtype with similar functions. (Can this be done in C? I don't know.)

A close approximation to Joel's example (string sanitation) in Haskell with newtypes: http://blog.moertel.com/articles/2006/10/18/a-type-based-sol...


I've been doing something a little like this in CoffeeScript for a while. For those who don't know, CS adopts Ruby's style of string interpolation ("formatted #{string}"), and like Ruby, only applies it for double-quoted strings. My policy is to only ever compose strings using interpolation, and to only ever write double-quoted strings when they contain interpolation. String literals are strictly single-quoted.

That way, I know that single-quoted strings, which is most of them, won't ever involve user input, while double quoted strings get a little extra scrutiny. It's not XSS protection by itself, but it's a useful little way to increase the odor of the code one way or another.


Off the topic of unclean code, but there are some great Spolsky articles that are more business-oriented on Inc's site: http://www.inc.com/author/joel-spolsky



