
> So a lot of the parser code is fixing the HTML author mistakes.

This is probably my biggest problem with most of the ideology behind HTML5. You learn from past mistakes, and HTML has no way to teach you anything because there is no error-correction-test feedback cycle.

How do you learn proper HTML? You write something and load it in a browser. Does it show properly? It was right. Does it show in a funny way? There must be something wrong. What? Who knows. Nobody knows, because you cannot, by definition, create _wrong_ HTML5. It is just that you did not write what you think you wrote.

There are validators you can use to check your page. But most of the time they will tell you that your page is wrong even though it looks exactly as you intended in the browser. The browser is not complaining and the page looks good... these validators must be too picky, a waste of time.

The problem that killed XHTML was the draconian error handling in most browsers. Only Opera had a good way to handle XHTML errors: a banner that told you that the page had an error, and which error. Below that banner, the page was rendered as well as possible. That was a good way to learn (and to tell people that the person who wrote the site was not a pro).



I'd argue that forgiving HTML parsing is one of the main reasons the web got as big and broad as it did.

http://quandyfactory.com/blog/39/the_virtue_of_forgiving_htm...


I disagree. It's also what allowed the "IE-only web" to persist for about five years.

It might have been a good thing until about 1997, but at that point there was no shortage of people creating new web content, and raising the barrier to entry would have done no harm. And a lot of innovation in browser features might have happened sooner (due to increased competition between browsers.)


There was great competition between browsers through the 1990s. Unfortunately, they were competing through 'value add' proprietary extensions and browser lock-in, which is orthogonal to the issue of whether HTML parsing should be more permissive or more draconian.


Somewhat. Microsoft reverse-engineered a lot of Netscape's rendering quirks, as well as adding their own. Being quite liberal in what it accepted certainly didn't hurt IE adoption. (And a lot of these things are now in black and white in the HTML5 spec.)


Related: http://diveintohtml5.info/past.html

"[W]hy do we have an <img> element? Why not an <icon> element? Or an <include> element? Why not a hyperlink with an include attribute, or some combination of rel values? Why an <img> element? Quite simply, because Marc Andreessen shipped one, and shipping code wins."


> The problem that killed XHTML was the draconian error handling in most browsers.

I'd argue just the opposite. Browsers treated XHTML doctypes as "tag-soup" HTML4. Firefox would only parse the document strictly as XML if you served it with the XHTML MIME type (application/xhtml+xml), in which case you lost progressive rendering and your site would seem slow to the user.

The key point here is that XML wasn't doing any favors to the browser vendors - their rendering model just didn't work that way internally.

The net result is a gazillion 'XHTML' documents which aren't actually XML. Now you can't start throwing up end-user warnings, or half the web would appear to be broken. So, admit it was a dubious idea to begin with and start over.
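The gap between the two parsing models is easy to reproduce outside a browser. A minimal sketch using Python's standard library (not what browsers actually run, but it illustrates the same split): the same not-quite-XML markup is rejected outright by a strict XML parser and happily consumed by a forgiving HTML parser.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Markup typical of 'XHTML' pages that are not actually well-formed XML:
# the <br> tag is never closed, which XML forbids.
markup = "<p>Hello<br>world</p>"

# A strict XML parser rejects it outright (draconian error handling).
try:
    ET.fromstring(markup)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# A forgiving HTML parser just keeps going and reports the tags it saw.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(markup)

print(xml_ok)          # False: the XML parser refused the document
print(collector.tags)  # ['p', 'br']: the HTML parser carried on
```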


> What? Who knows. Nobody knows because you cannot, by definition create _wrong_ HTML 5.

HTML5 defines pretty strict conformance requirements for authors. That's a separate thing from defining error recovery mechanisms for UAs.

You can easily learn what is wrong with your code using the W3C Validator.

http://validator.w3.org/check?uri=http%3A%2F%2Fpornel.net%2F...

which is a big improvement over the old DTD-based one which couldn't verify contents of attributes or structures more than one level deep:

http://validator.w3.org/check?uri=http%3A%2F%2Fpornel.net%2F...


> HTML5 defines pretty strict conformance requirements for authors.

What you are referring to is wrong as in _not valid_; what I was referring to was wrong as in _not working_.

Invalid HTML5 _works_, and so does invalid HTML. At no point will your browser stop and say "Come on, that is not HTML, that is garbage". If there is no such point, then there is no _wrong_ HTML5.

Take this code; it is valid HTML5 (may the XML gods forgive me):

    <!DOCTYPE html>
    <title>My feelings</title>
    I love HTML
    </html>
It will be shown without any problem by a browser. The title will be "My feelings" and the body will be "I love HTML".

The following is invalid HTML5

    <title>My feelings</title>
    I love HTML
Yet, it will be shown "correctly" by browsers without any problem, just like the previous one.

Once such a lax error recovery mechanism is in place _without additional warning in the UI_, how is one able to define what is wrong and what is correct?
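The point can be reproduced with Python's standard-library html.parser (a small sketch, not a real browser's DOM builder): both snippets above, the valid one and the invalid one, parse to exactly the same title and body, and no error is ever raised.

```python
from html.parser import HTMLParser

# The two snippets from the comment above: one valid HTML5, one invalid
# (missing DOCTYPE). Neither makes the parser raise an error.
valid = "<!DOCTYPE html>\n<title>My feelings</title>\nI love HTML\n</html>"
invalid = "<title>My feelings</title>\nI love HTML"

class PageScan(HTMLParser):
    """Collects the title text and the remaining body text."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.body = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.body += data

results = []
for markup in (valid, invalid):
    scan = PageScan()
    scan.feed(markup)
    results.append((scan.title, scan.body.strip()))

print(results[0] == results[1])  # True: both parse to the same content
```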


> how is one able to define what is wrong and what is correct?

There are many arbitrary lines there. There were huge bikeshedding debates in the HTML WG about just how much must be quoted, escaped, declared and closed.

Generally the correct/valid subset is chosen to be free from gotchas as much as possible (only things that behave as expected are allowed).

It's a compromise between best practices and the not-so-pretty but very common code out there.

It's counter-productive to declare 99% of working pages "invalid". With fewer nitpicking errors, validators have a better signal-to-noise ratio and can flag errors that are more likely to cause trouble, and authors are more likely to take those seriously rather than assume the validator is impossible to please.

e.g. misnested tags are disallowed, because it's hard to understand how they are interpreted.

DOCTYPE is required, because it disables emulation of IE5 bugs (Quirks Mode).

OTOH unquoted attributes and some unescaped ampersands are allowed, because most often they're parsed unambiguously in a way that authors expect.
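That last case is easy to see with Python's standard-library html.parser (a sketch, not a browser engine): an unquoted attribute value comes through exactly as the author intended.

```python
from html.parser import HTMLParser

class AttrScan(HTMLParser):
    """Records the attributes of the last start tag seen."""
    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        self.attrs = dict(attrs)

scan = AttrScan()
# Unquoted attribute values are valid HTML5 as long as they contain no
# spaces or special characters, and they parse unambiguously.
scan.feed("<input type=text name=user>")
print(scan.attrs)  # {'type': 'text', 'name': 'user'}
```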


    <!DOCTYPE html>
    <title>My feelings</title>
    I love HTML
it is valid HTML5


You are missing the point: I know that you could add `<!DOCTYPE html>` to make that document valid and you know as well. But whoever writes the second snippet does not know because we are not there to point it out. And if you point it out they will look at you puzzled: "You are saying that it is not valid, but it renders, and in exactly the same way! Why are you making all this fuss about this "validity" thing?"


HTML5 does not mean that any markup is valid. The specification simply defines what a user agent should do when it encounters invalid markup. Previously user agents had to guess what others (IE) did when they encountered a piece of invalid markup.


I don't see much of a problem with loosely typed markup, because markup is not supposed to be written only by highly skilled engineers.

The average Joe is supposed to feel great about writing something that renders the way he or she wants, without having to go into deeper stuff like semantics, validity or even cross-browser compatibility.

That is something that should be left to the people who need to know it in their line of work, isn't it?


> An average Joe is supposed to feel great about writing something that renders

Indeed HTML is great for that, but the problem is that you never "level up". Once your content renders, you are done. A lot of Joes may be interested in how things work behind the scenes, or in making things "correct" rather than "just working". It would be great, from a pedagogical point of view, to have the browser render Joe's content (for instant gratification) but also point out: "Hey, on line 32 you closed </p> before </i>. It should be the other way around because of a rule called nesting; have a look at it." I think we are wasting a lot of man-years around the globe for the lack of such warnings.

In the education of many people, compiler errors and warnings had exactly this function: they let you do whatever you wanted (as long as it was decent), but they would also point out the basic mistakes ("Hey, on line 14 you print the variable prg_name, but that variable has not been initialized; beware").


The downside is that the average Joe thinks this stuff is easy enough and that he should do it professionally. And then:

- The market floods with professionals who antagonize proper web developers, since, to unknowledgeable clients, the result appears to be the same

- The web floods with sites that behave in unpredictable ways in different browsers

- Proper developers carry the burden of dealing with the idiosyncrasies of various browsers

- Browser developers carry the burden of trying to guess their way through amazingly creative atrocities

Although, to be honest, things don't look as grim as they used to regarding the middle two. I just wish some people would stick to HTML and stop brute-forcing JavaScript until it works.


Except all of those things started happening around 1994. And that's why, 15+ years later, we have a spec which defines error conditions rather than just the 'proper' way.

(If you weren't online/sentient back then, sites commonly had these 'Best viewed in Netscape' badges on them.)


I think it started with Mosaic in 1993.



