A Formal Spec for GitHub Flavored Markdown (githubengineering.com)
475 points by samlambert on March 14, 2017 | 89 comments


It's odd how neither this post, nor the spec, nor GitHub's "Mastering Markdown" help page[1], nor the more complete "Basic writing and formatting syntax" page[2], mentions the fact that GitHub treats every newline as a hard break.

CommonMark contains this little sentence to work around its specified behavior, which is left untouched in the GFM spec:

> A renderer may also provide an option to render soft line breaks as hard line breaks.

I'd say whether or not it does this is a rather important thing to mention. When I write Markdown documents for GitHub I have to change my editor settings, only because of this.

[1]: https://guides.github.com/features/mastering-markdown/

[2]: https://help.github.com/articles/basic-writing-and-formattin...


GFM itself leaves that line unchanged because we don't actually change that option in our implementation — the reference implementation `cmark` (which we built upon) supplies the "hardbreaks" option, described as follows:

> --hardbreaks Treat newlines as hard line breaks

We turn this option on when rendering issues, issue comments and so on, but leave it off when rendering blobs (such as README.md). Both are GFM, one just uses this option to make it more conducive to communication.
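To make the difference concrete, here is a toy sketch in Python. It is not cmark's implementation, and `render_paragraphs` is a hypothetical name; it only handles paragraphs and line breaks. A blank line starts a new paragraph, two trailing spaces or a trailing backslash give an explicit hard break, and the `hardbreaks` flag turns every remaining newline into a `<br />` as well.

```python
import html


def render_paragraphs(text: str, hardbreaks: bool = False) -> str:
    """Toy renderer (hypothetical helper): paragraphs and line breaks only."""
    rendered = []
    for block in text.strip("\n").split("\n\n"):
        lines = block.split("\n")
        parts = []
        for i, line in enumerate(lines):
            is_last = i == len(lines) - 1
            # An explicit hard break: two trailing spaces or a trailing backslash.
            explicit_break = line.endswith("  ") or line.endswith("\\")
            content = line.rstrip(" ")
            if not is_last and line.endswith("\\"):
                content = line[:-1]  # drop the backslash that marked the break
            parts.append(html.escape(content.lstrip(" ")))
            if not is_last:
                # With hardbreaks on, every newline becomes <br />;
                # otherwise a plain newline stays a soft break.
                parts.append("<br />\n" if (hardbreaks or explicit_break) else "\n")
        rendered.append("<p>" + "".join(parts) + "</p>")
    return "\n".join(rendered)
```

Under these assumptions, `render_paragraphs("one\ntwo")` keeps the newline as a soft break (the README case), while `render_paragraphs("one\ntwo", hardbreaks=True)` emits a `<br />` (the issue-comment case).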


This is a very important bit of information.

Would you mind making sure this is reflected in the docs?


Good to know. Why is the setting different for issues than for blobs?


This is covered in the post, no?

> There is a fundamental difference between these two kinds of content: the user comments are stored in our databases, which means their Markdown syntax can be normalized (e.g. by adding or removing whitespace, fixing the indentation, or inserting missing Markdown specifiers until they render properly). The Markdown documents stored in Git repositories, however, cannot be touched at all, as their contents are hashed as part of Git’s storage model.


Sure, but I'm still confused as to why the display format setting has anything to do with the on-disk format. To me they seem completely orthogonal.


In general, people have come to expect that hitting return once in a comment field on GitHub will produce a newline, as has been the case for many years, so we try to preserve that expectation. Hence we haven't changed the option used when rendering comments.

Conversely, they don't expect the same from Markdown files stored in their repository (e.g. I put each sentence in a paragraph on its own line for my blog, for easier diffing and editing). Additionally, we couldn't normalise these documents even if we wanted to (to prevent everything breaking by being over-vertically spaced). Hence the option stays off in this case!


Because using the new display format with a non-normalized source would break the display.


> It's odd how neither this post, nor the spec, nor GitHub's "Mastering Markdown" help page[1], nor the more complete "Basic writing and formatting syntax" page[2], mentions the fact that GitHub treats every newline as a hard break.

Well, actually it does say that "Hard line breaks are for separating inline content within a block."

https://github.github.com/gfm/#hard-line-break

So two spaces at the end of the line for a <br>, and an empty line for <p>.


Right, but that's not what GFM does. GFM turns any single newline into a <br>.


Depends on where you use it. All file-backed content (i.e. repo data) does not use it, while the communication systems (PRs and issues) do.


As of this announcement, file backed content _does_ use GFM (try a table!), but not the `hardbreaks` option.


GitHub devs & Markdown enthusiasts at large, please consider contributing some brainpower to these last remaining issues that are blocking the v1.0 release of CommonMark:

https://talk.commonmark.org/t/issues-we-must-resolve-before-...


This is great! A couple of years back, there was a failed attempt at standardizing this: http://www.vfmd.org/ and http://www.vfmd.org/vfmd-spec/specification/. GitHub, given its popularity, will surely have more success.


GitHub's spec here is based on [CommonMark][1], which has been around for a while now, and [was originally authored][2] by a group of representatives from GitHub, Reddit, and Stack Exchange.

[1]: http://commonmark.org/

[2]: https://blog.codinghorror.com/standard-flavored-markdown/


based on, compatible with, or is common mark?

I'd hate to see them pushing a different spec around; that would solve nothing.


> based on, compatible with, or is common mark?

The first two. [The GFM spec][1] is literally just CommonMark with a few extra extensions added. They even highlighted the new sections green in the spec to make it clear where the GFM spec differs from CommonMark. Everything else is word-for-word identical.

[1]: https://github.github.com/gfm/


A quick read of the linked article says this spec provides a few optional superset features on top of CommonMark that CommonMark itself did not contain (like tables, etc).


There was also Common Mark (http://commonmark.org/), which failed IMHO mostly due to John Gruber taking offense at their first choice of name, Common Markdown. Will formalising this as 'GitHub Flavored Markdown' similarly cause offense?


GitHub Flavored Markdown has been a thing for a while. The only difference is that now they have a formal spec for it.

Also, CommonMark failed? News to me. Last I heard it was still under active development, years after the drama with Gruber.


IIRC, their first name choice was Standard Markdown. I don't blame Gruber for being upset at that.


> I don't blame Gruber for being upset at that.

I do, when he has abandoned his project's raggedy implementation yet defends the trademark viciously.


He hasn't abandoned it.


The last release is from 2004, and he is not at all interested in fixing the fact that it is severely underspecified.


It's underspecified for what others want. If it didn't do what Gruber needed it to do, surely he would extend it, no?


You have to go look at the source code of Gruber's implementation to figure out what markdown actually is. Or do some empirical studies with different inputs. That is what I mean by underspecified. His specification is not detailed enough to implement a markdown parser. So in reality it is abandonware.


TFA is literally all about how GFM is based on the CommonMark spec.


The Rust ecosystem has used Markdown for a long time, but is moving to CommonMark as we speak.

Given this news from GitHub, it's very exciting.


It didn't fail. CommonMark is the standard implemented in Pandoc, and the projects share an author.


Did it 'fail'? It seems to have a decent amount of use. And it fixed a bunch of problems with the original Markdown spec (or lack thereof).

I suppose since one of their goals was to get Github and StackExchange 'out of the markdown business' but neither use CommonMark, and further, Github now has put work into creating their own spec, they failed in that aspect.


> Github now has put work into creating their own spec

A big aspect of the post is talking about how GFM is now a set of extensions to CommonMark. This is a huge win, not a failure.


> one of their goals was to get Github and StackExchange 'out of the markdown business' but neither use CommonMark

FWIW, StackExchange [uses CommonMark][1] for their new StackOverflow Documentation site and has [been planning][2] to migrate Q&A to CommonMark for some time now.

[1]: https://meta.stackexchange.com/questions/125148/implement-st...

[2]: https://meta.stackexchange.com/q/238957/192171


Who knows, but it'd be pretty irrational if it did. Whether it was fair or not, the "Common Markdown" name caused a kerfuffle because of the implication that it was claiming to be the One True Markdown. The name "GitHub Flavored Markdown" only implies that it's base Markdown plus GitHub extras, plus it's been known by that name for a long time now.


IIRC, original name was "Standard Markdown" which the original author had issue with... "Common Mark" or "Github Flavored Markdown" not implying such.


Previous discussion on the name: https://news.ycombinator.com/item?id=8270771


It's a specification based on the commonmark specification. Neither is a formal spec. They are more informal specifications with some edge cases listed (in contrast to the original markdown specification, which has known unspecified edge cases).


I really wish there was a concise formal spec for markdown, rather than a multi-page essay. It makes it incredibly difficult for anyone trying to create something to parse it. There is no mechanical way to go from spec -> parser.

I think it's quite difficult to do though.


Is there any common spec where you can mechanically go from spec to parser? HTTP, SMTP, DNS, HTML, CSS, Javascript, Ruby, Python, C, ... I basically know of nothing in widespread use with a spec that can actually be converted directly into working code.


Does implementing a TCP/IP stack using the RFC (as in parsing diagrams straight from the RFC) count?

Then it was done in OMeta [1].

Previous discussion: A full TCP/IP stack in under 200 LoC (and the power of DSLs) [2].

There's also a PNG parser (but it is not parsing any documentation) in 20 lines of OMeta [3].

[1] http://www.moserware.com/2008/04/towards-moores-law-software...

[2] https://news.ycombinator.com/item?id=846028

[3] http://joshondesign.com/2013/03/18/ConciseComputing


For the parser itself, yes. There are parser generators that take a spec(BNF, PEG) and output a parser that can parse the language.

What you do with the parsed tree is up to you though.


Is it really useful to write a formal spec for Github Markdown? The software they use to parse and render it is open source. If you want to know how exactly something works, you can read the source.


Is it really useful to write a formal spec for HTML? The software we use to parse and render it is open source. If you want to know how exactly something works, you can read the source.

;)


Having used and maintained a Swift translation of the StackOverflow .NET markdown processor, please, when there is a proposal to use source as a spec, burn it with fire. Scatter the ashes.


Yes, if you also want to have more implementations, that can differ in e.g. license or programming language.

For instance it is the reason why there's no reimplementation of TeX.


There is pdflatex, luatex,...

Nobody stops you from translating the Ruby or whatever into your favorite language.


Reading the sources and trying to understand them takes a lot longer than looking at a BNF and cranking out a recursive descent parser based on it, or using a parser generator.
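For readers who haven't done this, a minimal sketch of what "grammar in, parser out" looks like by hand, using a toy arithmetic grammar rather than Markdown. The grammar and all names below are illustrative assumptions; each grammar rule maps directly onto one method.

```python
import re

# Toy grammar (EBNF), chosen only for illustration:
#   expr   ::= term (("+" | "-") term)*
#   term   ::= factor (("*" | "/") factor)*
#   factor ::= NUMBER | "(" expr ")"

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(src):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Recursive descent: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(src):
    """Parse and evaluate an arithmetic expression in one pass."""
    parser = Parser(tokenize(src))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value
```

The mapping works because the grammar is unambiguous and each rule can be chosen by looking at one token. That is the property the replies argue Markdown lacks.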


It's not possible to write a BNF for Markdown. At least not an unambiguous, useful BNF.

http://roopc.net/posts/2014/markdown-cfg/


I hate calling things formal which aren't. Also, I hope this puts to rest any idea of Markdown being "simple".


I have to wonder why this isn't done in the form of a context-free grammar, like the one Hitman[0] uses. Specs in English are still too vague for my liking.

[0]: https://github.com/chameco/Hitman


Is it even possible to define a context-free grammar for markdown?


It is not, and not just because of a few idiosyncrasies like C which require context.

Fundamentally, markdown was specified as some pattern matching and English description. The original specification was not done thinking of productions and grammar rules, and you typically don't get there by accident.

See http://roopc.net/posts/2014/markdown-cfg/ for a detailed exposition of how the '*' character in markdown is sufficient to ruin any chance of a CFG.

I have sometimes pondered how to make a markdown-like language with a simple production-based grammar. I have not succeeded and would appreciate any pointers. The criterion is that it has to intrude on the prose as minimally as markdown does.


Is there some kind of more general grammar that can encode the Markdown spec?


There was some discussion of this on the CommonMark forums shortly after the initial release of CommonMark:

https://talk.commonmark.org/t/commonmark-formal-grammar/46

See in particular these comments by maradydd:

https://talk.commonmark.org/t/commonmark-formal-grammar/46/1...

https://talk.commonmark.org/t/commonmark-formal-grammar/46/2...

https://talk.commonmark.org/t/commonmark-formal-grammar/46/2...

I'm not sure they ever came to a conclusive answer (at least on this thread).

Edit: Here is JGM himself saying he doesn't know: https://talk.commonmark.org/t/commonmark-formal-grammar/46/3...


If GitHub is normalizing comments anyway, I wonder if they could have adopted a CFG.

Overall this is a step in the right direction, but the whole saga is a perfect microcosm of our understructure cranking out poorly-understood stuff which comes back to bite us and cannot be tamed.


Interesting. I wonder if there's a thesis in solving this problem.


https://github.github.com/gfm/#disallowed-raw-html-extension...

Why this? This is not a working blacklist to prevent XSS (e.g. onload="...")


Hi there! As the spec explains, this is a Markdown-specific blacklist that prevents the tags that would otherwise "break" the content of the Markdown document.

A document that contains these tags will not be parsed properly by an HTML5 compliant parser; the parser will "swallow" other chunks of Markdown content that come after the tags. Hence, we disable the tags altogether.

This is a UX feature, not a security feature. XSS prevention, and a plethora of other security checks, are performed by our user content stack -- but this functionality is shared for all markup languages in GitHub (MD, RST, ASCIIDOC, ...), so it's not discussed in this spec.
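For illustration, a rough Python sketch of this kind of filter. It is not GitHub's code: the function name `tagfilter` is hypothetical, the tag list is the one in the linked spec section as I read it, and treating closing tags the same way is an assumption. The leading `<` of a listed tag is replaced with `&lt;`, and everything else passes through untouched.

```python
import re

# Tags as listed in the linked "Disallowed Raw HTML" section of the GFM spec
# (as I read it); these change how the rest of the page is parsed.
DISALLOWED_TAGS = (
    "title", "textarea", "style", "xmp", "iframe",
    "noembed", "noframes", "script", "plaintext",
)

# Match a "<" that starts one of the listed tags, case-insensitively.
# Allowing an optional "/" (closing tags) is an assumption of this sketch.
_TAGFILTER = re.compile(
    r"<(?=/?(?:" + "|".join(DISALLOWED_TAGS) + r")(?:[\s/>]|$))",
    re.IGNORECASE,
)


def tagfilter(html_fragment: str) -> str:
    """Neutralise listed tags by escaping only their leading '<'."""
    return _TAGFILTER.sub("&lt;", html_fragment)
```

Note that attributes such as `onload="..."` are deliberately left alone here, which matches the point above: this keeps the rest of the document rendering, and sanitisation happens elsewhere.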


Wow, TIL about the <plaintext> tag. Here I thought I knew most of the corner cases of HTML.


What's bonkers to me is that there is no closing tag, everything after it is no longer parsed as HTML.


If it disallowed certain words (i.e. treated them specially instead of just reproducing the text as written) such as "</plaintext>" it wouldn't be plain text.


It's also a great way to check if user input's being parsed server-side! :)


It's not meant as an xss prevention but as a safety to prevent rendering errors.


This is great -- lack of a standard that was actually used (unlike CommonMark) was one of my main issues with Markdown (http://ericholscher.com/blog/2016/mar/15/dont-use-markdown-f...) -- It's really great to see GitHub leading in this department, and it gives me hope that one day we might actually have Markdown that is portable between implementations.


The github spec is literally CommonMark with extensions.

https://github.github.com/gfm/

http://spec.commonmark.org/0.27/


This is actually very closely based on CommonMark, according to the article.


Perhaps you missed Jeff Atwood (codinghorror)'s standardized markdown spec, which is about as close as you'll get:

http://commonmark.org/


Yea, I mention commonmark in the post, and referred to it in the "standard that was used" part -- commonmark is great, but wasn't widely adopted.

Updated my original post to be more clear.


At the risk of starting a mini-flame war, is RST a more cohesive format? If one was to pick one of the two formats to start using for personal documentation, which format should one choose?


I enjoyed this article http://eli.thegreenplace.net/2017/restructuredtext-vs-markdo... and since the article didn't provide supporting evidence for the Linux/OpenCV/LLVM assertion: https://www.kernel.org/doc/html/latest/doc-guide/sphinx.html and http://docs.opencv.org/2.4/doc/tutorials/introduction/how_to... and https://github.com/llvm-mirror/llvm/blob/master/docs/index.r... respectively

I do think rST is waaay more expressive, but I also recognize that in many of the instances one would want to use markup in a chat or PR situation, the expressiveness likely wouldn't be well received if the trade-off is verbosity.

This is something in life that I file away with competing regex standards: my brain just has to switch languages based on the app in which I'm typing (between markdown, pseudo-markdown (ahem, Slack), org-mode, rST, etc).


For personal documentation (assuming you mean notes): follow your heart. Personally, though, I wouldn't use RST for any non-python public documentation at this point, but for my personal notes, it's hard to beat the extensibility of RST.


Now if only org-mode could define a sane, parsable format.


What do you mean?


HTML used to serve as a simple way to format a document. Now HTML is too complex for that purpose. Introduce markdown. In ten years, markdown will be too complex for formatting documents.

I am a big fan of markdown, in case I didn't make that clear. I love it.


It wasn't an issue of simplicity; it was about HTML being extremely hard to sanitize so that it isn't Turing complete. Having formatting (that can still line up with your site's styles) while knowing you can't get script injection is pretty useful.


I don't think it's that HTML is too complex. It's more that Markdown allows documents to be marked up in a way that is "natural" and easy to read if you're reading the plain text version of it.


This is great news! Does anybody have a recommendation for a Javascript parser for Formal GFM? (I know there are a million JS MD parsers; I'm looking for a good one that will let me serve GFM docs over HTTP and render them on the browser.)



Looks good. Thanks.


So that's why my project wikis have suddenly stopped rendering markdown properly. I've been trying to figure out WTF was going on since yesterday!


I'm really happy to see this. It's actually quite frustrating that although markdown is so nice, it barely has a consistent standard. It's almost impossible to use it cross-service.

Hopefully now that Github has standardised their own flavour of it (and quite a nice flavour too), more people will start to use it.

Of course there is the obligatory XKCD: https://xkcd.com/927/


I would argue that it is consistent now.

At the lowest level, you have commonmark. Then, you have extensions at the top, such as GFM.

If Pandoc/Github/Reddit/SO/kramdown switch, that accounts for almost all front- and back-end cases that I care about.

And given that the first four were actively involved with commonmark, I would take as given that they will support commonmark or a superset of it.


Pandoc already supports different flavors of markdown (commonmark included) and you can add/remove extensions to base flavors.


web2py handled all of these issues and made its markup language extensible with its markmin specification: http://www.web2py.com/init/static/markmin.html


What's wrong with http://www.vfmd.org/ ???


Why no ~~strike out~~ in spec?



Is there no corresponding equivalent of INS? I would have thought something like

    +new text+
could be used...


Jeesh, about time!


Is it ok if I promote my domain here? I read the guidelines but it doesn't mention anything regarding self promotion. Sorry if it ain't appropriate but anyone looking for a relevant domain (markdown.in) please get in touch or any suggestion if it is better to develop it.



