
> too few programmers understand the difference between a character set and a character encoding

Then why do they have the same names? :)

P.S. http://www.grauw.nl/blog/entry/254 - is this article OK? It's the first Google result for "charset encoding difference".



That article starts out OK and then suddenly tries to argue that you can use the terms interchangeably. You cannot, and you will drown in confusion if you try. Just imagine that tomorrow the Chinese introduce their own character set next to Unicode, but use UTF-8 as its encoding to minimize the number of bytes it takes to represent their language (which makes sense, because character frequency drops off pretty fast and some characters are much more common than others, so you'd like to represent those with one byte). You would then have two character sets sharing one encoding scheme, and the two terms would clearly not mean the same thing.

The fact that the HTTP RFC speaks of 'charset=utf-8' is explained by this part of the spec:

  Note: This use of the term "character set" is more commonly
  referred to as a "character encoding." However, since HTTP and
  MIME share the same registry, it is important that the
  terminology also be shared.

Why does MIME use the 'wrong' terminology? Perhaps because the registry is old and the difference between set and encoding was less obvious and relevant back then. Perhaps it was simply a mistake, a detail meant to be corrected. Perhaps the person who drew it up was inept. Who knows. It doesn't matter; it is still wrong. And don't get me started on the use of "character set" in MySQL...
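To make that concrete, here is a rough Python sketch of what a client does with the 'charset' parameter: it is used purely as the name of an encoding for turning the body bytes back into text. The helper name charset_from_content_type, the sample body, and the iso-8859-1 fallback are my own assumptions for illustration, not something from the spec text quoted above.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the 'charset' parameter of a Content-Type header value.

    Hypothetical helper; the fallback value is an assumption made
    for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter lowercased,
    # or the fallback when the header carries no charset parameter.
    return msg.get_content_charset(default)


# Bytes as they might arrive on the wire (made-up sample body).
body = b"caf\xc3\xa9"

# Despite the parameter being called "charset", what we get back
# is the name of an encoding, and we use it to decode bytes.
encoding = charset_from_content_type("text/html; charset=utf-8")
text = body.decode(encoding)
```

So the header says "charset", but the value only ever gets passed to a decode step.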


Unicode is a character set, and the only character set really worth speaking of. The Unicode character set includes almost every character in every writing system on Earth. A string is a piece of text, i.e. an ordered sequence of characters all taken from the same character set.

A character encoding is a mapping/function/algorithm/set of rules which can be used to convert a string into a sequence of bytes and back again.

A character set may have multiple encodings. UTF-8 and UTF-16 are two possible encodings of the Unicode character set.
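A quick Python sketch of the same point, with a sample string I made up: the string is a sequence of Unicode code points, and the two encodings turn that one sequence into two different byte sequences.

```python
# Sample text (my own choice): "naive" with a diaeresis on the i,
# a space, and two CJK characters. Escapes are used so the source
# stays unambiguous.
text = "na\u00efve \u6f22\u5b57"

# The string itself: characters from the Unicode character set,
# identified by code point. No bytes are involved yet.
code_points = [ord(ch) for ch in text]

# Two encodings of the same characters give different bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decoding with the matching encoding recovers the same string.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_bytes.decode("utf-16-le")

# Decoding with the wrong encoding yields different text, even
# though the bytes are unchanged.
wrong = utf8_bytes.decode("latin-1")
```

Here len(text) is 8 characters, while the UTF-8 form is 13 bytes and the UTF-16 form is 16 bytes: same character set, same string, different encodings.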



