Actually, it's just stored internally as UCS-2. I'm not sure why that really mat...

pilif · on Feb 8, 2011

I doubt that they are encoded in UCS-2, as that character set isn't able to encode every Unicode code point (or even just the majority of them).

You are right though (and this is why I upvoted you back to 1) that you shouldn't care. In fact, you not knowing the internal encoding is the proof of that. In Python (I'm talking Python 3 here, which has done this right), you don't care how a string is stored internally.

The only place where you care about this is when your strings interact with the outside world (i/o). Then your strings need to be converted into bytes and thus the internal representation must be encoded using some kind of encoding.

This is what the .decode and .encode methods are used for.
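A minimal sketch of that boundary, assuming Python 3 and UTF-8 as the external encoding (the sample text and variable names are made up for illustration):

```python
# Text lives in memory as str; bytes only appear at the I/O boundary.
text = "na\u00efve caf\u00e9"  # 10 code points, two of them non-ASCII

# Going out: choose an encoding explicitly and get bytes.
wire_bytes = text.encode("utf-8")

# Coming in: decode the bytes back into str with the same encoding.
round_tripped = wire_bytes.decode("utf-8")

# The byte count differs from the code point count, but the text survives.
print(len(text), len(wire_bytes), round_tripped == text)
```

Nothing in this snippet depends on how the interpreter stores `text` internally, which is the point being made above.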

Have a look at http://diveintopython3.org/strings.html which manages to say this better (and with more words) than I ever would be able to.

eklitzke · on Feb 8, 2011

In Python 2.x, strings are encoded in UCS-2, not UTF-16, at least by default (I'm not sure about Python 3.x; I assume it's the same though). If you want to support every single possible Unicode code point, you can tell Python to do so at compile time (via a ./configure flag).

In practice the characters that aren't in UCS-2 tend to be characters that don't exist in modern languages, e.g. the character sets for Linear B, domino tiles, and cuneiform, so they're not supported since they're not of practical use to most people. There's a fairly good list at http://en.wikipedia.org/wiki/Plane_(Unicode) . In this list, Python by default doesn't support things not in the BMP.

Locke1689 · on Feb 8, 2011

No, the Python internals support surrogates so you can support characters outside the BMP. This makes it (basically) UTF-16.

sedachv · on Feb 9, 2011

Things outside of the BMP aren't just dead languages anymore. You have to be able to support characters outside the BMP if you want to sell your software in China:

http://en.wikipedia.org/wiki/GB_18030

pilif · on Feb 8, 2011

UTF-16 is to UCS-2 as UTF-8 is to ASCII. Meaning: they share the character set. UTF-16 extends UCS-2 by using some reserved characters to indicate that what follows should be interpreted according to UTF-16 rules. So just like UTF-8.

Meaning: every UCS-2 document is also a UTF-16 document, but not the reverse (just like every ASCII document is also a UTF-8 document).
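To make the subset relationship concrete, here is a rough sketch, assuming Python 3. `encode_ucs2_le` is a hypothetical helper (not a standard library function) that writes exactly one 16-bit unit per code point, which is all UCS-2 can do:

```python
import struct


def encode_ucs2_le(text):
    """Hypothetical helper: one little-endian 16-bit unit per code point."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 cannot represent code points above U+FFFF")
    return b"".join(struct.pack("<H", ord(ch)) for ch in text)


# For text that stays inside the BMP, the UCS-2 bytes are valid UTF-16.
bmp_text = "h\u00e9llo \u4e2d"
print(encode_ucs2_le(bmp_text) == bmp_text.encode("utf-16-le"))

# The reverse fails: UTF-16 needs two units for a code point above U+FFFF.
astral_text = "\U00010000"
print(len(astral_text.encode("utf-16-le")))
try:
    encode_ucs2_le(astral_text)
except ValueError as exc:
    print("not UCS-2:", exc)
```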

But as I said below: it doesn't matter, and it could even be a totally proprietary character set, as long as Python's string operations work on that character set and as long as there's a way to decode input data into that set and encode output data from that set.

fhars · on Feb 8, 2011

You should very much care about that, because if your tool stores text as UCS-2, it means that it doesn't support Unicode at all. UCS-2 stopped being a valid encoding a long time ago.

Locke1689 · on Feb 8, 2011

As the parent noted, it can be compiled for UTF-32 support. Just recompile if you need the extra characters.

Edit: Also, turns out it's UTF-16. The configure flag is named ucs2.

ot · on Feb 8, 2011

No, it is really UCS-2:

  >>> unichr(0x10000)
  ------------------------------------------------------------
  Traceback (most recent call last):
    File "<ipython console>", line 1, in <module>
  ValueError: unichr() arg not in range(0x10000) (narrow Python build)

If you want to support code points greater than 0xFFFF you have to recompile with the option UTF32.

I think it must be a constant-length encoding to allow s[i] to be constant time.

Locke1689 · on Feb 8, 2011

Guido has a different opinion: http://mail.python.org/pipermail/python-dev/2008-July/080895...

ot · on Feb 8, 2011

You are completely right, I'm sorry about my previous comment.

The strange thing is that I couldn't find any reference to surrogate pairs in the Python documentation, so I was assuming that the elements of a unicode string were complete code points. But this is not the case:

  >>> list(u'\U00010000')
  [u'\ud800', u'\udc00']

If I had Python compiled with the UTF32 option, this would return a single element, so Python is leaking an implementation detail that can change across builds. That's really really bad...
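For reference, the two values in that list follow from the standard UTF-16 surrogate arithmetic. A small sketch, assuming Python 3 (`to_surrogate_pair` is an illustrative name, not a built-in):

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00

# Cross-check against the codec: two little-endian 16-bit units.
units = struct.unpack("<2H", "\U00010000".encode("utf-16-le"))
print(units == (high, low))
```

On a narrow Python 2 build these two 16-bit units are what indexing and iteration expose.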

Locke1689 · on Feb 8, 2011

No, that's the correct behavior. list only incidentally returns a single character in ASCII strings -- it's not required to. You shouldn't be using list on raw unicode strings.

  u'\U00010000'.encode('utf-8')

should produce the same result on every Python version.
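A quick way to check that claim, sketched here for Python 3 (so without the `u` prefix; the variable names are only for illustration):

```python
encoded = "\U00010000".encode("utf-8")
print(encoded)  # b'\xf0\x90\x80\x80'

# UTF-8 uses four bytes for any code point above U+FFFF.
print(len(encoded))

# Decoding the bytes gives the original code point back.
decoded = encoded.decode("utf-8")
print(hex(ord(decoded)))
```

The encoded bytes are defined by UTF-8 itself, so they do not depend on how a given build stores the string in memory.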

ot · on Feb 9, 2011

> You shouldn't be using list on raw unicode strings.

Why? I am using list only to show what the values of s[0] and s[1] are.

What I am saying is that it returns the list of characters of the underlying representation, so a list of wide chars (possibly surrogates) if compiled with UTF16 or a list of 32-bit characters if compiled with UTF32.

Are you suggesting that all the string processing (including iteration) should be done on a str encoded in UTF8 instead of using the native unicode type?

fedd · on Feb 8, 2011

if you want to deal with characters with high code point numbers you should know the code point stuff. for example, String.length() would return the number of two-byte chars, not the number of real (possibly four-byte) characters, which may confuse someone

//edit: this is about Java

btilly · on Feb 8, 2011

Exactly. A Java char is not synonymous with a Unicode code point. But the majority of the time they are synonymous, older documentation claimed that they were the same, and this is the meme that many Java programmers (in my experience) have.

fedd · on Feb 8, 2011

yes. i write my java-based matrix to be code-points aware so that no-one in Japan and China using it would face any problems.

Locke1689 · on Feb 8, 2011

That's actually my point. Python supports Unicode code points and UTF. If you get the output encoded in UTF-8, it would actually be variable-length chars. What's important is your output encoding, not the internal code point representation.

pieter · on Feb 8, 2011

It leaks through in some places. For example, len(u'\U0001D310') (from the Tai Xuan Jing Symbols) returns 1 on 32-bit wide pythons, and returns 2 on the default 16-bit wide builds.

Locke1689 · on Feb 8, 2011

Nope, that's the correct behavior. Run len on the UTF-encoded string and you'll get the expected result.
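The different notions of "length" being argued about here can be put side by side. A sketch assuming Python 3.3 or later, where len on a str counts code points regardless of how the interpreter was built (`length_report` is a made-up helper name):

```python
def length_report(text):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for text."""
    code_points = len(text)
    utf16_units = len(text.encode("utf-16-le")) // 2
    utf8_bytes = len(text.encode("utf-8"))
    return code_points, utf16_units, utf8_bytes


print(length_report("\U0001D310"))  # (1, 2, 4)
print(length_report("abc"))         # (3, 3, 3)
```

The middle number is what a narrow Python 2 build reports for len, and it is also the unit Java's String.length() counts in.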