You are completely right, I'm sorry about my previous comment.
The strange thing is that I couldn't find any reference to surrogate pairs in the Python documentation, so I was assuming that the elements of a unicode string were complete code points. That is not the case:
>>> list(u'\U00010000')
[u'\ud800', u'\udc00']
If I had Python compiled with the UTF32 option, this would return a single element, so Python is leaking an implementation detail that can change across builds. That's really really bad...
No, that's the correct behavior. list only incidentally returns one element per character for ASCII strings -- it's not required to. You shouldn't be using list on raw unicode strings.
u'\U00010000'.encode('utf-8')
should produce the same result on every Python version.
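As a quick check, here is a minimal sketch that compares the UTF-8 bytes for U+10000 directly. It assumes a Python where the b'' literal is available (2.6 or later); the names s and encoded are just illustrative.

```python
s = u'\U00010000'
encoded = s.encode('utf-8')

# U+10000 is a four-byte sequence in UTF-8, however the interpreter
# stores the string internally.
assert encoded == b'\xf0\x90\x80\x80'

# Decoding gives back the original string.
assert encoded.decode('utf-8') == s
```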
> You shouldn't be using list on raw unicode strings.
Why? I am using list only to show what the values of s[0] and s[1] are.
What I am saying is that it returns the elements of the underlying representation: a list of 16-bit wide chars (possibly surrogates) if compiled with UTF16, or a list of 32-bit characters if compiled with UTF32.
Are you suggesting that all the string processing (including iteration) should be done on a str encoded in UTF8 instead of using the native unicode type?
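For what it's worth, iterating by code point without depending on the build could look roughly like the sketch below. code_points is a hypothetical helper name, not something from the standard library, and it only assumes that the elements of the string are either whole code points or UTF-16 code units.

```python
def code_points(s):
    """Yield integer code points, joining UTF-16 surrogate pairs.

    Hypothetical helper: elements that are already whole code points
    pass through unchanged; a high surrogate followed by a low
    surrogate is combined into one code point.
    """
    i = 0
    n = len(s)
    while i < n:
        hi = ord(s[i])
        i += 1
        if 0xD800 <= hi <= 0xDBFF and i < n:
            lo = ord(s[i])
            if 0xDC00 <= lo <= 0xDFFF:
                i += 1
                yield 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                continue
        # Not part of a valid pair: yield the element as-is.
        yield hi


# The same answer whether the string holds one element or a pair.
assert list(code_points(u'\U00010000')) == [0x10000]
assert list(code_points(u'\ud800\udc00')) == [0x10000]
```

Unpaired surrogates are yielded unchanged rather than raising an error; whether that is the right policy depends on what the caller wants.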