Sunday, 29 April 2012

Haskell and the World: Unicode and the Common Misuse of ByteString

Haskell's string handling is actually quite good overall. Strings are always encoded consistently, and as long as you don't leave the space of Haskell itself, you probably won't have (m)any problems. Sadly, this has caused some Haskell programmers (including myself) to become a little careless when handling strings of any kind, be they String, ByteString or Text.

So here's the bare minimum you need to know to get by with text in Haskell. Let's first get a grip on some basic terminology…

Charsets, Encodings and Codepoints

People tend to just throw these three together, but they shouldn't — they're all fundamentally different things!

  • A character set is just that. It's a set, and it contains characters. Remember: a set is a bunch of stuff. All elements of a set are unique and a set is inherently unordered. That's it.

  • An encoding defines a function that maps characters from the charset to byte strings.[1] Every encoding function needs to have a retraction, i.e. it needs to be reversible. Given an encoding function f, its domain C (the character set), and its codomain B*_f (the set of valid byte strings for that encoding), an encoding should satisfy the following two properties: f ∘ f⁻¹ = 1_(B*_f) and f⁻¹ ∘ f = 1_C, where 1_X is the identity on X, making it an isomorphism.[2]

  • Code points enumerate a character set by creating another isomorphism between the charset and numbers. It is very important to note that encodings and code points are distinct!
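
To make the distinction concrete, here is a minimal sketch in Haskell. It assumes the text and bytestring packages are available; the names codePointIso and utf8Iso are made up for this illustration. The first property checks the code point isomorphism (ord and chr from Data.Char), the second checks that UTF-8 encoding followed by decoding gives back the original string.

```haskell
import Data.Char (chr, ord)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

-- Code points: characters <-> numbers. No bytes are involved here.
codePointIso :: Char -> Bool
codePointIso c = chr (ord c) == c

-- Encoding: characters <-> byte strings (UTF-8 chosen as one example).
-- Caveat: T.pack replaces lone surrogate code points with U+FFFD,
-- so the round trip is only expected to hold for strings without them.
utf8Iso :: String -> Bool
utf8Iso s = T.unpack (decodeUtf8 (encodeUtf8 (T.pack s))) == s
```

Loaded into GHCi, both codePointIso 'я' and utf8Iso "üя" should evaluate to True.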

Why do we need code points when we already have encodings? The Unicode standard for example defines a pretty exhaustive character set that tries to capture all of the world's languages and various other stuff. The code points segment this set into planes and enumerate it sequentially in a well-defined manner.

This makes it possible to refer to a particular character via a single number, without having to render it graphically (say, because you lack the font) and without having to specify a particular encoding. Lastly, this makes for a portable representation of characters as mere integers.

Char in Haskell

Let U be the Unicode character set, and ü, я ∈ U (Latin small letter u with diaeresis, Cyrillic small letter ya). Let u: U → ℕ be the function assigning code points to elements of U. Then u(ü) = 252, and u(я) = 1103. This is what a Char value in GHC actually represents:

λ> 'ü'
'\252'
λ> 'я'
'\1103'

Another common way of representing code points is as hexadecimal values up to three bytes wide. The letters ü and я are rendered as U+0000FC and U+00044F, respectively, where the first byte stands for the plane the code point belongs to. That byte is usually omitted when the code point lies in the BMP, or Basic Multilingual Plane, which gives U+00FC and U+044F.[3]
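
As a small sketch of this notation, the following formats a Char in the U+XXXX style and computes its plane. The function names showCodePoint and plane are hypothetical helpers for this post; only ord and printf from base are used.

```haskell
import Data.Char (ord)
import Text.Printf (printf)

-- Render a code point as U+ followed by at least four uppercase hex digits.
showCodePoint :: Char -> String
showCodePoint c = printf "U+%04X" (ord c)

-- Each plane holds 0x10000 code points; plane 0 is the BMP.
plane :: Char -> Int
plane c = ord c `div` 0x10000
```

In GHCi, showCodePoint 'ü' gives "U+00FC" and showCodePoint 'я' gives "U+044F"; both characters live in plane 0.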

So the Char data type in Haskell represents characters as Unicode code points, i.e. as numbers. When you show a Char in Haskell (which is what GHCi does when it prints a result), you get back the decimal representation of its code point, except for printable ASCII characters, which are represented as themselves. Non-printable ASCII characters are shown by their special names:

λ> "\0\1\2\3\4\5"
"\NUL\SOH\STX\ETX\EOT\ENQ"

Encoding Functions

Let B be the set of bytes (numbers from 0 to 255) and B*_utf8 the set of valid UTF-8 byte strings. Let f_utf8: U → B*_utf8 be the UTF-8 encoding function. Then f_utf8(ü) = 0xc3bc and f_utf8(я) = 0xd18f. Remember that an encoding function doesn't map characters to numbers, but to byte strings. In the case of UTF-8, such a byte string is between one and four bytes long.

Let's also define f_utf16: U → B*_utf16, the little-endian UTF-16 encoding function. Then f_utf16(ü) = 0xfc00 and f_utf16(я) = 0x4f04. Notice how these byte strings differ from the code points of the letters, U+00FC and U+044F: the two bytes are swapped. That comes down to the fact that we chose LE, since UTF-16, as opposed to UTF-8, depends on endianness.
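
These values can be checked with a short sketch that exposes the encoded bytes as lists of Word8. It assumes the text and bytestring packages; utf8, utf16le and utf16be are names invented here, while the encodeUtf* functions come from Data.Text.Encoding.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16BE, encodeUtf16LE, encodeUtf8)
import Data.Word (Word8)

-- Encode a single character and return the raw bytes.
utf8, utf16le, utf16be :: Char -> [Word8]
utf8    = BS.unpack . encodeUtf8    . T.singleton
utf16le = BS.unpack . encodeUtf16LE . T.singleton
utf16be = BS.unpack . encodeUtf16BE . T.singleton
```

For я, utf8 yields [0xD1, 0x8F], utf16le yields [0x4F, 0x04] and utf16be yields [0x04, 0x4F]. Only the big-endian variant lines up with the digits of the code point U+044F.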

… and why Char8.pack isn't one.

So far, the functions we've seen are isomorphisms, and encoding functions should always be structure preserving! But ByteString.Char8.pack does not fulfil that property and, consequently, does not yield the identity on its domain when composed with its retraction (which an isomorphism does). The supposed "retraction" of pack is unpack.

λ> import qualified Data.ByteString.Char8 as B
λ> (B.unpack . B.pack $ "я") == "я"
False
λ> (B.unpack . B.pack $ "я") == "O"
True

Earlier I said that one way of representing Unicode code points is to render them as 3-byte wide byte strings. Unfortunately, this is not what ByteString.Char8.pack does. Let's look at the documentation of Data.ByteString.Char8:

"Manipulate ByteStrings using Char operations. All Chars will be truncated to 8 bits."

The word truncate is a big red light. You lose information.
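
To see what the truncation amounts to, here is a sketch that models it directly on code points and compares the model with an actual round trip through Char8. The names truncateChar and char8RoundTrip are made up for this illustration, and the model (keep the lowest 8 bits) is my reading of the documentation quoted above.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)
import qualified Data.ByteString.Char8 as B

-- Model of the truncation: keep only the lowest 8 bits of the code point.
truncateChar :: Char -> Char
truncateChar c = chr (ord c .&. 0xFF)

-- What you actually get back after packing and unpacking with Char8.
char8RoundTrip :: String -> String
char8RoundTrip = B.unpack . B.pack

-- The round trip should agree with the model for every input string.
matchesModel :: String -> Bool
matchesModel s = char8RoundTrip s == map truncateChar s
```

For example, я is code point 0x044F; dropping everything but the lowest byte leaves 0x4F, which is the letter O.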

λ> B.pack $ [toEnum 255]
"\255"
λ> B.pack $ [toEnum 256]
"\NUL"

There is no legitimate use case for ByteString.Char8.pack in production code. It exists out of pure laziness, and in order to facilitate that laziness in developers. Even if you're sure you'll only ever process English, it's naïve to assume you're going to get by with ASCII, which is likely to be the only encoding you're not going to have any problems with when using Char8.pack.

Just to drive my point home, I'll use all caps:

WHEN YOU USE ByteString.Char8.pack YOU JUST TAKE A UNICODE CODE POINT AND TRUNCATE IT TO ITS LOWEST BYTE AND ARE STILL PRETENDING IT'S TEXT!

Just stop it already.

Correct Text Handling in Haskell

In most cases, you should probably just use Data.Text.[4] You can even use its IsString instance and add the OverloadedStrings pragma, so you would hardly notice you're using Text and not String.[5] Data.Text.Encoding supplies a set of encode and decode functions to marshal your Text values to and from ByteStrings.

λ> import qualified Data.Text as T
λ> import Data.Text.Encoding
λ> :set -XOverloadedStrings
λ> :t encodeUtf8
encodeUtf8 :: T.Text -> B.ByteString
λ> encodeUtf8 "ü"
"\195\188"
λ> encodeUtf16LE "я"
"O\EOT"

The weird output we get back from this function is actually just a ByteString rendered by its Show instance. But in my opinion, ByteString's Show instance is broken and misleading!

Rendering octets as a String like that makes no sense, because a String promises to hold valid Unicode code points corresponding to the intended characters, which this output does not. It would be much more sensible to just render the hexadecimal values, without making any promise about representing textual data (because ByteStrings are NOT textual data).
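
A hexadecimal rendering of this kind can be sketched with nothing but base and bytestring. The helpers hexByte and hexDump are hypothetical names for this post, not part of any library; showHex comes from Numeric.

```haskell
import qualified Data.ByteString as BS
import Data.Char (toUpper)
import Data.Word (Word8)
import Numeric (showHex)

-- Two uppercase hex digits per byte, zero-padded on the left.
hexByte :: Word8 -> String
hexByte w = map toUpper (pad (showHex w ""))
  where
    pad s = replicate (2 - length s) '0' ++ s

-- Render a whole ByteString as hex digits, making no claim that it is text.
hexDump :: BS.ByteString -> String
hexDump = concatMap hexByte . BS.unpack
```

With this, the UTF-8 bytes of ü come out as "C3BC" and the UTF-16LE bytes of я as "4F04".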

λ> import Data.Hex
λ> hex . encodeUtf8 $ "ü"
"C3BC"
λ> hex . encodeUtf16LE $ "я"
"4F04"

As a side note, you can also use text-icu, which allows for direct conversions of Strings and a more comprehensive treatment of encodings.[6]

tl;dr

  • Don't use ByteString.Char8
  • If you have textual data, you should be representing it as Data.Text
  • ByteStrings have nothing to do with Strings — they're very different from one another.
  • ByteStrings should never be used to represent textual data. As soon as you've encoded a particular piece of text into a ByteString, treat it as binary data, and do not render it via ByteString's Show instance, but only by using the appropriate decoding and encoding functions.
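
The last point can be sketched as follows: bytes only become text again through an explicit decoding step, and that step can fail. The wrapper decodeMaybe is a name invented for this example; decodeUtf8' is the total decoding function from Data.Text.Encoding, which returns an Either instead of throwing.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Decode bytes that are claimed to be UTF-8; Nothing if they are not.
decodeMaybe :: BS.ByteString -> Maybe T.Text
decodeMaybe bytes = either (const Nothing) Just (decodeUtf8' bytes)

-- Encoding and then decoding should give back the original text.
roundTrip :: T.Text -> Maybe T.Text
roundTrip = decodeMaybe . encodeUtf8
```

The two bytes 0xC3 0xBC decode to ü, while the single byte 0xFC (the code point of ü, not its UTF-8 encoding) is rejected with Nothing.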

  1. Technically, that's not true. An encoding can map characters to pretty much anything. Frequencies, Morse signals and other such things are all possible codomains of the encoding function, but we'll restrict ourselves to byte strings.

  2. This is a somewhat simplified view of an encoding, and in practice, it'll be desirable to break this property in order to establish certain equivalence relations; cf. Unicode equivalence. We'll adopt this simplified view for the purposes of this discussion, though. Thanks to dmwit on reddit for pointing this out to me.

  3. If you ever happen to run across funny U+XXXXXX sequences, DuckDuckGo allows you to decode them into their definitions and into decimal.

  4. Data.Text.pack is safe to use ;-)

  5. This extension has some problems of its own, but with Data.Text it should be safe to use.

  6. Thanks to yitz on reddit for recommending this library to me.

2 comments:

Twey said...

A valid use for ByteString.Char8 is for protocols that require ASCII. It's no good for *human* text, but it does have a valid niche.

adimit said...

Yes, though in this post I was mostly concerned with textual data, for which Char8 is indeed the wrong tool.

So my statement is maybe a bit drastic, but except in super-efficient stuff, it's probably still better to avoid it. JSON, for example, can contain variously encoded data, which would make usage of Char8 dangerous.