Haskell's string handling is actually quite good overall. Strings are always encoded consistently, and as long as you don't leave the space of Haskell itself, you probably won't have (m)any problems. Sadly, this has caused some Haskell programmers (including myself) to become a little careless when handling strings of any kind, be they String, ByteString or Text.
So here's the bare minimum you need to get by with text in Haskell. Let's first get a grip on some basic terminology…
Charsets, Encodings and Codepoints
People tend to just throw these three together, but they shouldn't — they're all fundamentally different things!
A character set is just that. It's a set, and it contains characters. Remember: a set is a bunch of stuff. All elements of a set are unique and a set is inherently unordered. That's it.
An encoding defines a function that maps characters from the charset to byte strings.1 Every encoding function needs to have a retraction, i.e. it needs to be reversible. Given $f$, an encoding function, $C$ its domain (the character set), and $B^*_f$ its codomain (the set of valid byte strings for that encoding), an encoding should satisfy the following two properties: $f \cdot f^{-1} = 1_{B^*_f}$ and $f^{-1} \cdot f = 1_C$, making it an isomorphism.2
Code points enumerate a character set by creating another isomorphism between the charset and numbers. It is very important to note that encodings and code points are distinct!
Why do we need code points when we already have encodings? The Unicode standard for example defines a pretty exhaustive character set that tries to capture all of the world's languages and various other stuff. The code points segment this set into planes and enumerate it sequentially in a well-defined manner.
This makes it possible to refer to a particular character via a single number, without having to render it graphically (say, because you lack the font) and without having to specify a particular encoding. Lastly, this makes for a portable representation of characters by mere integers.
Char in Haskell
Let $U$ be the Unicode character set, and $\text{ü}, \text{я} \in U$ (Latin small letter u with diaeresis, Cyrillic small letter ya). Let $u: U \to \mathbb{N}$ be the function assigning code points to elements of $U$. Then $u(\text{ü}) = 252$, and $u(\text{я}) = 1103$. This is what a Char value in GHC actually represents:
λ> 'ü'
'\252'
λ> 'я'
'\1103'
Another common way of representing code points is as 3-byte wide hexadecimal values. The letters ü and я are rendered as U+0000FC and U+00044F, respectively, where the first byte stands for the plane this code point belongs to. That byte is usually omitted when the code point lies in the BMP, or basic multilingual plane, giving U+00FC and U+044F.3
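The U+ notation is easy to produce from a Char, which may help when comparing GHCi's decimal output with Unicode charts. The following is a minimal sketch using only base; the names uPlus and roundTrips are made up for this illustration, and the four-digit minimum follows the common convention of dropping the plane byte inside the BMP.

```haskell
import Data.Char (chr, ord, toUpper)
import Numeric (showHex)

-- | Render a character's code point in U+XXXX notation, zero-padded to at
-- least four hexadecimal digits (hypothetical helper, not a library function).
uPlus :: Char -> String
uPlus c = "U+" ++ replicate (4 - length digits) '0' ++ digits
  where
    -- 'showHex' yields lowercase digits without padding.
    digits = map toUpper (showHex (ord c) "")

-- | 'chr' undoes 'ord': going to the code point and back is the identity.
roundTrips :: Char -> Bool
roundTrips c = chr (ord c) == c
```

With these definitions, uPlus 'ü' evaluates to "U+00FC" and uPlus 'я' to "U+044F". Code points outside the BMP simply come out longer, e.g. "U+1F600".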
So the Char data type in Haskell represents characters as Unicode code points, i.e. as numbers. When you show a Char or String in Haskell (which is what print and GHCi do), you get back a decimal representation of each code point, except for printable ASCII characters, which are represented as themselves. Non-printable ASCII characters are assigned special names:
λ> "\0\1\2\3\4\5"
"\NUL\SOH\STX\ETX\EOT\ENQ"
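The escaping seen above comes from the Show instance rather than from the String itself. As a small sketch using only the Prelude (rendered and restored are names invented here), show produces the escaped form and read recovers the original characters from it:

```haskell
-- | The escaped form of a String, as produced by its 'Show' instance.
rendered :: String -> String
rendered = show

-- | 'read' parses the escaped form back into the original characters,
-- so no information is lost by the escaping.
restored :: String -> String
restored = read . show
```

For instance, rendered "für" is the ten-character string "f\252r" including its surrounding double quotes, while restored leaves any String unchanged. Writing the String with putStrLn instead of print skips the escaping and emits the characters in the handle's encoding.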
Encoding Functions
Let $B$ be the set of bytes (numbers from 0 to 255) and $B^*_{\mathrm{utf8}}$ the set of valid UTF-8 byte strings. Let $f_{\mathrm{utf8}}: U \to B^*_{\mathrm{utf8}}$ be the UTF-8 encoding function. Then $f_{\mathrm{utf8}}(\text{ü}) =$ 0xc3bc and $f_{\mathrm{utf8}}(\text{я}) =$ 0xd18f. Remember that an encoding function doesn't map characters to a number, but to byte strings. In the case of UTF-8, this is one or more bytes.
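To make the "one or more bytes" part concrete, here is a hand-rolled sketch of UTF-8 for a single Char, using only base. It is an illustration under simplifying assumptions, not how any library implements it; utf8Bytes is a hypothetical name, and real code should use an existing encoder.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | UTF-8 bytes for one character (illustrative sketch).
-- Assumption: surrogate code points (U+D800..U+DFFF) are not rejected here,
-- although a conforming encoder must not emit them.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]                                  -- 1 byte: ASCII
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                           -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]                  -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]         -- 4 bytes
  where
    n = ord c
    -- Leading bits of the code point, placed into the lead byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: marker bits 10 followed by six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)
```

Applied to the two example letters, utf8Bytes 'ü' gives [195,188] (0xc3, 0xbc) and utf8Bytes 'я' gives [209,143] (0xd1, 0x8f), two bytes each, whereas any ASCII character yields a single byte.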
Let's also define $f_{\mathrm{utf16}}: U \to B^*_{\mathrm{utf16}}$, the little-endian UTF-16 encoding function. Then $f_{\mathrm{utf16}}(\text{ü}) =$ 0xfc00, and $f_{\mathrm{utf16}}(\text{я}) =$ 0x4f04. Do you notice how the bytes appear in the opposite order compared to the code points 0x00fc and 0x044f? It all comes down to the fact that we chose LE, since UTF-16, as opposed to UTF-8, depends on endianness.
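The byte order can be seen directly in a small sketch of little-endian UTF-16 for a single Char, again using only base. The name utf16leBytes is invented for this example, and surrogate input is not validated, so treat it as an illustration rather than a reference encoder.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- | UTF-16LE bytes for one character (illustrative sketch).
utf16leBytes :: Char -> [Word8]
utf16leBytes c
  | n < 0x10000 = le n                    -- BMP: a single 16-bit unit
  | otherwise   = le hiSur ++ le loSur    -- other planes: a surrogate pair
  where
    n     = ord c
    n'    = n - 0x10000
    hiSur = 0xD800 + (n' `shiftR` 10)     -- top ten bits
    loSur = 0xDC00 + (n' .&. 0x3FF)       -- bottom ten bits
    -- Little endian: low byte first, then high byte.
    le u  = [fromIntegral (u .&. 0xFF), fromIntegral (u `shiftR` 8)]
```

Here utf16leBytes 'я' is [79,4], i.e. 0x4f followed by 0x04. Swapping the two elements in le would give the big-endian variant.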
… and why Char8.pack isn't one.
So far, the functions we've seen are isomorphisms, and encoding functions should always be structure preserving! But ByteString.Char8.pack does not fulfil that property, and, consequently, it doesn't form the identity on its domain when composed with its retraction (which an isomorphism does). The supposed "retraction" of pack is unpack.
λ> import qualified Data.ByteString.Char8 as B
λ> (B.unpack . B.pack $ "я") == "я"
False
λ> (B.unpack . B.pack $ "я") == "O"
True
Earlier I said that a way of representing Unicode code points was by rendering them as 3-byte wide byte strings. Unfortunately, this is not what ByteString.Char8.pack does. Let's look at the documentation of Data.ByteString.Char8:
Manipulate ByteStrings using Char operations. All Chars will be truncated to 8 bits.
The word truncate is a big red light. You lose information.
λ> B.pack $ [toEnum 255]
"\255"
λ> B.pack $ [toEnum 256]
"\NUL"
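The truncation can be modelled in a few lines, which also gives a cheap way to detect when it would bite. This is a sketch assuming the bytestring package that ships with GHC; truncated, isLossy and agreesWithModel are hypothetical names.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Bits ((.&.))
import Data.Char (chr, ord)

-- | Model of a Char8 round trip for one character:
-- only the low 8 bits of the code point survive.
truncated :: Char -> Char
truncated c = chr (ord c .&. 0xFF)

-- | True when Char8.pack would silently change the string,
-- i.e. when some code point does not fit into one byte.
isLossy :: String -> Bool
isLossy = any (> '\255')

-- | Checks that pack followed by unpack behaves like the model above.
agreesWithModel :: String -> Bool
agreesWithModel s = B.unpack (B.pack s) == map truncated s
```

For example, truncated 'я' is 'O' because 1103 is 0x44f and its low byte 0x4f is the code point of 'O', which matches the GHCi session further up.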
There is no legitimate use case for ByteString.Char8.pack in production code. It exists out of pure laziness, and in order to facilitate that laziness in developers. Even if you're sure you'll only ever process English, it's naïve to assume you're going to get by with ASCII, which is likely to be the only encoding you're not going to have any problems with when using Char8.pack.
Just to drive my point home, I'll use all caps:
WHEN YOU USE ByteString.Char8.pack YOU JUST TAKE A UNICODE CODE POINT AND TRUNCATE IT TO ITS LOWEST BYTE AND ARE STILL PRETENDING IT'S TEXT!
Just stop it already.
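If you really do need raw ASCII bytes, one way to keep the truncation from happening silently is to check the input first. The following is only a sketch of that idea, assuming the bytestring package; packAscii is a hypothetical helper, not something the library provides.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (isAscii)

-- | Pack a String only if every character is ASCII, so nothing can be
-- truncated; otherwise refuse instead of corrupting the data.
packAscii :: String -> Maybe B.ByteString
packAscii s
  | all isAscii s = Just (B.pack s)
  | otherwise     = Nothing
```

Returning Maybe forces the caller to decide what happens with non-ASCII input, rather than discovering the damage later.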
Correct Text Handling in Haskell
In most cases, you should probably just use Data.Text.4 In the case of Data.Text, you can even use its IsString instance and add the OverloadedStrings pragma so you would never notice you're using Text and not String.5 Data.Text.Encoding supplies a couple of very nice encode and decode functions to marshal your Text values to and from ByteStrings.
λ> import qualified Data.Text as T
λ> import Data.Text.Encoding
λ> :set -XOverloadedStrings
λ> :t encodeUtf8
encodeUtf8 :: T.Text -> B.ByteString
λ> encodeUtf8 "ü"
"\195\188"
λ> encodeUtf16LE "я"
"O\EOT"
The weird output we get back from this function is actually just a ByteString rendered by its Show instance. But in my opinion, ByteString's Show instance is broken and misleading!
Rendering octets as a String like that makes no sense, because a String makes a guarantee that it represents valid Unicode code points corresponding to the intended characters. Which this does not. It would be much more sensible to just render the hexadecimal values, without making any promise about representing textual data (because ByteStrings are NOT textual data).
λ> import Data.Hex
λ> hex . encodeUtf8 $ "ü"
"C3BC"
λ> hex . encodeUtf16LE $ "я"
"4F04"
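If you would rather not pull in an extra package for this, a hex rendering is only a few lines on top of base, bytestring and text. This is a sketch, with hexBytes, utf8Hex and utf16leHex as made-up names; note that it produces lowercase digits, unlike the session above.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16LE, encodeUtf8)
import Numeric (showHex)

-- | Render every byte as exactly two lowercase hexadecimal digits.
hexBytes :: BS.ByteString -> String
hexBytes = concatMap byteHex . BS.unpack
  where
    -- 'showHex' does not pad, so single digits get a leading zero.
    byteHex w = case showHex w "" of
      [d] -> ['0', d]
      ds  -> ds

-- | Hex dump of the UTF-8 encoding of a Text value.
utf8Hex :: T.Text -> String
utf8Hex = hexBytes . encodeUtf8

-- | Hex dump of the little-endian UTF-16 encoding of a Text value.
utf16leHex :: T.Text -> String
utf16leHex = hexBytes . encodeUtf16LE
```

With these, utf8Hex (T.pack "ü") is "c3bc" and utf16leHex (T.pack "я") is "4f04".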
As a side note, you can also use text-icu, which allows for direct conversions of Strings and a more comprehensive treatment of encodings.6
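Going the other way, from bytes back to Text, deserves the same care, since not every byte string is valid UTF-8. Below is a minimal sketch using decodeUtf8' from Data.Text.Encoding, which reports invalid input as a value instead of throwing; decodeMaybe and roundTrip are names invented for this example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- | Decode bytes as UTF-8, yielding Nothing for invalid input.
decodeMaybe :: BS.ByteString -> Maybe T.Text
decodeMaybe = either (const Nothing) Just . decodeUtf8'

-- | Encoding and then decoding gives back the original Text,
-- which is the isomorphism property discussed earlier.
roundTrip :: T.Text -> Bool
roundTrip t = decodeMaybe (encodeUtf8 t) == Just t
```

The byte 0xff, for instance, can never occur in UTF-8, so decodeMaybe (BS.pack [0xff]) is Nothing, whereas roundTrip holds for any Text value.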
tl;dr
- Don't use ByteString.Char8
- If you have textual data, you should be representing it as Data.Text
- ByteStrings have nothing to do with Strings; they're very different from one another.
- ByteStrings never should be used to represent textual data. As soon as you've encoded a particular piece of text into a ByteString, treat it as binary data, and do not render it via ByteString's Show instance, but only by using the appropriate decoding and encoding functions.
1. Technically, that's not true. An encoding can map characters to pretty much anything. Frequencies, Morse signals and other such things are all possible codomains of the encoding function, but we'll restrict ourselves to byte strings.
2. This is a somewhat simplified view of an encoding, and in practice, it'll be desirable to break this property in order to establish certain equivalence relations; cf. Unicode equivalence. We'll adopt this simplified view for the purposes of this discussion, though. Thanks to dmwit on reddit for pointing this out to me.
3. If you ever happen to run across funny U+XXXXXX sequences, DuckDuckGo allows you to decode them into their definitions and into decimal.
4. Data.Text.pack is safe to use ;-)
5. This extension has some problems of its own, but with Data.Text it should be safe to use.
6. Thanks to yitz on reddit for recommending this library to me.
2 comments:
A valid use for ByteString.Char8 is for protocols that require ASCII. It's no good for *human* text, but it does have a valid niche.
Yes, though in this post I was mostly concerned with textual data, for which Char8 is indeed the wrong tool.
So my statement is maybe a bit drastic, but except in super-efficient stuff, it's probably still better to avoid it. JSON, for example, can contain variously encoded data, which would make usage of Char8 dangerous.