Haskell's string handling is actually quite good overall.
Strings are always encoded consistently, and as long as you don't leave the space of Haskell itself, you probably won't have (m)any problems. Sadly, this has caused some Haskell programmers (including myself) to become a little careless when handling strings of any kind, be they
So here's the bare minimum you need to get by with text in Haskell. Let's first get a grip on some basic terminology…
Charsets, Encodings and Codepoints
People tend to just throw these three together, but they shouldn't — they're all fundamentally different things!
A character set is just that. It's a set, and it contains characters. Remember: a set is a bunch of stuff. All elements of a set are unique and a set is inherently unordered. That's it.
An encoding defines a function that maps characters from the charset to byte strings.1 Every encoding function needs to have a retraction, i.e. it needs to be reversible. Given $f$, an encoding function, $C$ its domain (the character set), and $B^*_f$ its codomain (the set of valid byte strings for that encoding), an encoding should satisfy the following two properties: $f \cdot f^{-1} = 1_{B^*_f}$ and $f^{-1} \cdot f = 1_C$, making it an isomorphism.2
Code points enumerate a character set by creating another isomorphism between the charset and numbers. It is very important to note that encodings and code points are distinct!
Why do we need code points when we already have encodings? The Unicode standard for example defines a pretty exhaustive character set that tries to capture all of the world's languages and various other stuff. The code points segment this set into planes and enumerate it sequentially in a well-defined manner.
This makes it possible to refer to a particular character via a single number, without having to render it graphically (say, because you lack the font) and without having to specify a particular encoding. It also makes for a portable representation of characters as mere integers.
Char in Haskell
Let $U$ be the Unicode character set, and $ü, я \in U$ (Latin small letter u with diaeresis, Cyrillic small letter ya). Let $u: U \to \mathbb{N}$ be the function assigning code points to elements of $U$. Then $u(ü) = 252$ and $u(я) = 1103$. This is what a
Char value in GHC actually represents:
λ> 'ü'
'\252'
λ> 'я'
'\1103'
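The correspondence between a Char and its code point can also be made explicit with `ord` and `chr` from `Data.Char`. The following is only a small sketch; the names `codePoint` and `fromCodePoint` are made up for illustration.

```haskell
import Data.Char (chr, ord)

-- | The code point of a character, playing the role of the function u
-- from the text. (Hypothetical helper name.)
codePoint :: Char -> Int
codePoint = ord

-- | The opposite direction: from a code point back to the character.
-- 'chr' raises an error for numbers outside the Unicode range.
fromCodePoint :: Int -> Char
fromCodePoint = chr

main :: IO ()
main = do
  print (codePoint 'ü')              -- 252
  print (codePoint 'я')              -- 1103
  print (fromCodePoint 1103 == 'я')  -- True
```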
Another common way of representing code points is as 3-byte-wide hexadecimal values. The letters ü and я are rendered as U+0000FC and U+00044F, respectively, where the first byte stands for the plane the code point belongs to. That byte is sometimes omitted when the code point lies in the BMP, or basic multilingual plane.3
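As a rough sketch, the U+ notation can be produced from a Char with `showHex` from `Numeric`. The helper `uPlusPadded` is hypothetical (it is not a library function); its first argument is the number of hexadecimal digits to pad to, so 6 gives the 3-byte form used above and 4 gives the shorter form with the plane byte left out.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- | Render a character's code point in U+ notation, left-padded with
-- zeros to at least the given number of hex digits.
-- (Hypothetical helper for illustration.)
uPlusPadded :: Int -> Char -> String
uPlusPadded width c = "U+" ++ replicate (width - length digits) '0' ++ digits
  where
    digits = map toUpper (showHex (ord c) "")

main :: IO ()
main = do
  putStrLn (uPlusPadded 6 'ü')  -- U+0000FC
  putStrLn (uPlusPadded 6 'я')  -- U+00044F
  putStrLn (uPlusPadded 4 'я')  -- U+044F
```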
Char data type in Haskell represents characters as Unicode code points, i.e. as numbers. When you
λ> "\0\1\2\3\4\5"
"\NUL\SOH\STX\ETX\EOT\ENQ"
Let $B$ be the set of bytes (numbers from 0 to 255) and $B^*_{utf8}$ the set of valid UTF-8 byte strings. Let $f_{utf8}: U \to B^*_{utf8}$ be the UTF-8 encoding function. Then $f_{utf8}(ü) =$ 0xc3bc and $f_{utf8}(я) =$ 0xd18f. Remember that an encoding function doesn't map characters to a number, but to byte strings. In the case of UTF-8, this is one or more bytes.
Let's also define $f_{utf16}: U \to B^*_{utf16}$, the little-endian UTF-16 encoding function. Then $f_{utf16}(ü) =$ 0xfc00 and $f_{utf16}(я) =$ 0x4f04. Notice how these byte strings differ from the hexadecimal code points U+0000FC and U+00044F: the two low bytes appear in swapped order. That comes down to our choice of LE, since UTF-16, as opposed to UTF-8, depends on endianness.
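The byte values above can be checked with the encoding functions from `Data.Text.Encoding` (part of the text package, which this post recommends further down). This is a sketch only; `utf8Bytes` and `utf16leBytes` are hypothetical helper names, and the example assumes the text and bytestring packages that ship with GHC.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16LE, encodeUtf8)
import Data.Word (Word8)

-- | The UTF-8 encoding of a single character, as a list of octets.
utf8Bytes :: Char -> [Word8]
utf8Bytes = BS.unpack . encodeUtf8 . T.singleton

-- | The little-endian UTF-16 encoding of a single character.
utf16leBytes :: Char -> [Word8]
utf16leBytes = BS.unpack . encodeUtf16LE . T.singleton

main :: IO ()
main = do
  print (utf8Bytes 'ü')     -- [195,188], i.e. 0xc3 0xbc
  print (utf8Bytes 'я')     -- [209,143], i.e. 0xd1 0x8f
  print (utf16leBytes 'ü')  -- [252,0],   i.e. 0xfc 0x00
  print (utf16leBytes 'я')  -- [79,4],    i.e. 0x4f 0x04
```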
… and why
Char8.pack isn't one.
So far, the functions we've seen are isomorphisms, and encoding functions should always be structure preserving! But
ByteString.Char8.pack does not fulfil that property, and, consequently, it doesn't form the identity on its domain when composed with its retraction (which an isomorphism does.) The supposed "retraction" of
λ> import qualified Data.ByteString.Char8 as B
λ> (B.unpack . B.pack $ "я") == "я"
False
λ> (B.unpack . B.pack $ "я") == "O"
True
Earlier I said that one way of representing Unicode code points is to render them as 3-byte-wide hexadecimal values. Unfortunately, this is not what
ByteString.Char8.pack does. Let's look at the documentation of
Manipulate ByteStrings using Char operations. All Chars will be truncated to 8 bits.
The word truncate is a big red light. You lose information.
λ> B.pack $ [toEnum 255]
"\255"
λ> B.pack $ [toEnum 256]
"\NUL"
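To make the truncation concrete, here is a small model of what happens to each Char: only the low 8 bits of its code point survive. The helpers `truncateChar` and `survivesChar8` are hypothetical names introduced for this sketch; they describe the observable behaviour, not how the library is implemented internally.

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString.Char8 as B
import Data.Char (chr, ord)

-- | Keep only the low 8 bits of a character's code point.
-- (Hypothetical model of the truncation described above.)
truncateChar :: Char -> Char
truncateChar c = chr (ord c .&. 0xFF)

-- | True when a String comes back unchanged from B.pack / B.unpack,
-- which is only the case if every code point is below 256.
survivesChar8 :: String -> Bool
survivesChar8 s = B.unpack (B.pack s) == s

main :: IO ()
main = do
  print (truncateChar 'я')                               -- 'O'
  print (B.unpack (B.pack "я") == map truncateChar "я")  -- True
  print (survivesChar8 "hello")                          -- True
  print (survivesChar8 "я")                              -- False
```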
There is no legitimate use case for
ByteString.Char8.pack in production code. It exists out of pure laziness, and in order to facilitate that laziness in developers. Even if you're sure you'll only ever process English, it's naïve to assume you're going to get by with ASCII, which is likely to be the only encoding you're not going to have any problems with when using
Just to drive my point home, I'll use all caps:
WHEN YOU USE
ByteString.Char8.pack YOU JUST TAKE A UNICODE CODE POINT, TRUNCATE IT TO ITS LOWEST BYTE, AND STILL PRETEND IT'S TEXT!
Just stop it already.
Correct Text Handling in Haskell
In most cases, you should probably just use
Data.Text.4 In the case of
Data.Text, you can even use its
IsString instance and add the
OverloadedStrings pragma so you would never notice you're using
Text and not
Data.Text.Encoding supplies a couple of very nice
decode functions to marshal your
Text values to and from
λ> import qualified Data.Text as T
λ> import Data.Text.Encoding
λ> :set -XOverloadedStrings
λ> :t encodeUtf8
encodeUtf8 :: T.Text -> B.ByteString
λ> encodeUtf8 "ü"
"\195\188"
λ> encodeUtf16LE "я"
"O\EOT"
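Unlike `Char8.pack`, these encoders do have a working retraction. A minimal sketch of the round trip, assuming the text package that ships with GHC; `roundTripsUtf8` and `roundTripsUtf16LE` are hypothetical names for illustration:

```haskell
import qualified Data.Text as T
import Data.Text.Encoding
  (decodeUtf16LE, decodeUtf8', encodeUtf16LE, encodeUtf8)

-- | Encode to UTF-8 and decode again; True when the original text
-- comes back. 'decodeUtf8'' reports invalid input as a Left value.
roundTripsUtf8 :: T.Text -> Bool
roundTripsUtf8 t = case decodeUtf8' (encodeUtf8 t) of
  Right t' -> t' == t
  Left _   -> False

-- | The same check for little-endian UTF-16.
roundTripsUtf16LE :: T.Text -> Bool
roundTripsUtf16LE t = decodeUtf16LE (encodeUtf16LE t) == t

main :: IO ()
main = do
  let sample = T.pack "üя"
  print (roundTripsUtf8 sample)     -- True
  print (roundTripsUtf16LE sample)  -- True
```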
The weird output we get back from this function is actually just a
ByteString rendered by its
Show instance. But in my opinion, that
Show instance is broken and misleading!
Rendering octets as a
String like that makes no sense, because a
String makes a guarantee that it represents valid Unicode code points corresponding to the intended characters. Which this does not. It would be much more sensible to just render the hexadecimal values, without making any promise about representing textual data (because
ByteStrings are NOT textual data.)
λ> import Data.Hex
λ> hex . encodeUtf8 $ "ü"
"C3BC"
λ> hex . encodeUtf16LE $ "я"
"4F04"
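If you would rather not pull in an extra package for this, a similar hexadecimal rendering can be sketched with `showHex` from base. The helpers `hexByte` and `hexBytes` are hypothetical names, not part of any library.

```haskell
import qualified Data.ByteString as BS
import Data.Char (toUpper)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16LE, encodeUtf8)
import Data.Word (Word8)
import Numeric (showHex)

-- | Render one octet as exactly two uppercase hex digits.
hexByte :: Word8 -> String
hexByte w = map toUpper (pad (showHex w ""))
  where
    pad s = replicate (2 - length s) '0' ++ s

-- | Render a ByteString as hex digits, with no claim that it is text.
hexBytes :: BS.ByteString -> String
hexBytes = concatMap hexByte . BS.unpack

main :: IO ()
main = do
  putStrLn (hexBytes (encodeUtf8 (T.pack "ü")))     -- C3BC
  putStrLn (hexBytes (encodeUtf16LE (T.pack "я")))  -- 4F04
```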
As a side note, you can also use text-icu, which allows for direct conversions of
Strings and a more comprehensive treatment of encodings.6
- Don't use
- If you have textual data, you should be representing it as
ByteStrings have nothing to do with
Strings — they're very different from one another.
ByteStrings never should be used to represent textual data. As soon as you've encoded a particular piece of text into a
ByteString, treat it as binary data, and do not render it via
Show instance, but only by using the appropriate decoding and encoding functions.
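One way to follow these rules at the boundary of a program is to decode incoming bytes exactly once and to treat failure as a normal outcome. The sketch below assumes the input is meant to be UTF-8; `decodeText` is a hypothetical helper built on `decodeUtf8'` from `Data.Text.Encoding`.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- | Decode bytes as UTF-8, returning Nothing instead of throwing
-- when the input is not valid UTF-8. (Hypothetical helper.)
decodeText :: BS.ByteString -> Maybe T.Text
decodeText = either (const Nothing) Just . decodeUtf8'

main :: IO ()
main = do
  -- 0xc3 0xbc is the UTF-8 encoding of ü, so this decodes fine.
  print (decodeText (BS.pack [0xC3, 0xBC]) == Just (T.pack "ü"))  -- True
  -- 0xff never occurs in valid UTF-8, so decoding is rejected.
  print (decodeText (BS.pack [0xFF]))                             -- Nothing
```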
1. Technically, that's not true. An encoding can map characters to pretty much anything. Frequencies, Morse signals and other such things are all possible codomains of the encoding function, but we'll restrict ourselves to byte strings.
2. This is a somewhat simplified view of an encoding, and in practice, it'll be desirable to break this property in order to establish certain equivalence relations; cf. Unicode equivalence. We'll adopt this simplified view for the purposes of this discussion, though. Thanks to dmwit on reddit for pointing this out to me.
3. If you ever happen to run across funny U+XXXXXX sequences, DuckDuckGo allows you to decode them into their definitions and into decimal.
4. Data.Text.pack is safe to use ;-)
5. This extension has some problems of its own, but with Data.Text it should be safe to use.
6. Thanks to yitz on reddit for recommending this library to me.
A valid use for ByteString.Char8 is for protocols that require ASCII. It's no good for *human* text, but it does have a valid niche.
Yes, though in this post I was mostly concerned with textual data, for which Char8 is indeed the wrong tool.
So my statement is maybe a bit drastic, but except in super-efficient stuff, it's probably still better to avoid it. JSON, for example, can contain variously encoded data, which would make usage of Char8 dangerous.