Go is Weird: Strings
Having done extensive programming in C, I am not particularly spoiled when it comes to the idiosyncrasies of a language’s “string” type. Yet, Go’s string type keeps tripping me up — why does it all still have to be that complicated?
There are two answers to this question:
- Because strings, in a globalized world, are complicated
- Because the designers of the Go language made some non-intuitive choices
tl;dr: What is a String?
Generally speaking, to a programmer, a “string” is an array of characters.
But Go’s string type is not a string in this sense. It’s not even a UTF-8 string. Instead, it’s an immutable slice of bytes. That’s right: a Go string is not a sequence of characters, it’s also not a sequence of “runes”, but a sequence of bytes.
The bytes in the byte slice can contain anything: their content and format are not restricted and are completely arbitrary. In particular, there is no requirement that the bytes are valid UTF-8. Go makes a strict separation between the data structure itself (string) and the interpretation of its contents (we will come back to that below).
In fact, the only real difference between Go’s string and an (immutable) byte slice is that a string can be transformed into a collection of “runes” (essentially, characters) by one of two built-in mechanisms: either explicitly (using runes := []rune(str)) or implicitly, as part of the for-range construct (for idx, rune := range str {...}). It is in this transformation, and only here, that the encoding of the information contained in the bytes matters, and where Go requires the use of UTF-8.
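A minimal sketch of both mechanisms (the string and variable names are of course arbitrary; I use r rather than rune as the loop variable, to avoid shadowing the type name):

```go
package main

import "fmt"

func main() {
	str := "abc"

	// Explicit conversion to a slice of runes.
	runes := []rune(str)
	fmt.Println(runes) // [97 98 99]

	// Implicit conversion in a for-range loop:
	// idx is the byte offset, r is the decoded rune.
	for idx, r := range str {
		fmt.Println(idx, r)
	}
}
```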
The primary source of confusion is that the two most commonly-used operations on Go sequences (namely len(s) and s[i]), when applied to strings, operate on bytes, not characters or “runes”: len(s) returns the number of bytes (not characters) in the string, and s[i] returns the byte at position i, not the character.
What makes this doubly confusing is that, when the string contains only 7-bit ASCII characters, both len(s) and s[i] seem to do the right thing.
In a way, the worst of all worlds.
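A small sketch of how this plays out for a string containing a single non-ASCII character (the example string is arbitrary):

```go
package main

import "fmt"

func main() {
	s := "héllo" // the "é" occupies two bytes in UTF-8

	fmt.Println(len(s))         // 6: bytes, not characters
	fmt.Println(s[1])           // 195: the first byte of the two-byte "é"
	fmt.Println(len([]rune(s))) // 5: the number of runes ("characters")
}
```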
What’s a Character — String Storage and Encodings
The behavior of the Go string type makes more sense when one realizes that Go strings make a strict distinction between the data storage and its interpretation.
Obviously, a sequence of bytes, by itself, has no semantics at all: we need some out-of-band information to interpret the bytes appropriately (the bytes might contain a PNG-encoded image, for instance). Even when we know that the sequence of bytes contains textual data, we still need information about the encoding to break the byte sequence into characters. The problem now is that there is not a single encoding — in fact, multiple encodings coexist (UTF-8, UTF-16, UTF-32 are only the most common).
Go’s string data type tries to accommodate all possibilities by separating the data storage from the encoding: the string type handles the storage, but does not enforce a particular choice of encoding.
Go only expects a particular encoding when converting a string to a sequence of characters (or “runes”). Go provides two mechanisms for doing so:
- explicitly: runes := []rune(str)
- implicitly, in a for-range loop: for idx, rune := range str { ... }
In both of these cases, Go expects the string to be encoded using UTF-8; invalid byte sequences are replaced by the replacement character \uFFFD (which is usually rendered like this: �).
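A quick sketch of the replacement behavior (the byte \xff is simply not valid UTF-8; the example is arbitrary):

```go
package main

import "fmt"

func main() {
	s := "a\xffb" // the middle byte is not valid UTF-8

	// Explicit conversion: the invalid byte becomes the replacement character.
	fmt.Printf("%q\n", []rune(s))

	// Implicit conversion in a for-range loop: the invalid byte at
	// offset 1 likewise decodes to \uFFFD.
	for idx, r := range s {
		fmt.Printf("%d: %q\n", idx, r)
	}
}
```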
There are other ways to perform character-level operations on a string variable, which make the encoding explicit: the packages unicode/utf8 and unicode/utf16 provide functions such as RuneCountInString(string) (but not RuneAt(i)!). Also, note that the top-level package is unicode, not encoding!
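For instance (a sketch, using the same arbitrary example string as before):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "héllo"

	fmt.Println(len(s))                    // 6: bytes
	fmt.Println(utf8.RuneCountInString(s)) // 5: runes

	// Decode the first rune and its width in bytes.
	r, size := utf8.DecodeRuneInString(s)
	fmt.Printf("%q %d\n", r, size) // 'h' 1
}
```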
Go’s rune data type, by the way, is simply an alias for int32: a type large enough to hold any code point, using any of the common encodings (including UTF-32). It does not have any other special meaning — you can do arithmetic with runes, if you like. (In the same spirit, byte is simply an alias for uint8.)
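A trivial sketch of what such rune arithmetic might look like (the values are arbitrary):

```go
package main

import "fmt"

func main() {
	r := 'a'       // r has type rune, i.e. int32
	fmt.Println(r) // 97: just a number

	// Plain integer arithmetic works on runes.
	fmt.Println(r + 1)         // 98
	fmt.Println(string(r + 1)) // "b"

	var b byte = 255 // byte is an alias for uint8, so 0..255
	fmt.Println(b)
}
```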
There is one other place where Go mandates UTF-8: Go source files themselves must be UTF-8. This has the curious side effect that string literals (such as: str := "Hello, World") are automatically UTF-8 encoded.
In a similar way, a Go character (pardon: rune) literal (like 'a') is simply a number of type int32. In other words, the three expressions ' ', '\x20', and 32 are all identically equal! Finally, because rune literals are untyped constants, evaluated at compile time, a 7-bit-clean expression such as 'a' fits into a byte.
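A minimal sketch to illustrate (the variable names are arbitrary):

```go
package main

import "fmt"

func main() {
	// Three ways to write the same number:
	fmt.Println(' ' == '\x20') // true
	fmt.Println(' ' == 32)     // true

	// A 7-bit-clean rune literal is an untyped constant and
	// fits into a byte without an explicit conversion.
	var b byte = 'a'
	fmt.Println(b) // 97
}
```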
How are strings different from byte buffers?
All this raises the question: why do we have a string type at all — would things not be easier and clearer if everything were handled explicitly as byte buffers ([]byte)?
The differences seem slight. Besides being immutable, string values are also comparable. (Something that byte slices are not — although the bytes package provides a Compare(a, b []byte) function, as does the strings package!)
And the string type supports conversion to []rune, by one of the two mechanisms described above.
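A short sketch of the comparability difference (the values are arbitrary):

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Strings are comparable with the built-in operators ...
	fmt.Println("abc" == "abc") // true
	fmt.Println("abc" < "abd")  // true

	// ... byte slices are not: they need bytes.Compare or bytes.Equal.
	a, b := []byte("abc"), []byte("abd")
	fmt.Println(bytes.Compare(a, b)) // -1
	fmt.Println(bytes.Equal(a, a))   // true
}
```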
Why This Way?
There are two questions that naturally arise:
- Why do Go strings not enforce an encoding (say: UTF-8) at all times?
- Why does Go not provide methods to operate on individual characters, only on bytes?
I believe the answer to the first question is the desire to be able to read any text, no matter what its encoding is. Unless the program needs to operate on individual characters, it never needs to know the encoding at all — all bulk string operations (trim, split, append, etc) can be done independent of the specific encoding. Given that, forcing each input string to be converted to (say) UTF-8, and possibly back to its original encoding on output, seems wasteful.
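For example (a sketch; the strings and delimiter are arbitrary), splitting on an ASCII delimiter or concatenating strings works directly on the bytes and never needs to know how the rest of the payload is encoded:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// The payload between the commas could be in any (ASCII-compatible)
	// encoding; splitting on the ASCII comma does not care.
	record := "foo,bar,baz"
	fields := strings.Split(record, ",")
	fmt.Println(fields) // [foo bar baz]

	// Appending/concatenating is likewise just byte copying.
	fmt.Println(fields[0] + "-" + fields[2]) // foo-baz
}
```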
The reason that functions operating on individual characters are missing seems to lie in the spirit of the Go language to avoid operations that look simple but carry invisible costs. Given the variable-length encoding of UTF-8, the only way to find the i-th character in a string is to walk the string. Finding two characters requires walking the string twice. At that point, it is more efficient to walk the string only once, namely to break it into runes explicitly (using []rune(str)), and then operate on the slice of runes.
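In code, that might look like this (a sketch; the string and variable names are arbitrary):

```go
package main

import "fmt"

func main() {
	s := "héllo"

	// s[1] would only give us a byte; convert once, then index freely.
	runes := []rune(s)
	fmt.Printf("%q %q\n", runes[1], runes[4]) // 'é' 'o'
	fmt.Println(string(runes[1:3]))           // "él"
}
```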
Bitching
All that being said, I still find Go’s handling of strings, characters, and encodings confusing and difficult. It all sort of makes sense, but it is not an example of clarity and elegance; it is one of those instances where I get the feeling that the designers of the Go language didn’t really think things through to the end. There has to be a better way.
The separation of storage and encoding makes sense. I am less certain
that it makes sense to support a string type with bulk string
operations (split, append, etc), but without an explicit encoding. In
my experience, when working with strings, sooner rather than later I
need to operate on individual characters as well, and hence the
encoding comes in by the back door pretty quickly, anyway! But my
experience may be atypical; I don’t know. Finally, having parallel
data structures (namely string and []byte), which are almost, but not entirely, like each other, is weird and confusing.
But what I really don’t like is how some critical pieces of information are unnecessarily obscured — unless you are a language lawyer, it is not obvious that []rune(str) requires UTF-8. Should this not have been made explicit (whatever: utf8.StringToRunes(str) or so)? Similarly regarding the for-range loop construct — how is anybody supposed to guess that this operation silently requires UTF-8?
But the prize for the worst design must go to the decision to let the two most basic operations for any collection (namely len() and []), when applied to string, operate on bytes, not runes. That is not how a programmer expects a string type to work. It also seems to be getting things exactly backwards: I can’t think of a single relevant use case where I would want to know the length of a string in bytes, or access an individual byte of a multi-byte character (pardon: “rune”, of course). I guess this is a consequence of not enforcing an encoding from the outset: without an encoding, there are no “characters” to index, only bytes. (This is how one strange design decision leads to another.)
This is particularly insidious, because it so often seems to work: as long as you stick to 7-bit-clean ASCII. But it will break the moment you encounter “runes” from a wider character set — in other words, Go’s len and [] give you a false sense of security. A strange decision.