a glob of nerdishness

November 18, 2009

Go Unicode

written by natevw @ 3:27 pm

I’ve been eagerly learning about Go lately. It’s a nascent systems programming language with a nice design and some great features.

I have also gotten back into reading through my Unicode 5.0 book during the past weeks, so when I saw that Go had a built-in string type I was immediately curious as to what that meant. My initial conclusion was not good.

Then I realized one of the Go team’s earlier inventions was UTF-8, a Unicode encoding with many pragmatic properties. After a little more research, here’s the skinny.

Go’s strings:

This last point could be seen as a drawback, because it means that <LATIN SMALL LETTER E WITH ACUTE> will not compare equal to the equivalent <LATIN SMALL LETTER E> + <COMBINING ACUTE ACCENT>. However, to do this natively in Go would require each standalone binary to include a large set of character code table information. Furthermore, there are two equivalence forms defined by Unicode. I have an opinion on which one the Go compiler itself should eventually use for token comparison, but for runtime use neither could serve as the one meaning of the string comparison operator.
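To make the comparison point concrete, here is a minimal sketch using only the standard library. The helper name sameBytes is made up for illustration; it simply wraps the == operator, which compares the underlying bytes and does no normalization.

```go
package main

import "fmt"

// sameBytes (a hypothetical helper) reports whether two strings are equal
// under Go's == operator, which compares bytes and does not normalize.
func sameBytes(a, b string) bool {
	return a == b
}

func main() {
	precomposed := "\u00e9" // LATIN SMALL LETTER E WITH ACUTE
	decomposed := "e\u0301" // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

	fmt.Println(sameBytes(precomposed, decomposed)) // false
	fmt.Println(len(precomposed), len(decomposed))  // 2 3
	fmt.Printf("% x\n", precomposed)                // c3 a9
	fmt.Printf("% x\n", decomposed)                 // 65 cc 81
}
```

Both strings render as the same character, but their UTF-8 bytes differ, so they compare unequal. A program that wants them treated as equal has to normalize both sides to one form itself before comparing.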

Like so much of Go’s design, the way strings work is an elegant compromise that encourages useful idioms without making decisions a programming language shouldn’t. Normalization goes beyond settling the encoding question, and begins a climb up the tall stack of human language concerns. (Check out the ICU project for a sampling of basic Unicode toppings.)

One final note about the implications of Go’s string type: in C, it can be tempting to use string functions on binary data known to contain no inner ‘\0’ bytes. Go’s type system should make this obvious, but use uint8 slices (that is, []byte), and never strings, for binary data in Go. Even if your bytes contain no ‘\0’ characters, iterating over binary data as string characters will not yield what you expect, because of the way UTF-8 encoding works.
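A small sketch of what goes wrong, assuming some arbitrary made-up bytes standing in for binary data. The helper runesSeen is hypothetical; it just collects what a range loop over a string produces.

```go
package main

import "fmt"

// runesSeen (a hypothetical helper) collects the code points produced by
// ranging over s, which decodes the bytes as UTF-8.
func runesSeen(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	// Five arbitrary bytes of "binary data" (illustrative values only).
	data := []byte{0xc3, 0xa9, 0xff, 0x00, 0x41}

	// Ranging over the slice visits every byte exactly once.
	for i, b := range data {
		fmt.Printf("byte %d: 0x%02x\n", i, b)
	}

	// Ranging over the same bytes as a string decodes UTF-8 instead:
	// 0xc3 0xa9 merge into U+00E9, and the invalid byte 0xff comes
	// back as the replacement character U+FFFD.
	for i, r := range string(data) {
		fmt.Printf("offset %d: %U\n", i, r)
	}
	fmt.Println(len(data), len(runesSeen(string(data)))) // 5 4
}
```

Five bytes go in, four "characters" come out, and one of them is not a value that was ever in the data. Ranging over the slice keeps every byte intact.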

2 Comments

  1. [editor's note: this was spam for some vitamin site, but is a direct quote from a somewhat relevant part of an important paper.]

    End-to-end arguments are a kind of “Occam’s razor” when it comes to choosing the functions to be provided in a communication subsystem. Because the communication subsystem is frequently specified before applications that use the subsystem are known, the designer may be tempted to “help” the users by taking on more function than necessary. Awareness of end-to-end arguments can help to reduce such temptations.

    Comment by zinc — November 19, 2009 @ 6:31 am

  2. I’d rather see strings having the interface of a sequence of code points with actual storage being implementation-defined – like NSStrings back in Unicode 1.x. Still, Go’s approach seems reasonable… unlike, say, Arc’s “strings are hard, let’s have a bucket of bytes instead” approach. I definitely agree that normalization shouldn’t be part of a basic equality operator.

    Going off on a wild tangent, I recently had to explain to some Linux distro why Oolite requires its own build of Spidermonkey. The specific reason is that it requires JS_STRINGS_ARE_UTF8 to be defined, which it isn’t in normal builds; further research showed that the Mozilla folks had tried that, but it had blown up in their faces because the XUL API uses strings as data blobs. Last I saw, the plan was to fix this (and break compatibility) in Gecko 2.0/Firefox 4. Strings are for text, data is from Venus!

    Comment by Jens Ayton — November 19, 2009 @ 3:19 pm
