a glob of nerdishness

November 18, 2009

Go Unicode

written by natevw @ 3:27 pm

I’ve been eagerly learning about Go lately. It’s a nascent systems programming language with a nice design and some great features.

I have also gotten back into reading through my Unicode 5.0 book over the past few weeks, so when I saw that Go had a built-in string type I was immediately curious what that meant. My initial conclusion was not good.

Then I realized one of the Go team’s earlier inventions was UTF-8, a Unicode encoding with many pragmatic properties. After a little more research, here’s the skinny.

Go’s strings:

- are immutable sequences of bytes, not of “characters”
- are treated as UTF-8 wherever the language has to interpret their contents: string literals are UTF-8, and the range clause decodes a string one code point at a time
- are compared byte-by-byte, with no Unicode normalization applied
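
The byte/rune distinction is easy to see in practice. A minimal sketch (the string contents are just an example):

    package main

    import "fmt"

    func main() {
        s := "café" // the é takes two bytes in UTF-8

        fmt.Println(len(s)) // 5: len counts bytes, not characters

        // The range clause decodes UTF-8, yielding byte offsets and code points.
        for i, r := range s {
            fmt.Printf("byte offset %d: %c\n", i, r)
        }
        // The é is reported at byte offset 3, and there is no offset 4.
    }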

That last point about comparison could be seen as a drawback, because it means that <LATIN SMALL LETTER E WITH ACUTE> will not compare equal to the equivalent <LATIN SMALL LETTER E> + <COMBINING ACUTE ACCENT>. However, to do this natively in Go would require each standalone binary to include a large set of Unicode character tables. Furthermore, Unicode defines two equivalence forms. I have an opinion on which one the Go compiler itself should eventually use for token comparison, but for runtime use neither could serve as the one meaning of the string comparison operator.
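
A quick sketch of that comparison behavior, spelling the two forms with escape sequences:

    package main

    import "fmt"

    func main() {
        precomposed := "\u00e9" // é as a single code point
        decomposed := "e\u0301" // e followed by a combining acute accent

        // The two strings display identically, but they are different byte
        // sequences, and == compares bytes, so this prints false.
        fmt.Println(precomposed == decomposed)
    }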

Like so much of Go’s design, the way strings work is an elegant compromise that encourages useful idioms without making decisions a programming language shouldn’t. Normalization goes beyond settling the encoding question, and begins a climb up the tall stack of human language concerns. (Check out the ICU project for a sampling of basic Unicode toppings.)
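
If a program does need normalized comparison, that logic has to come from a library rather than from the language. A rough sketch, assuming the golang.org/x/text/unicode/norm package (an add-on package, not part of the language or its standard library) is available:

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm" // assumed add-on dependency
    )

    func main() {
        precomposed := "\u00e9"
        decomposed := "e\u0301"

        // Converting both strings to NFC makes them byte-identical, at the
        // cost of linking the Unicode tables into the binary.
        fmt.Println(norm.NFC.String(precomposed) == norm.NFC.String(decomposed)) // true
    }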

One final note about the implications of Go’s string type: in C, it can be tempting to use string functions on binary data known to contain no embedded '\0' bytes. Go’s type system should make the distinction obvious: use uint8 slices, never strings, for binary data in Go. Even if your bytes contain no terminating '\0' characters, iterating over binary data as if it were string characters will not yield what you expect, because of the way UTF-8 decoding works.
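
To make that concrete, here is a small sketch of ranging over the same bytes as a slice and as a string (the byte values are arbitrary and deliberately not valid UTF-8):

    package main

    import "fmt"

    func main() {
        data := []byte{0xff, 0xfe, 0x48, 0x69} // arbitrary binary, not valid UTF-8

        // Ranging over the slice visits every byte exactly as stored.
        for i, b := range data {
            fmt.Printf("byte %d: %#x\n", i, b)
        }

        // Ranging over the same bytes as a string decodes UTF-8 instead: the
        // invalid bytes come back as U+FFFD replacement characters, not as
        // the original data.
        for i, r := range string(data) {
            fmt.Printf("offset %d: %#U\n", i, r)
        }
    }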