a glob of nerdishness

November 18, 2009

Go Unicode

written by natevw @ 3:27 pm

I’ve been eagerly learning about Go lately. It’s a nascent systems programming language with a nice design and some great features.

I have also gotten back into reading through my Unicode 5.0 book over the past few weeks, so when I saw that Go had a built-in string type I was immediately curious what that meant. My initial conclusion was not good.

Then I realized one of the Go team’s earlier inventions was UTF-8, a Unicode encoding with many pragmatic properties. After a little more research, here’s the skinny.

Go’s strings:

- are immutable sequences of bytes, not of characters
- conventionally hold UTF-8 text; string literals in Go source (which is itself defined to be UTF-8) carry their bytes through unchanged
- are indexed by byte, while the for range loop decodes them and iterates code point by code point
- compare byte by byte, with no Unicode normalization applied

This last point could be seen as a drawback, because it means that é (U+00E9, LATIN SMALL LETTER E WITH ACUTE) will not compare equal to the canonically equivalent sequence e (LATIN SMALL LETTER E) + U+0301 (COMBINING ACUTE ACCENT). However, to do this natively in Go would require each standalone binary to include a large set of character code table information. Furthermore, Unicode defines two forms of equivalence, canonical and compatibility. I have an opinion on which one the Go compiler itself should eventually use for token comparison, but for runtime use neither could serve as the one meaning of the string comparison operator.
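
To make that concrete, here is a minimal sketch (in present-day Go syntax; the literal values are just illustrative):

```go
package main

import "fmt"

func main() {
	precomposed := "\u00e9" // é as the single code point U+00E9
	decomposed := "e\u0301" // e followed by U+0301 COMBINING ACUTE ACCENT

	// The == operator compares the underlying bytes, so these two
	// canonically equivalent strings are not equal.
	fmt.Println(precomposed == decomposed) // false

	fmt.Printf("% x\n", precomposed) // c3 a9    (two bytes of UTF-8)
	fmt.Printf("% x\n", decomposed)  // 65 cc 81 (three bytes of UTF-8)
}
```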

Like so much of Go’s design, the way strings work is an elegant compromise that encourages useful idioms without making decisions a programming language shouldn’t. Normalization goes beyond settling the encoding question, and begins a climb up the tall stack of human language concerns. (Check out the ICU project for a sampling of basic Unicode toppings.)
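
That layering is exactly what library code can supply. As one sketch of the idea, the golang.org/x/text/unicode/norm package (a later addition to Go’s extended libraries, well after this post) lets callers opt in to normalization where it matters:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	precomposed := "\u00e9" // é as one code point
	decomposed := "e\u0301" // e + combining acute accent

	// Raw comparison still sees different bytes...
	fmt.Println(precomposed == decomposed) // false

	// ...but normalizing both sides to NFC first makes the
	// canonical equivalence visible to ==.
	fmt.Println(norm.NFC.String(precomposed) == norm.NFC.String(decomposed)) // true
}
```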

One final note about the implications of Go’s string type: in C, it can be tempting to use string functions on binary data known to contain no interior '\0' bytes. Go’s type system should make this obvious, but use byte slices ([]uint8, aliased as []byte), never strings, for binary data in Go. Even if your bytes contain no '\0' characters at all, iterating over binary data as string characters will not yield what you expect, because of the way UTF-8 decoding works.
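
A quick illustration of that pitfall (again a sketch, with made-up bytes): ranging over a []byte yields the raw bytes, while ranging over a string decodes UTF-8 and silently turns every invalid byte into U+FFFD, the replacement character.

```go
package main

import "fmt"

func main() {
	data := []byte{0xff, 0xfe, 0x01} // arbitrary binary data, not valid UTF-8

	// A byte slice iterates as the bytes themselves.
	for i, b := range data {
		fmt.Printf("byte %d: %#x\n", i, b)
	}

	// The same data as a string iterates as decoded runes: the two
	// invalid bytes both come back as U+FFFD, losing information.
	for i, r := range string(data) {
		fmt.Printf("rune at byte %d: %#U\n", i, r)
	}
}
```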

January 10, 2008

Playing with Mail and Leopard’s Latent Semantic Mapping

written by natevw @ 6:35 pm

While clearing Mail.app’s junk mail folder, I might have accidentally deleted a non-spam message. I’ll never know for sure, but as a result I learned a bit more about Leopard’s cool new Latent Semantic Mapping framework, which I’d been wondering about since it leaked onto a mailing list back in November.

Mail stores its spam information in ~/Library/Mail/LSMMap2. In addition to the Latent Semantic Mapping framework itself, Apple ships lsm, a command-line utility that exposes the same functionality (with somewhat better documentation, I might add). As described in the man page, you can use lsm dump ~/Library/Mail/LSMMap2 to get a list of all the words that Mail’s spam filter knows about. (Some words probably NSFW, of course!) The first column is how many times the word has appeared in a “Not Junk” message, and the second is its count in spam messages. The last line gives a few overall statistics: how many zero values there were versus total values, and a “Max Run” value I don’t understand.

Between this and CFStringTokenizer and its language-guessing coolness, Leopard provides some fun tools for playing around with text analysis. Hopefully someday I’ll have a bit more time to dig into it.

Until then (or rather, for my future reference), here’s a bit more information on Latent Semantic Analysis: how it works and how it differs from Bayesian classification.

I’ve also uploaded a really quick and dirty “playground” for testing out the hypotheses the documentation left me with: lsmtest.m

Update: Came across a more “explainy” article about both Mail and Latent Semantic Analysis over on macdevcenter.com.