Here is one more entry in this morning's character-encoding theme. Yesterday, Patrick Hall at Blogamundo posted an interesting reaction to our recent Name That Tune's Language series, "How do you Google something you can't spell?". He observes that a web search for the song's lyrics using the correct Romanian orthography fails, while a search using an English-inspired, asciified re-spelling succeeds. He points out that similar problems are common when searching for Hindi on the web, and cites a paper that suggests an approach based on n-gram matches.
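To make the n-gram idea concrete, here is a minimal sketch in Python. It is not the method of the paper Patrick cites; it simply assumes that we fold away diacritics via Unicode decomposition and then compare two strings by the overlap of their character trigrams. The function names (`fold`, `ngrams`, `similarity`) and the sample words are my own illustrations.

```python
import unicodedata


def fold(text):
    """Lowercase and strip combining marks, so that e.g. 'ă' becomes 'a'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with one space per side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard overlap of the character n-gram sets of the folded strings."""
    grams_a, grams_b = ngrams(fold(a), n), ngrams(fold(b), n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Illustrative comparison: a Romanian word, its asciified form, and a re-spelling.
print(similarity("țară", "tara"))
print(similarity("țară", "tsara"))
```

Folding alone makes the asciified form match exactly, while the trigram overlap gives partial credit to a re-spelling that folding cannot repair. A real search engine would index the n-grams rather than compare pairs one at a time.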
There's a cluster of loosely related problems here: dealing with variant (and wrong) spellings; dealing with "eye dialect"; dealing with alternative encodings (e.g. GB vs. Big 5 vs. Unicode for Chinese characters, or the multiple proprietary encodings of Hindi newspapers); dealing with languages whose orthographic system is not standardized (e.g. Somali or Elizabethan English); dealing with transliterations that are semi-systematic and/or in multiple systems (e.g. Arabic into English or English into Chinese); and so forth.
In addition to the ideas that Patrick references, let me cite the following without further comment (for now): Soundex, agrep and (especially) TRE; and Google Scholar queries like {fuzzy matching transliteration} and {levenshtein transliteration}.
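For readers who haven't met it, Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Here is a minimal sketch of the standard dynamic-programming computation, with a small helper for picking the nearest candidate spelling. The helper `closest` and the sample names are hypothetical illustrations, not part of any of the tools mentioned above.

```python
def levenshtein(a, b):
    """Edit distance between a and b: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest(query, candidates):
    """Return the candidate with the smallest edit distance to query (hypothetical helper)."""
    return min(candidates, key=lambda c: levenshtein(query, c))


# Two common English transliterations of the same Arabic name.
print(levenshtein("Mohammed", "Muhammad"))
print(closest("Muhamad", ["Muhammad", "Mahmoud", "Ahmed"]))
```

Plain edit distance treats every substitution alike, which is one reason transliteration work often weights the edits, so that swapping "o" for "u" costs less than swapping "m" for "k".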
Posted by Mark Liberman at June 4, 2006 10:22 AM