October 17, 2007

Programming Language

Ever since my high-school English teacher ran a course on "Electronic Grammar," I've been intrigued by the idea of writing programs to analyze language. Thirty years later the data is more accessible and the programming languages are much easier to use; LanguageLog contains many posts that demonstrate the value of simple programs and plots. The Natural Language Toolkit (NLTK) is designed to make it easy for anyone to write Python programs that access language data and generate tables and plots, and a new version has just been released. NLTK includes a free online book with over 200 graded exercises, including some inspired by LanguageLog, such as the above vocabulary growth curve for presidential addresses. It also contains a large software library, 480 MB of data in dozens of languages, interactive graphical demonstrations, and distributions for Windows, Mac OS X and Linux, all free. Although NLTK is now used in over 50 universities, I hope it will come full circle, so that high-schoolers will teach themselves to write programs to analyze language and to test the dubious claims that are often made about language.

Posted by Steven Bird at October 17, 2007 08:21 PM