July 18, 2006

Pixie dust for digital divination: an open letter to Gary Flake

Yesterday I heard an interesting talk by Gary Flake here at the "Microsoft Research Faculty Summit". His title was "How I Learned to Stop Worrying and Love the Imminent Internet Singularity", and the subtitle was "Why right now is the best time in the history of the universe to be a computer scientist". You can read his slides by clicking on the link behind his title. I enjoyed the talk, including what he said at the end about his plans for Microsoft Live Labs, and I especially liked the idea that he described (on slide #34) as "Use extra resources as pixie dust". He was talking about trying to do difficult Microsoft-internal magic, like connecting research with products, but it made me think about a different hard problem, also much discussed yesterday: the "dramatic decline ... in the number of CS and IT graduates", and the "impending loss of national competitiveness". This inspired me to express, in the form of an open letter to Gary Flake, some ideas that have been circulating for a few years among computational linguists.

The world is writing itself down on the web, and those who can log and index and search and count can read it. At least, they can read it if they're literate in the mathematical languages of digital divination, and able to conjure up the right algorithmic spirits.

Search engines offer everyone a peek at the possibilities, although most people don't have the right data, or the right tools, or the right skills, to take more than a few stumbling steps down this road. Nobody has traveled more than the first few miles. But every bright kid sees where the road is going, and wants to get there. This is a fantastic research opportunity, and an even more spectacular educational opportunity.

Three things are missing: data, tools and knowledge. Grabbing an interesting chunk of the web is a significant chore, and data like query logs are only available to those with popular search engines. It's a bigger chore to create an environment where skilled divinators can easily ask -- and quickly answer -- interesting questions about the mathematical entrails of a web snapshot. And the biggest barrier for would-be researchers is acquiring a practical grasp of the divinatory arts: not just programming, but statistics, information theory, formal language theory, graph theory, complexity theory and more.

But suppose you collected some big chunks of net-stuff and connected them to the right kind of search environment -- clever indices for some things, general map-reduce analysis for others. Add a collection of interesting, accessible howtos and sample scripts for exciting problems, and throw open the gates. Academic researchers in fields like computational linguistics would rush in and start playing. Within a couple of years, this stuff would be featured in courses from high school to graduate school, in areas from linguistics and psychology to statistics and computer science. Individual students would start winning science-fair competitions with projects from this world.
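To give a flavor of what one of those sample scripts might look like, here is a minimal sketch of a map-reduce style word count in Python. Everything in it is an assumption made for illustration: the three-sentence `DOCS` list stands in for a real web snapshot, and the names `map_doc`, `reduce_counts` and `word_counts` are hypothetical, not part of any existing environment. A real system would spread the map step across many machines; this version runs it in a single process.

```python
from collections import Counter
from functools import reduce
import re

# Hypothetical stand-in for a "chunk of net-stuff": three tiny documents.
DOCS = [
    "The world is writing itself down on the web",
    "Those who can index and search and count can read the web",
    "Count the words and read the world",
]

# Crude tokenizer (an assumption): runs of lowercase letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def map_doc(text):
    """Map step: turn one document into a Counter of its lowercased tokens."""
    return Counter(TOKEN.findall(text.lower()))


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_counts(docs):
    """Apply the map step to every document, then fold the partial counts together."""
    return reduce(reduce_counts, map(map_doc, docs), Counter())


if __name__ == "__main__":
    # Print the three most frequent words in the toy collection.
    for word, n in word_counts(DOCS).most_common(3):
        print(word, n)
```

Because each document is counted independently and the merge is just addition, the same two functions would work unchanged whether the partial counts come from three sentences or from many shards of a crawl.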

The key that opens the gate is human language technology, because most of the web's meaning is in its words. You'd have to build a lot of HLT tools into the environment, and you'd have to teach people how to use them. But the ultimate object of study is not language -- though you'd certainly learn a lot about the world's languages along the way -- it's the world that language explores, describes and creates.

You could have a portable mini-version that anyone with a spare terabyte could run; a bigger system requiring a modest cluster; a huge system, hosted on somebody's servers, offering limited capabilities to the general public; and a huge system offering spectacular resources of space and time, but limited to those (students and faculty) who are invited in. There could also be specialized versions: the USPTO archives; the biomedical literature; the Enron files; the Congressional Record; blogs and web forums; the complete archives of the U.S. Supreme Court.

You'd attract hundreds of thousands of students into computational science and engineering; you'd help advance research in several fields; and you'd win a lot of friends for your company.

Posted by Mark Liberman at July 18, 2006 10:25 AM