Language Log: Is "language modeling" linguistics?

July 17, 2006

Is "language modeling" linguistics?

In reponse to my post about the relative public awareness of linguistics and other disciplines ("Psychology ≅ 10-100 x Linguistics?"), Vili Maunula of Lingformant wrote to point out another way to get data on this: Google Trends. You can get a comparative display -- what it shows is qualitatively consistent with the quantitative estimates that I developed earlier ("psychology" as a search term in red, "linguistics in blue"):

Meanwhile, here at the "Microsoft Research Faculty Summit 2006", I'm looking forward to Ken Church's update to his 2005 ACL presentation (Kenneth Church and Bo Thiesson, "The Wild Thing" ACL 2005). Ken, who has always had a flair for public relations, is using the catchy title "Wild Thing Goes Mobile and Local" (well, the competition is titles like "Recent Progress on Sensor Networks and Embedded Computing", or "Bioinformatics on the Windows Platform"):

Typing is a pain, especially on a phone. Suppose you want to search for “Condoleezza Rice,” but you don’t know how to spell her name. And even if you did, you wouldn’t want to type that much, especially on a phone. With the Wild Thing, the user types “c rice” or “2#7423” on the phone. This pattern is short hand for the regular expression: /c.* rice.*/ or /[abc] .*[pqrs][ghi][[acb][def].*/ on the phone. The system uses a language model, based on MSN query logs, to find the k-best (that is, most popular) expansions of the regular expression. To find hot stuff, or your favorites, you shouldn’t have to type a lot, especially if we know your location. The Wild Thing raises some interesting — and fun — technical challenges for language modeling research. What is the probability of all queries in all locations? MSN has lots of data, but they haven’t seen every query in every location. How do we smooth language models over geography?

Ken is using the term "language modeling" in the standard technical sense: "estimating the a priori probability of different character sequences". Is "language modeling", in this sense, part of the discipline of linguistics?

The logical answer, in my opinion, is "yes". We're talking about empirical estimates of the parameters of a formal model of the relative probability of different linguistic sequences, in geographical and social context, used to guess someone's communicative intent: surely this is part of the science and technology of language. More precisely, I guess, we can classify it as "applied psycholinguistics".

But the practical answer, alas, is "no". Most people who work on "language modeling" consider themselves to be computer scientists (or sometimes electrical engineers), and their credentials and their organization names generally reflect that. In 1983, I hired Ken into something called the "Linguistics Research Department" at Bell Labs in 1983 -- but he'd just gotten his PhD in "computer science", not "linguistics". And "Linguistics Research Department" is not an entry in very many corporate directories.

I'm not arguing for disciplinary exclusion -- in a logical world, there'd still be plenty of room for mathematicians and computer scientists and psychologists and sociologists to study language, and such interest should continue to be welcomed. Instead, I'm trying to make a point about disciplinary inclusion. "Language modeling" is now mostly outside of "linguistics" because of a series of historical accidents, not because this way of drawing disciplinary boundaries makes sense. The world would be better if things were different.

Posted by Mark Liberman at July 17, 2006 02:39 PM