January 14, 2004

Linguist's Search Engine teaser

Mark's note on Google sociolinguistics kindly provided me with an advertisement, not to mention a gentle nudge to start writing here, something I've been meaning to do for a while. Appropriately, therefore, my first posting has to do with language data on the Web.

Mark asks whether refute that such-and-such is so is a construction that is coming or going. For some purposes, such as investigating the time course of a particular usage, I've found Altavista more convenient than Google, particularly because one can take advantage of the "advanced search" feature to access useful metadata. (Also, for linguistic purposes, I'm not sure preferring pages with a high Google rank is particularly helpful.) For example, when Mark looked for "refute" with sentential complements, he could have replicated his Googling using Altavista advanced search for "refuted that the" OR "refutes that the" OR "refute that the" with the date unconstrained, in order to get a rough idea of how Google and Altavista compare. He could then have done a time-based comparison by restricting the query to the time frame from 01/01/95 to 12/31/98 (18 hits) versus the time frame from 01/01/99 to 12/31/03 (739 hits).
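For anyone who wants to try the same trick on another verb, here is a minimal sketch in Python of how the OR query can be assembled and how the two date windows compare. The helper name `phrase_query` is my own invention for illustration, not part of any search engine's interface, and the hit counts are simply the two figures quoted above.

```python
def phrase_query(verb_forms, tail):
    """Build a query of quoted phrases joined by OR.

    Each inflected form is combined with the same following words, so that
    a contiguous word sequence stands in for the sentential complement.
    """
    phrases = [f'"{form} {tail}"' for form in verb_forms]
    return " OR ".join(phrases)


# The three inflected forms used in the search described above.
query = phrase_query(["refuted", "refutes", "refute"], "that the")
print(query)

# Hit counts quoted above for the two date windows.
hits_1995_1998 = 18
hits_1999_2003 = 739

# Raw ratio only; it does not correct for the growth of the Web itself.
ratio = hits_1999_2003 / hits_1995_1998
print(f"The later window has roughly {ratio:.0f} times as many hits.")
```

The ratio here is a raw one, so on its own it says nothing about whether the construction is spreading faster than the Web is growing.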

As a matter of fact, a few years ago I did this same sort of operation for a pre-sentential exclamation I'd been noticing more and more frequently: "woo hoo!". (E.g., "Woo hoo! I won!") At the time, my search using Altavista provided an estimated 15 instances of this expression in total prior to 1996, 144 in 1996, 459 in 1997, 2269 in 1998, and 6676 from January to August 1999. These were raw counts, but further analysis showed that the usage of this phrase increased by two orders of magnitude even when counts were normalized to account for the growth of the Web. (At the time I used Web host count data at http://www.isc.org/, which have been saved from oblivion by the Internet Archive. Bless you, Brewster.) A bit more detective work led me to the probable origin of the expression, or at least of its increased popularity [sound clip]. This may not have been a linguistically deep example, but it did help convince me how powerful the Web could be, potentially, as a resource for data about language in use.
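The normalization is nothing fancy, and a rough sketch in Python may make it concrete: divide each year's raw count by a measure of the Web's size for that year. The raw counts below are the ones quoted above for the three complete years. The host counts are round placeholder numbers I made up purely for illustration, not the actual ISC figures, and the function name `per_million_hosts` is likewise hypothetical.

```python
def per_million_hosts(raw_counts, host_counts):
    """Convert raw yearly hit counts into hits per million Web hosts."""
    return {
        year: raw_counts[year] / host_counts[year] * 1_000_000
        for year in raw_counts
    }


# Raw "woo hoo!" counts quoted above (complete years only).
raw_counts = {1996: 144, 1997: 459, 1998: 2269}

# PLACEHOLDER host counts, invented for illustration only.
# Substitute the real host-count survey figures before drawing conclusions.
host_counts = {1996: 10_000_000, 1997: 16_000_000, 1998: 30_000_000}

rates = per_million_hosts(raw_counts, host_counts)
for year in sorted(rates):
    print(year, round(rates[year], 1))

# Growth in the normalized rate between the first and last year.
growth = rates[1998] / rates[1996]
print(f"Normalized rate grew by a factor of about {growth:.1f}")
```

With placeholder host counts the resulting factor means nothing in itself. The point is only that a rise that survives this division cannot be explained by the Web simply getting bigger.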

The trouble was, Web search engines were not -- and still are not -- well suited to the needs of the ordinary working linguist. If you're able to approximate a phenomenon using a contiguous sequence of words like refute that the, great. If, on the other hand, you're interested in looking on the Web for a phenomenon involving syntactic structure that is not easily approximated in this way, you're out of luck.

As an example, someone once commented that a model of mine predicted (incorrectly, he thought) that the verb titrate should be grammatical with an implicit direct object. For example, "You should stop titrating" should be able to mean "You should stop titrating whatever it is that you are titrating" the same way that "You should stop eating" can mean "You should stop eating whatever it is that you are eating". I agreed with the prediction, but I had no intuitive judgment of my own. Where to find lots of people for whom titrate is an active vocabulary item? One obvious step was to look on the Web for this particular verb used intransitively -- or better yet, find a way to search more generally for sentences according to linguistically relevant lexical and/or syntactic criteria.

That didn't exist... so, to make a long story short, a few years ago I decided that such a thing needed to be built and convinced NSF that this was a good idea, and we built it. ("We" being a team at the University of Maryland, most notably a brilliant software designer/programmer named Aaron Elkiss, with input from collaborators Mari Broman Olsen and Christiane Fellbaum.) We call it the Linguist's Search Engine, or LSE for short.

I'm not going to say much more about the LSE in this post -- that's why the title says it's a teaser. Why am I doing this? Well, I would have liked to be giving out the URL for it right now, but in mid-December the LSE server was one of nine computers stolen out of an office at the University of Maryland (!!). Everything was backed up, fortunately, but it took a while to get a new machine and restore everything, so we had to push our going-public date back a month or so. I hope to have the URL for you next week.

Meanwhile, though, let me close with a few examples I found quite easily using the LSE:

  • http://www.ocdsb.edu.on.ca/JMCCweb/PROJECTS/SCIENCE/ChemLabs/ABTitr.html
    ...In this experiment, you will use a computer to monitor pH as you titrate.
  • http://ww2.lafayette.edu/~bonos/Week1.htm
    We only titrated with phenolphthalein if the pH was above or equal to 8.3.
  • http://drnelson.utmem.edu/med2.html
    Conversely, if we titrate in the opposite direction...
  • http://misterguch.brinkster.net/q9.html
    The endpoint of a titration is when the indicator tells you should stop titrating....

Woo hoo!

Posted by Philip Resnik at January 14, 2004 10:01 PM