December 06, 2004

Lexemes and word forms

Language Log readers who are sharp of eye and typographically on the ball (the sort of readers who can tell one font from another, and thus tend to refer to Dan Rather's embarrassing Microsoft Word-processed Texas Air National Guard memos as "forged" rather than "of disputed authenticity") will have noticed that I sometimes cite words that I mention in a post by putting them in italics (like this), but then sometimes I put them in bold italics (like this). I should have explained this notational convention long ago. I did actually touch on it accidentally in another context once (in this post), but you could be forgiven for having overlooked it, since that post was primarily about trademark law. But anyway, my usage does not display random variation in font style selection. There is a semantics to it. The needed explanation follows.

Let's look at a typical case, from my post "Ray Charles, America, and the subjunctive":

. . . when you hear crown you have your crucial piece of evidence. The preterite of crown is crowned, so the line And crown thy good with brotherhood cannot be a preterite.

Why the first occurrence of "crown" in italics, the second in bold italics, and then "crowned" in italics again? The answer is that the font style distinction is systematically used to reflect a conceptual distinction: word forms are being distinguished from lexemes.

A lexeme is a word in roughly the sense that would correspond to a dictionary entry. Lexeme names are given in bold italics. The point about "crown", for example, is that as a transitive verb it would get one entry despite the existence of four different shapes in which it appears: crown, crowns, crowned, crowning. These different shapes spell out word forms that belong to the verb lexeme crown. In a big and detailed dictionary they would all be listed in the single entry for crown. (In shorter dictionaries you would just be expected to know that the word forms for a regular verb like crown would be crown, crowns, crowned, crowning, the word forms for a regular verb like walk would be walk, walks, walked, walking, and so on: they list the lexemes, you are meant to know the grammar.)

There would be another lexeme in the dictionary for "crown", of course: a noun lexeme crown. Its word forms would be the plain singular crown, the plain plural crowns, the genitive singular crown's, and the genitive plural crowns'.
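
For readers who think in data structures, the distinction can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not anything taken from The Cambridge Grammar: the names Lexeme, LEXICON, and lexemes_for are made up for the example, and the form lists are just the ones given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """Roughly a dictionary entry: a name, a category, and its word forms."""
    name: str
    category: str
    forms: tuple  # shapes of the word forms, as listed in the text


# The two lexemes discussed above (a hypothetical encoding).
CROWN_VERB = Lexeme("crown", "verb", ("crown", "crowns", "crowned", "crowning"))
CROWN_NOUN = Lexeme("crown", "noun", ("crown", "crowns", "crown's", "crowns'"))

LEXICON = (CROWN_VERB, CROWN_NOUN)


def lexemes_for(shape):
    """Return every lexeme in LEXICON that has a word form with this shape."""
    return [lex for lex in LEXICON if shape in lex.forms]
```

Asking for lexemes_for("crowned") picks out only the verb, while lexemes_for("crown") returns both entries: one shape, two lexemes.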

This notational convention emerged first in the work of Rodney Huddleston and is used systematically throughout The Cambridge Grammar. Occasionally it is suppressed when drawing the distinction would distract rather than clarify. So, for example, in my "Those who take the adjectives from the table" I say this:

How could "one of the few points on which the sages of writing agree" possibly be that "it is good to avoid them" when to utter the very thought you need the adjective good? How could William Zinsser possibly be serious in saying that most adjectives are "unnecessary" when he couldn't finish his sentence without the adjective unnecessary?

Here I actually mean the adjective lexeme good, which has the word forms good, better, and best. Using better would count as using the adjective good, though in its comparative inflectional form. But the next adjective mentioned was unnecessary, which does not inflect for comparison: there is no *unnecessarier or *unnecessariest. I thought it would look distractingly odd to put just good in bold italics. I therefore didn't add that pedantic detail. Nothing about its inflectional forms was relevant to what I was saying. But in general, whenever there could be confusion about whether I meant a word form or a lexeme, I will use the distinction in font styles, and always in the same way.

For words that have only one shape the distinction between lexemes and word forms makes no sense (for a language that truly has no inflection at all, one wouldn't draw the distinction), so the minimum number of word forms for a lexeme would be two. That minimum is represented in English by verbs such as must and ought, which are modal verbs with no preterite (inflected past tense). The shapes of the two word forms of must are must (present tense neutral) and mustn't (present tense negative).

Which English lexeme holds the record for most word forms? The answer is be. The absolute minimum number of separate word forms it has (assuming no distinct word forms that have the same shape, but counting the informal-style negative variants as word forms) is 12: am, are, aren't, been, be, being, is, isn't, was, wasn't, were, weren't.
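
The counts are easy to check mechanically. Here is a minimal sketch, assuming (as the count above does) that each distinct shape counts once; the names BE_FORMS, MUST_FORMS, and count_shapes are mine, chosen for the illustration.

```python
# Shapes of the word forms listed in the text for the two extremes.
BE_FORMS = ["am", "are", "aren't", "been", "be", "being",
            "is", "isn't", "was", "wasn't", "were", "weren't"]
MUST_FORMS = ["must", "mustn't"]


def count_shapes(forms):
    """Count distinct shapes, so a repeated spelling is only counted once."""
    return len(set(forms))


be_count = count_shapes(BE_FORMS)      # the record holder
must_count = count_shapes(MUST_FORMS)  # the minimum
```

With these lists, be_count comes out as 12 and must_count as 2.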

In some languages (Sanskrit, for example) the number of word forms for a verb lexeme is in the high hundreds, and for some others (Turkish, for example) it is certainly in the thousands.

Posted by Geoffrey K. Pullum at December 6, 2004 07:42 PM