November 18, 2006

Mrs. Olsen gets a D

John Vann sent in this Frazz strip, which ran 11/17/2006:

John's comment: "One might append '... or the Language Log'."

In this case, there's a Wikipedia article, which discusses the situation from many angles -- and provides ammunition for smart-mouthed elementary-schoolers everywhere by giving a long list of exceptions, including "oneiromancies", which breaks the rule twice, once in each direction.

And in fact, this rule (whatever its pedagogical value) performs badly as a predictor of English letter sequences, because of the high frequency of words like "their" and "science", and of Germanic names like "Einstein" and "Bruckheimer". In two random stories from today's NYT, I count:

      [^c]__    c__
ie        29      5
ei        11      0

This is a total of 29 right (29 cases of "ie" after a letter other than "c", plus 0 cases of "ei" after "c") vs. 16 wrong (5 + 11), for a grade of 64 on a scale of 100, or a D.

If we evaluate the performance in the terms usually used in modern AI, machine learning and similar disciplines, we'll get an F-measure of .78. I calculate this by defining the problem as predicting cases in which 'i' precedes 'e'. Then we can re-label the table in terms of predicted and observed positive ('i' before 'e') and negative ('e' before 'i') instances:

                   yes (observed)   no (observed)
yes (predicted)                29              11
no (predicted)                  5               0

Then the "precision" of the test (otherwise known as "positive predictive value") can be calculated as the number of true positives divided by the sum of true positives and false positives, which here is 29/(29+11) = 0.725. This is the proportion of the time that the rule is correct when it predicts a positive outcome, i.e. that 'i' precedes 'e'.

And the "recall" of the test (otherwise known as "sensitivity") is the number of true positives divided by the sum of true positives and false negatives, here 29/(29+5) ≅ 0.85. This is the proportion of the observed positive outcomes (i.e. where 'i' precedes 'e') that is predicted by the rule.

We usually take the harmonic mean of these two figures in order to get a combined score known as the "F-measure", which here is 2*.85*.725/(.85+.725) ≅ 0.78.
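For readers who want to check the arithmetic, here is a minimal Python sketch of the three calculations. The function name `precision_recall_f` and its argument names are just labels chosen for this illustration.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-measure) for one confusion table."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Cells from the two-story sample: 29 true positives,
# 11 false positives, 5 false negatives.
p, r, f = precision_recall_f(29, 11, 5)
print(f"precision={p:.3f} recall={r:.2f} F={f:.2f}")
```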

This might look a bit better than the elementary-school grade of D -- after all, you'll find plenty of machine-learning papers, in the best journals and conferences, with F-measures in the upper 70s. However, the referees don't let these papers get by without comparing their performance to the obvious trivial baselines, such as predicting the commonest outcome all the time. In this case, that amounts to the rule 'i' before 'e' no matter what -- and this rule actually works quite a bit better:

                   yes (observed)   no (observed)
yes (predicted)                34              11
no (predicted)                  0               0

Now we get precision of 34/(34+11) ≅ 0.76, and recall of 34/(34+0) = 1.0, for an F-measure of 2*.76*1.0/(.76+1.0) ≅ 0.86.

In terms of elementary-school grading, that would be 100*34/(34+11) ≅ 76 -- a solid C.
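The comparison between the two rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the rule functions `except_after_c` and `always_ie`, and the helper `score` are all names invented here. Each rule maps a context (preceded by "c" or not) to a predicted letter order, and `score` tallies the confusion table, the F-measure, and the grade-school percentage from the two-story counts.

```python
# Counts from the two-story sample:
# (observed letter pair, preceded by 'c'?) -> number of occurrences.
sample = {
    ("ie", False): 29,
    ("ie", True): 5,
    ("ei", False): 11,
    ("ei", True): 0,
}


def except_after_c(after_c):
    """Mrs. Olsen's rule: 'ie' unless the pair follows a 'c'."""
    return "ei" if after_c else "ie"


def always_ie(after_c):
    """Trivial baseline: predict 'ie' no matter what."""
    return "ie"


def score(counts, rule):
    """Tally a confusion table, treating 'ie' as the positive class."""
    tp = fp = fn = tn = 0
    for (observed, after_c), n in counts.items():
        predicted = rule(after_c)
        if predicted == "ie" and observed == "ie":
            tp += n
        elif predicted == "ie":
            fp += n
        elif observed == "ie":
            fn += n
        else:
            tn += n
    total = tp + fp + fn + tn
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        # Algebraically the same as the harmonic mean of precision and recall.
        "f": 2 * tp / (2 * tp + fp + fn),
        # Grade-school score: percentage of all cases predicted correctly.
        "grade": 100 * (tp + tn) / total,
    }


for name, rule in [("except after c", except_after_c), ("always ie", always_ie)]:
    result = score(sample, rule)
    print(name, round(result["f"], 2), round(result["grade"]))
```

Swapping in a different table of counts re-scores both rules without touching the rest of the code.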

So Mrs. Olsen's rule, however hallowed by tradition, is empirically pathetic.

[Update: a reader wrote to complain that two random news stories is not a very big sample. So I wrote a little program to calculate the numbers for a random month of NYT newswire, from 2001 (a total of about 8.7 million words):

       [^c]__      c__
ie    110,430    7,405
ei     53,241    3,640
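The program itself isn't reproduced here, so what follows is only a guess at the kind of counting involved, not the code that produced these numbers. It assumes that text is lowercased, that every occurrence of "ie" or "ei" counts, and that a pair at the very start of the text (or after anything other than "c") goes in the [^c]__ column. The names `PAIR` and `count_pairs` are invented for the sketch.

```python
import re
from collections import Counter

# A zero-width lookahead, so overlapping candidates (as in "iei")
# are each found rather than only the first of them.
PAIR = re.compile(r"(?=(ie|ei))")


def count_pairs(text):
    """Count 'ie' and 'ei' occurrences, split by whether 'c' precedes."""
    counts = Counter()
    lowered = text.lower()
    for match in PAIR.finditer(lowered):
        pair = match.group(1)
        start = match.start()
        previous = lowered[start - 1] if start > 0 else ""
        context = "c__" if previous == "c" else "[^c]__"
        counts[(pair, context)] += 1
    return counts


demo = count_pairs("Their science friend received the ceiling")
for key in sorted(demo):
    print(key, demo[key])
```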

Predicting i before e except after c gives us:

                   yes (observed)   no (observed)
yes (predicted)           110,430          53,241
no (predicted)              7,405           3,640

for precision = 110,430/(110,430+53,241) ≅ 0.67, and recall 110,430/(110,430+7,405) ≅ 0.94.

The F-measure is then 2*0.67*0.94/(0.67+0.94) ≅ 0.78.

So the precision was lower, the recall was higher, but the F-measure was the same.

And the grade? (110,430+3,640) = 114,070 right, (53,241+7,405) = 60,646 wrong, for a grade-school grade of: 100*114,070/(114,070+60,646) ≅ 65% correct. Again, a D.

And the alternative rule i before e no matter what?

                   yes (observed)   no (observed)
yes (predicted)           117,835          56,881
no (predicted)                  0               0

Now we get precision of about 0.67, and recall of 1.0, for an F-measure of 0.81. Not as good as before, but still better than the conventional rule. The "grade" of 117,835 right, 56,881 wrong, or about 67% correct, is also a bit better than the grade of the conventional rule.

Of course, any bright fourth-grader ought to be able to work out a simple rule that works a lot better than either of these. (Hint: supplement the default order with a list of the N commonest exceptions...)]
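One way to read that hint, sketched in Python. Everything here is an assumption made for illustration: the function names, the simplification that each word contains only one "ie" or "ei", and above all the word counts, which are made-up toy numbers and not taken from the NYT sample.

```python
from collections import Counter


def build_exceptions(word_counts, n):
    """Return the n most frequent words that contain 'ei'."""
    ei_words = Counter({w: c for w, c in word_counts.items() if "ei" in w})
    return {word for word, _ in ei_words.most_common(n)}


def accuracy(word_counts, exceptions):
    """Fraction of ie/ei tokens spelled correctly by 'ie' plus exceptions."""
    right = wrong = 0
    for word, count in word_counts.items():
        if "ie" not in word and "ei" not in word:
            continue  # the rule has nothing to say about this word
        predicted = "ei" if word in exceptions else "ie"
        if predicted in word:
            right += count
        else:
            wrong += count
    return right / (right + wrong)


# Toy frequencies, invented purely to exercise the code.
toy_counts = Counter({
    "their": 50, "friend": 20, "believe": 15,
    "either": 10, "weird": 5, "the": 100,
})

for n in range(4):
    exceptions = build_exceptions(toy_counts, n)
    print(n, sorted(exceptions), accuracy(toy_counts, exceptions))
```

With N = 0 this is just the "i before e no matter what" rule; each added exception can only help, and the most frequent ones help the most.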

[Mark Baker raises a different point:

I've always heard the rule as "I before E except after C, when the sound's E"; I didn't think anyone had ever suggested that the rule might apply to things that don't have an E sound until I saw people discussing it online. There are still lots of exceptions, but many less: does this now outperform your alternative "I before E always" rule?

The Wikipedia article (which I linked to above) offers two augmented versions, identifying Mark's as "British":

An augmented American version is:

i before e
except after c
or when sounding like a
as in neighbor and weigh

which excludes many of the exceptions but still fails to correctly handle many others.

A lesser known addendum in America is: Neither financier seized either species of weird leisure.

A British version is:

when the sound is ee
it's i before e
except after c

which excludes most exceptions, as well as excluding some words (e.g. friend) which are correctly handled by the American version. The most frequent everyday failures of the British form of the rule are seize, caffeine, protein and, for those who pronounce the initial vowel sound ee, either and neither.

Obviously the expanded versions are going to work better. However, since they're dependent on the alignment between spelling and pronunciation, it's going to be harder to score them. And since they make predictions about different sets of cases, using different numbers of clauses of differing complexity and generality, it's not easy to compare their scores.

In any case, the point about the Mrs. Olsens of the world is that they promulgate such rules not because they accurately describe the facts, but because of some long-ago assertion felt to have been authoritative.]

[And in a comment over at Pharyngula, Oolon Colluphid posted this:

"I" Before "E" Except After "C"
by Duncan McKenzie

It's a rule that is simple, concise and efficeint.
For all speceis of spelling it's more than sufficeint.
Against words wild and wierd, it's one law that shines bright
Blazing out like a beacon upon a great hieght,

It gives guidance impartial, sceintific and fair
In this language, this tongue to which we are all hier.
'Gainst the glaceirs of ignorance that icily frown,
This great precept gives warmth, like a thick iederdown.

Now, a few in soceity choose to deride,
To cast DOUBT on this anceint and venerable guide;
They unwittingly follow a foriegn agenda,
A plot hatched, I am sure, in some vile haceinda.

In our work and our liesure, our homes and our schools,
Let us follow our consceince, sieze proudly our rules!
Will I dilute my standards, make them vaguer and blither?
I say NO, I will not! I trust you will not iether.

]

[And this from Stephen Jones:

The wikipedia article is American, and thus biased against the British rule:
'i before 'e'
except after 'c'
when the sound is 'ee'.

which actually works for all but a handful of words ('seize' and 'protein' and 'Sheila' are the only ones I can find doing a search of the SOED).

Now, Wikipedia suggests that British pronunciation of 'sheikh' and 'either' is the result of applying the spelling rule to pronunciation. I am most dubious of this. It seems much more likely to me that the rule was imported to the USA from the UK, altered because of differences in American pronunciation, but, like most prescriptive rules never discarded.

For my pronunciation it is the most effective spelling rule I know.

]

Posted by Mark Liberman at November 18, 2006 05:37 AM