April 09, 2006

The meaning (or not) of links

In the world of weblogs, textual links are either demonstrative, like this, or they're footnote-like provisions of background information that the author thinks might be helpful. (I'm leaving out navigational linkage, which is often quite different.) Those hyperlink conventions are pretty much the ones that developed earlier in other sorts of web text. But over the past few years, newspapers have started experimenting with hyperlinks in their news text; and in my opinion, the results are generally weird.

Several years ago, the New York Times started tagging company names with links to company-associated pages on the NYT business site. But there was no serious attempt to ensure that the tagged names actually have any connection to the company: thus back in March of 2004, I noted a case where "Laura Fluor, a car saleswoman from Monmouth County, N.J" got a link to the Fluor Corporation.

Since then, the Times has either instituted better entity-tagging algorithms, or put humans in the loop: today I can read a dozen stories online without finding any examples like that one. However, the links are still a little strange.

For example, a story by Joseph Berger on the rising prices of suburban homes ("Homes Too Rich for Firefighters Who Save Them") has three hyperlinks. One is to "Steve Levy" in this context:

Steve Levy, the Suffolk County executive, said the problem went beyond civil servants.

The second one is to "Harvard":

"There are parts of the country, particularly the two coasts, where the price of housing has so outstripped any income gains that moderate wage earners find it difficult to find a decent home in the community where they work," said Nicolas Retsinas, director of the Joint Center for Housing Studies at Harvard and a former assistant federal secretary for housing.

The third one is to "Martha Stewart":

In the town of Bedford, made up of the hamlets of Bedford, Bedford Hills and Katonah, the median household income for its 18,600 residents is more than $100,000, with celebrated residents like Martha Stewart making a good deal more. Volunteers are increasingly coming from outside Bedford's bounds.

Levy is arguably an important figure in the context of the story, deserving the hyperlink equivalent of a footnote. But the references to Harvard and to Martha Stewart are tangential, and the existence of the hyperlinks implicates a degree of relevance that they lack. The links are especially odd, given that many more relevant "named entities" are not given links -- places like Westchester County, organizations like Habitat for Humanity, and so on.

It's pretty clear what's going on. There's an index of Times Topics, which "correspond to the most frequently assigned subject, geographic, organization and personal name headings". Stories are indexed (automatically?) relative to that (finite and fairly small) list of topics. Thus Berger's story on suburban home prices is linked to Harvard, even though the only connection is a quote in the 14th paragraph from someone who works there; and to Martha Stewart, even though she is only mentioned in the 25th paragraph as an example of one of the wealthy residents of the town of Bedford.
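A minimal sketch of what such list-driven linking might look like, assuming the simplest possible procedure: link the first mention of anything on a fixed topic list, wherever it occurs and however marginal it is. The Times has not described its method, so the topic list, the function name link_topics, and the filler paragraph below are all invented for illustration.

```python
import re

# Hypothetical stand-in for the Times Topics index; the real list and
# the real matching procedure are not described in the post.
TOPICS = ["Harvard", "Martha Stewart", "Steve Levy"]


def link_topics(paragraphs, topics=TOPICS):
    """Return (topic, paragraph_number) for the first mention of each topic."""
    links = []
    for topic in topics:
        pattern = re.compile(r"\b" + re.escape(topic) + r"\b")
        for number, text in enumerate(paragraphs, start=1):
            if pattern.search(text):
                links.append((topic, number))
                break  # only the first mention gets a link
    return links


# Abbreviated excerpts from the story; paragraph 2 is invented filler.
story = [
    "Steve Levy, the Suffolk County executive, said the problem went beyond civil servants.",
    "(Invented filler mentioning Westchester County and Habitat for Humanity.)",
    "... director of the Joint Center for Housing Studies at Harvard ...",
    "... with celebrated residents like Martha Stewart making a good deal more.",
]

links = link_topics(story)
```

Nothing in this procedure looks at how central a mention is, so Harvard and Martha Stewart get links on the strength of a single passing reference, while Westchester County and Habitat for Humanity get none simply because they are not on the list.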

Sometimes the Times Topics links seem pragmatically even stranger. Today's 7,800-word NYT Sunday Magazine piece by Jack Hitt on abortion in El Salvador, "Pro-Life Nation", has links on the words and phrases abortion, pregnancy, mental health, Pope John Paul II, U.S. Supreme Court, suicide, ulcer, hepatitis, U.S. Senate, and smoking, although many of these are entirely marginal references, e.g.

The women's prison where convicted murderers are sent is in the outer district of Tonacatepeque. ... Through a small window, I could see an open area crisscrossed by laundry lines and arrayed by different women lying around smoking.

There are no links for many more relevant items in the same story: El Salvador, South Dakota, Roe v. Wade, Opus Dei, Center for Reproductive Rights, Yes to Life Foundation, and so on.

But it could be worse. Following some links in my Sunday morning reading on the web, I happened on Kaelen Wilson-Goldie's review in The Daily Star of Brian Whitaker's "Unspeakable Love", published under the headline "Briton's book gives voice to gay Arabs". It appears that the (Beirut) Daily Star has started selling words in its stories, somewhat in the way that Google sells AdWords relative to users' search terms. I've seen this in the online edition of other publications as well, but The Daily Star doesn't just put perhaps-relevant ads in the margins; it underlines the words and then flashes the ad as a mouse-over event. The first paragraph of the review contains four ad-linked words:

When Salim, a 20-year-old Egyptian, told his family that he was gay, they packed him off for six months of psychiatric treatment. When Ali, a teenager from Lebanon, was discovered to be gay, his father broke a chair over his head and his brother threatened to kill him for tarnishing the family honor. Ali left home and no longer has any contact with his relatives.

And the (mouse-over pop-up) ads for those words start:

family Find Family Practice Opportunities: At M****** M****, we match experienced physicians with respected health care organizations. ...
Lebanon Visiting Lebanon? Find cheap flights and hotel rates ...
chair Find a wide selection of lift chairs, various colors and sizes ...
home Get money-saving tips by taking ***'s home energy survey.

I doubt that "... his father broke a chair over his head..." is the sort of context in which the company selling lift chairs really hoped to find customers.

It's clear that the algorithm is a simple keyword match. Thus words are matched inside names, so that the House inside Zico House

Launched in Beirut on Wednesday night with a book signing at Zico House and a party at Walima...

rates an ad that starts "What's your home worth? Thinking of selling your home? Wonder how much it might be worth? You can find out with a free home valuation at ..."

And words are also taken out of idioms, e.g. credit in

To his credit, Whitaker does not shy away from but rather dives into the murky questions surrounding homosexuality in the Middle East.

which yields "Finding the best credit card deal is now easy ..."
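The behavior described above is what a context-free keyword lookup would produce. Here is a rough sketch under stated assumptions: the keyword table is hypothetical (the ad text is abbreviated from the examples above), the idea that "house" is itself a sold keyword is a guess, and linking only the first occurrence of each keyword is also a guess.

```python
import re

# Hypothetical keyword-to-ad table. The ad text is abbreviated from the
# examples in the post; which keywords are actually sold is an assumption.
ADS = {
    "family": "Find Family Practice Opportunities ...",
    "lebanon": "Visiting Lebanon? Find cheap flights and hotel rates ...",
    "chair": "Find a wide selection of lift chairs ...",
    "home": "Get money-saving tips ...",
    "house": "What's your home worth? ...",
    "credit": "Finding the best credit card deal is now easy ...",
}


def naive_ad_matches(text, ads=ADS):
    """Return (word, ad) pairs for the first occurrence of each sold keyword.

    The match ignores case and context entirely, so words inside names
    and idioms are tagged like any other word.
    """
    seen = set()
    matches = []
    for word in re.findall(r"[A-Za-z]+", text):
        key = word.lower()
        if key in ads and key not in seen:
            seen.add(key)
            matches.append((word, ads[key]))
    return matches


in_name = naive_ad_matches("a book signing at Zico House and a party at Walima")
in_idiom = naive_ad_matches("To his credit, Whitaker does not shy away")
```

With this table, in_name contains a single pair for "House" and in_idiom a single pair for "credit": the name and the idiom are treated exactly like ordinary uses of the word.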

So here's a new application for text-tagging algorithms: not doing information extraction for "data mining" in text, but rather finding textual references that are genuinely appropriate for triggering advertisements.

[If you're thinking of patenting this idea, consider this post to be evidence of prior art, since the implementation of the idea (to some reasonable degree of performance) is a trivial application of the existing technology of stochastic taggers. ]
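To make the idea concrete, here is a deliberately crude, rule-based stand-in for the kind of filtering a real tagger would do. It is not a stochastic tagger: it only skips keywords that follow another capitalized word (a likely name) or that fall inside a listed idiom. The idiom list and the function name filtered_keywords are hypothetical.

```python
import re

# Hypothetical idiom stoplist; a real system would need far broader coverage.
IDIOMS = ["to his credit", "to her credit", "to their credit"]


def filtered_keywords(text, keywords, idioms=IDIOMS):
    """Return keyword occurrences that survive two crude context checks."""
    lowered = text.lower()
    # Character spans covered by a known idiom.
    blocked = [
        (m.start(), m.end())
        for idiom in idioms
        for m in re.finditer(re.escape(idiom), lowered)
    ]
    tokens = list(re.finditer(r"[A-Za-z]+", text))
    kept = []
    for i, tok in enumerate(tokens):
        word = tok.group()
        if word.lower() not in keywords:
            continue
        # Check 1: the keyword is part of an idiom.
        if any(start <= tok.start() < end for start, end in blocked):
            continue
        # Check 2: a capitalized keyword directly after another capitalized
        # word is probably part of a name, as in "Zico House".
        if i > 0:
            prev = tokens[i - 1]
            gap = text[prev.end():tok.start()]
            if word[0].isupper() and prev.group()[0].isupper() and gap == " ":
                continue
        kept.append(word)
    return kept


KEYWORDS = {"house", "home", "credit", "chair"}
name_case = filtered_keywords("a book signing at Zico House and a party", KEYWORDS)
idiom_case = filtered_keywords("To his credit, Whitaker does not shy away", KEYWORDS)
chair_case = filtered_keywords("his father broke a chair over his head", KEYWORDS)
```

The two rules dispose of the "Zico House" and "To his credit" cases, but chair_case still contains "chair": deciding that a chair broken over someone's head is a poor lead for lift-chair sales takes a real model of context, which is exactly the job for the tagging technology mentioned above.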

[Update: Matt Hutson has an example showing that the NYT has not yet worked all the kinks out of their links:

A November 6 NYT Mag article on literary Darwinism mentions Harvard psych prof Stephen Kosslyn. The word "Stephen" is hyperlinked to the Times Topics page for Stephen Sondheim.

]

Posted by Mark Liberman at April 9, 2006 05:42 PM