December 12, 2004

Typed citation links

James Tauber at journeyman of some, reacting to Google Scholar, asked:

...what if citation indices were annotated with the relationship between the newer publication and what it was citing? You could have relationships like "quotes", "summarises", "provides further evidence for", "argues against", "answers question posed by", and so on.

I agree that typed document links are a neat idea. One system that already does something like this is CiteSeer, which provides (as in this example) separate lists of related documents for each of seven link types: "Cited by", "Similar documents (at the sentence level)", "Active bibliography (related documents)", "Similar documents based on text", "Related documents from co-citation", "Citations", and "Documents on the same site".

Some of the kinds of links that James suggests ("provides further evidence for", "answers question posed by") are likely to be hard even for human readers to agree about. James notes one of the reasons for this:

The granularity of many articles might not be right for this to really work given that one might argue for one part of an article and argue against another.

There are also other reasons why it might be hard. People don't always see the same evidential connections, nor do they always agree about them when they're pointed out. Automatic procedures for finding such connections will disagree as well. Textual similarity relations of the type that CiteSeer uses for (some of) its links are similarly fuzzy, but most users recognize that fact, I think, and are easily able to use such relations for what they may be worth. Logical-sounding relations like "provides evidence for" might be (mis)taken more seriously. Still, I'd like to be able to explore a link graph that included such relations.
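To make "a link graph that included such relations" a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the class name TypedLinkGraph, its methods, and the document identifiers are invented, and the relation labels are simply borrowed from James's list. It is not how CiteSeer or any other index stores its links.

```python
from collections import defaultdict


class TypedLinkGraph:
    """A directed graph whose edges carry a relation label (hypothetical sketch)."""

    def __init__(self):
        # Maps each source document to a list of (relation, target) pairs.
        self._out = defaultdict(list)

    def add_link(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self._out[source].append((relation, target))

    def targets(self, source, relation=None):
        """Documents that `source` links to, optionally filtered by relation."""
        return [
            target
            for rel, target in self._out.get(source, [])
            if relation is None or rel == relation
        ]

    def sources(self, target, relation=None):
        """Documents that link to `target`, optionally filtered by relation."""
        found = {
            source
            for source, edges in self._out.items()
            for rel, tgt in edges
            if tgt == target and (relation is None or rel == relation)
        }
        return sorted(found)


# Illustrative data only; the document names are made up.
graph = TypedLinkGraph()
graph.add_link("paper-b", "argues against", "paper-a")
graph.add_link("paper-c", "provides further evidence for", "paper-a")
graph.add_link("paper-c", "quotes", "paper-b")

critics_of_a = graph.sources("paper-a", relation="argues against")
```

A graph like this stores whatever label it is given, so the hard part remains upstream: deciding, by hand or by machine, which label each link deserves.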

James muses that bloggers could start using typed hyperlinks as part of the process of composing entries:

I wonder if it might be more practical in blogs. People could link to this entry with annotations like "agree", "agree with additional ideas", "agree with caveats", "seen something like this already", "really dumb idea with reasons stated".

As James seems to intend, I guess this could be done using a sort of souped-up version of the rel attribute on HTML "a" or "link" elements. But this seems to require an agreed-on controlled vocabulary for such link types. If you let people add arbitrary comment-like annotations to such links, you'd get a large space of variants, both roughly equivalent ones like "really dumb idea with reasons stated", "silly idea, here's why", "follow this link to learn about the problems with this foolishness", "nonsense, for elaborately documented reasons", and also clines shading off in various directions like "apparently obvious idea that turns out to be false for interesting reasons", "idea that seems dumb at first but is actually profoundly true", and so on. Clustering techniques could be used to establish some sort of structure over these annotations, but you could do that with the original text around the link, without the writer-supplied metadata.
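Here is a rough sketch of what reading such annotations might look like, using Python's standard html.parser module. The vocabulary is an assumption: tokens like "agrees" and "disagrees" are invented for illustration and are not defined link types in any HTML specification. The function name extract_typed_links and the sample markup are likewise hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical controlled vocabulary. These tokens are illustrative only
# and are not part of any HTML specification.
VOCABULARY = {"agrees", "disagrees", "quotes", "summarises"}


class TypedLinkParser(HTMLParser):
    """Collects links and splits their rel tokens into known and unknown."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, known_types, unknown_types)

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel holds space-separated tokens; a missing or valueless
        # attribute is treated as having no tokens.
        tokens = (attr.get("rel") or "").lower().split()
        known = sorted(t for t in tokens if t in VOCABULARY)
        unknown = sorted(t for t in tokens if t not in VOCABULARY)
        self.links.append((href, known, unknown))


def extract_typed_links(html):
    """Return (href, known_types, unknown_types) for each link in `html`."""
    parser = TypedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup, for illustration only.
sample = (
    '<p><a href="http://example.org/post" rel="agrees silly-idea">a post</a> '
    'and <a href="/other">another</a></p>'
)
links = extract_typed_links(sample)
```

The "unknown" bucket is where the trouble described above shows up: every token outside the agreed vocabulary is a free-form variant that a reader's software can collect but can't interpret.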

When link types are treated as information supplied by writers rather than information created by an indexing process, the game changes. Indexers can structure a set of documents in lots of different ways for lots of different purposes, and still invent new types of links to try tomorrow. Users of such indices can pick and choose as they please in this evolving garden of relationships. But a set of link types to be used by writers can't be used by readers unless all the writers they care about have used roughly the same ones, in roughly the same way.

The trick would be to create a taxonomy of link-types -- or as James implies, link-dimensions -- that's close enough to orthogonal to span an interesting space of relationships, expressive and flexible enough to be useful to different sorts of people doing different sorts of things, and simple enough for most people to be willing to learn it and use it.
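One way to picture link-dimensions, as opposed to a flat list of link types, is as a small product of independent axes. The sketch below is purely an assumption on my part: the two axes (Stance and Support), their values, and the label format are invented to show the shape of the idea, loosely echoing James's "agree with caveats" examples. It is not a proposed standard.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Stance(Enum):
    """Hypothetical axis: the writer's attitude toward the linked document."""
    AGREE = "agree"
    DISAGREE = "disagree"
    NEUTRAL = "neutral"


class Support(Enum):
    """Hypothetical axis: what the writer adds alongside the stance."""
    NONE = "none"
    REASONS = "reasons stated"
    ADDITIONS = "additional ideas"
    CAVEATS = "caveats"


@dataclass(frozen=True)
class LinkType:
    """A link type is one value chosen on each axis."""
    stance: Stance
    support: Support

    def label(self):
        if self.support is Support.NONE:
            return self.stance.value
        return f"{self.stance.value}, with {self.support.value}"


# Every combination of the two axes: 3 stances x 4 kinds of support.
ALL_TYPES = [LinkType(s, p) for s, p in product(Stance, Support)]

example = LinkType(Stance.AGREE, Support.CAVEATS).label()
```

Two short axes already yield twelve link types, which illustrates the tradeoff: dimensions keep the vocabulary small enough to learn, but only if the axes really are close to independent.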

You could argue -- probably someone already has -- that HTML has been so successful because it's at a sweet spot of semantic incoherence, which allows writers to adopt vague and variable theories about what it all means, and readers to reconstruct some approximate analog of the writers' intentions. Just like language, maybe, but that's another story. I like the idea of adding some link types to the system, but it won't be easy to do it in a way that works.


Posted by Mark Liberman at December 12, 2004 06:04 AM