September 06, 2004

Why I don't use A9 much

Back in April, John Battelle blogged about it (here and here), Cory Kleinschmidt reviewed it on Traffick, Pamela Parker wrote about it at ClickZ News, and so on. I'm talking about Amazon's A9 search spin-off, and the "search inside the book" facility it offers. A9 offers two kinds of search -- web search, which is just Google's results repackaged, and "search inside the book", which is the main value added as far as I'm concerned.

I was pretty excited about this when it first came out, and I still have some hopes for the enterprise. But as things have turned out, I haven't really been able to use it for much. The number of cases where it tells me something that I hadn't already learned from Google is small, and the number of cases where it tells me nothing of value at all is large.

There seem to be roughly three reasons for this:

First, there are no quoted strings.

On Google, you can search for "to be or not to be" and get 169,000 pages that actually include the quoted string. If you search A9 for a quoted string, you always get no results (in the books category) -- quoted strings just don't work.
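The difference between a quoted-string search and an unquoted one can be sketched in a few lines of Python. This is only an illustration of the two matching styles, not a description of how Google or A9 is implemented. The function names and the second sample text are made up for the example.

```python
import re

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words_match(query, text):
    """True if every query word occurs somewhere in the text, in any order."""
    return set(tokens(query)) <= set(tokens(text))

def phrase_match(query, text):
    """True only if the query words occur consecutively and in order."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))

query = "to be or not to be"
hamlet = "To be, or not to be, that is the question."
# Invented sample text that happens to contain all the query words.
unrelated = "What not to wear, or how to be stylish"

print(phrase_match(query, hamlet))           # quoted-string style
print(phrase_match(query, unrelated))
print(bag_of_words_match(query, unrelated))  # unquoted style
```

A phrase match accepts only the first text, while a bag-of-words match also accepts the second, because it ignores word order and adjacency.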

For unquoted word sequences, A9's ranking algorithm seems to give precedence to results in which the words are near one another in the same order. As a result, some such searches work. For example, after Edward Everett wrote to Abraham Lincoln that "I should have been glad if I could flatter myself that I came as near to the central idea of the occasion in two hours as you did in two minutes", Lincoln wrote back that "In our respective parts yesterday, you could not have been excused to make a short address, nor I a long one."

Of course, searching Google for "you could not have been excused to make a short address" returns 8 pages about Lincoln's letter.

Searching A9 for the same words (without the quotes) returns 26,735 pages, of which the top two are relevant:
p. 152 of The Civil War: Strange and Fascinating Facts, by Burke Davis; and
p. 67 of Talking Politics: The Substance of Style from Abe to W, by Michael Silverstein (where I took the quote from originally).
But after that, the results go bad in a hurry. The next few results point to completely irrelevant pages of Adam Haslett's You Are Not a Stranger Here; Renee Rosenblum-Lowden's You Have to Go to School--You're the Teacher!; and Beverly Engel's Loving Him Without Losing You: How to Stop Disappearing and Start Being Yourself. As far as I can see, it doesn't get any better after that.

Searching for "to be or not to be", there's no cream to skim. The top four results (of 234,822) are:

1. Hugh Hewitt's If It's Not Close, They Can't Cheat: Crushing the Democrats in Every Election and Why Your Life Depends on It;
2. Behrendt & Tuccillo's He's Just Not That Into You : The No-Excuses Truth to Understanding Guys;
3. Woodall et al.'s What Not to Wear;
4. that classic of Shakespearean drama, NOT "Just Friends": Rebuilding Trust and Recovering Your Sanity After Infidelity, by Shirley Glass.

It didn't get any better on subsequent pages, at least not before my patience wore out.

Where do they get these links from? Believe me, they don't reflect a generalization from Amazon's experience of my own recent book-buying history. Instead, the list seems to be some effectively random amalgam of bag-of-words hits and Amazon sales rank.
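To make that guess concrete, here is a minimal sketch of a score that blends word overlap with sales rank. It is purely hypothetical: the formula, the weight, the book labels and the sales ranks are all invented for illustration, and nothing here is known about A9's actual ranking.

```python
def blended_score(query, text, sales_rank, weight=0.5):
    """Mix the fraction of query words found in the text with a popularity term.

    sales_rank 1 means best seller; the formula and weight are assumptions.
    """
    q = set(query.lower().split())
    t = set(text.lower().split())
    overlap = len(q & t) / len(q)
    popularity = 1.0 / sales_rank
    return weight * overlap + (1 - weight) * popularity

query = "to be or not to be"
# (label, indexed text, made-up sales rank)
books = [
    ("obscure edition of Hamlet", "to be or not to be that is the question", 50000),
    ("popular self-help title", "what not to wear", 1),
]

ranked = sorted(books, key=lambda b: blended_score(query, b[1], b[2]), reverse=True)
for label, text, rank in ranked:
    print(label, round(blended_score(query, text, rank), 5))
```

With these invented numbers, the best seller that shares only half the query words outranks the obscure book that contains the whole phrase, which is the kind of result list shown above.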

Second, only a limited subset of books is indexed.

If you search A9 for marthambles, you'll find the example on p. 244 of Dorothy Dunnett's The Ringed Castle that I blogged about (also here and here, if you're interested), and a reference to p. 186 of Dean King's Patrick O'Brian: A Life, but you won't be able to answer that burning question "where in O'Brian's novels does the word marthambles occur?", because his novels aren't indexed.

A story from October 2003 reports that Amazon had indexed "over 33 million book pages from over 120,000 titles". Presumably more have been added since, though I can't find any more recent counts. However, there must be many more out there to index -- there appear to be about 120,000 (distinct) books published each year in the U.S.

Third, sales rank is rarely a good substitute for page rank.

Search Google for "animal communication" or "first rule of fiction", and you'll find some useful links on the first couple of pages. Skim a dozen or so of the links, and you'll get a pretty good sense of what's going on.

Now search A9 for the same word sequences. In the case of "animal communication", you'll find that three of the top ten results are about telepathic communication with animals, and four others are about how to communicate with your horse or your cat. Three are expensive scientific tomes in which you can't actually read anything without buying them -- they've been found because animal and communication are in the title. Oddly, one of the links is to Norbert Wiener's classic Cybernetics, which I've read several times without really noticing the subtitle "Control and Communication in the Animal and the Machine". Everyone should read this book, of course, but if you ordered it with the idea that it would tell you anything about animal communication, you'd be sadly disappointed.

In the case of "first rule of fiction", you'll get four references to Terry Goodkind's Wizard's First Rule; a link to Ann Blakely's Never Wear Panties on a First Date and Other Tips; Daniel Magida's The Rules of Seduction; Heather Lewis' novel House Rules; and Ann Rule's novel Possession (!). Enjoy...


Posted by Mark Liberman at September 6, 2004 02:55 PM