In the article “The Nature of Indexing: How Humans and Machines Analyze Messages and Texts for Retrieval. Part II: Machine Indexing, and the Allocation of Human Versus Machine Effort,” Anderson reviews the effectiveness of human versus machine approaches to indexing. He discusses the advantages and disadvantages of each and offers recommendations on how best to allocate indexing effort between humans and machines. Problems arise when the words being indexed are difficult for machines to decipher, and when all documents are treated as equally important.
Machine indexing is the analysis of text by computer algorithms. Indexing in this manner is based on the word, defined here as one or more characters separated by spaces or punctuation. Problems arise when punctuation is not a separator of words but part of a word, as with apostrophes, dashes, and slashes. Researchers have suggested defining a word simply as a sequence of characters, without regard to spaces or punctuation. Problems also arise with single-character words, like “I” and “a,” and with retaining capital letters. Automatic indexing began with retrieving exact occurrences of the characters a user inputs into a search. One attempt at reducing the output of a search used negative vocabulary control: a “stop list” of insignificant words (e.g., “the” or “an”).
Search results were sorted by counting the frequency of the search words in a document (term frequency: TF) and then weighting that frequency by how widely the words occur across documents in the collection (inverse document frequency: IDF), in a further attempt to return more relevant results. A further improvement on counting and weighting was stemming, or removing common prefixes and suffixes (re-, -er, or -s), though this sometimes changes the meaning of a word (index versus indexer versus indexing). Procedures to keep certain terms together (junior college versus a junior in college) have been expensive and time-consuming, so they are still in the research phase. Clustering, which groups words or documents based on the co-occurrence of search terms, is being studied as a promising producer of relevant documents; related documents are sometimes “suggested” to searchers through clustering. Latent semantic indexing (LSI) is a more advanced clustering technique that eliminates problems with homonyms.
To improve indexing effectiveness, documents should not be treated as equally important. It is suggested that human indexers augment important documents so they are more accessible; the goal is that every search returns relevant documents. Bradford’s Law describes the great dispersion of relevant documents: even with specific term searching, perhaps only 50% of the documents returned are relevant. Some argue that using human indexers to index only what they deem important is censorship. Zipfian distributions describe several patterns of irrelevant document returns. Even the best human indexing will not result in 100% relevant documents returned, but design tools are needed to help users sift through the weeds more efficiently. More important documents can be identified by: 1) high rates of use; 2) high rates of citation; 3) availability in physical print; 4) awards won; 5) searcher nomination; 6) advisory boards behind the documents; 7) identification as important by indexers; 8) exemplary documents (though these would already be marked by indexers in #7).
Anderson attempts to encourage researchers to return to basics: determine how machines should index so that users can find the documents they are looking for. While he has won me over to the view that this research deserves time, his article is speckled with items that are not explained well and perhaps should not be listed at all. Early on, he mentions that two major models of automatic indexing have emerged over time, the vector-space model and the probabilistic model, but there is little to no explanation of why these were listed. Wouldn’t an author or publisher want to supply the best indexing terms for their documents so users could find them? Much is still to be learned about indexing, and this article raises more questions than answers.