It has been said, “There is no substitute for experience, but letting your wife do it is the next best thing.”1 This colloquialism expresses an idea more profound than an initial reading might suggest: a personal, first-hand, internalized knowledge of information is ideal, since it is entirely available to the individual at the point of need – assuming, of course, that it can be remembered. Otherwise, a substitute that points the individual to the needed information is the next best thing. In the real world, such substitutes become the practical ideal, since not everyone has the same knowledge or vocabulary. The illustration here is clear: the use of surrogate records to point to information resources is, for a multiplicity of reasons, the most practical – and therefore the only real – solution to the problems inherent in information representation and access.
The popularity of many full-text databases is likely attributable to their seeming ease of use, though, ironically, the simpler user interfaces usually require more non-intuitive and advanced knowledge to search effectively. Anyone can enter “jaguar” into Google’s single search box, but few know how to limit the results to the car, the old Mac operating system, or the animal. Yet convincing a searcher that there are better, more efficient ways to arrive at a desired set of results is not an easy task.
One impediment to persuading searchers to learn what they consider needlessly complicated and irrelevant search syntax in full-text databases is convincing them that an intermediary layer between them and the text (or other information resource) is often more efficient. Understandably, most searchers balk at the thought of distancing themselves from the information in order to find it. It seems counter-intuitive. Who are we, anyway, to dictate the terms under which they can access information? Herein lies the rub, however. Without a system that quite literally does exactly that, most information resources will be less likely to be identified by the majority of searches. There are too many difficulties inherent in present-day full-text indexing methods for searches to yield accurate and comprehensive results, and someone must indeed dictate the terms under which a resource can be found.
Full-text indexing is accomplished automatically; that is, it is a computerized process that extracts terms according to a defined algorithm. The process can be complex in execution but is simple in conception: lexical analysis and term selection. Lexical analysis is the process by which formatted, punctuated, inflected text is dismantled into unformatted, uninflected words. These tokens, as they are frequently called, then undergo the term selection process, in which certain stop-words are removed. Some words are “stemmed,” or truncated, to remove any inflection from their roots and to group lexically related words under their simplest form. Others, such as hyphenated words, are broken into their constituent parts. The terms are then “weighted” to determine their relative importance, usually based on their frequency of occurrence.
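The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the stop-word list and suffix-stripping rules are toy stand-ins for the far more sophisticated lists and stemmers (e.g., the Porter stemmer) that real systems use.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemming algorithm

def stem(token):
    """Naively strip a common suffix to group lexically related words."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_text(text):
    """Lexical analysis + term selection: lowercase, break hyphenated words
    apart, drop punctuation and stop-words, stem, then weight by frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())  # splits "full-text" into two tokens
    terms = [stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)  # term frequency serves as the weight

weights = index_text("The indexer indexes full-text documents; indexing is automated.")
```

Note that even this tiny example exposes the method’s limits: “indexes” and “indexing” collapse to the same stem, but “indexer” survives as a separate term, since no rule catches it.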
The benefits of this type of indexing are, in my judgment, few but important. Full-text indexing is inexpensive and is becoming increasingly so. This is no small benefit. Libraries are chronically under-funded, and the bottom line is always a concern. Database vendors, the primary producers of such databases, are for-profit businesses. Taken together, under-funded libraries and profit-driven vendors are constantly engaged in a tug-of-war as each pleads its case. Full-text indexing, though it often requires a high initial investment, appeals to both for the same reason: it is affordable.
The second important benefit of full-text indexing is that it removes the inconsistencies that result from the use of manual indexers. Spelling variants between indexers (color or colour? indexes or indices?), as well as the inevitable inconsistencies within a single indexer’s own work, are avoided by an indexing algorithm’s prescribed procedures, which will be followed correctly every time. Consistency is no small benefit either. Without it, the architectonic purpose of indexing is nullified.
These benefits are important. Taken together with searchers’ increasing expectation of full-text search capabilities, they make a strong argument for the implementation of full-text indexing of information resources – especially of textually-based resources. Lest we rob Peter to pay Paul, however, there are further considerations to weigh.
A surrogate record is “a presentation of the characteristics . . . of an information resource.”2 When referring to surrogate records in a catalog of bibliographic resources, this metadata typically includes three primary types of information: descriptive data, subject data, and classification data. These records are used to help render the resources for which they stand as intermediaries more identifiable to searchers. They do not provide the resource per se, but point to the resource. These records are no longer singular in their directionality, however. Rather, properly created surrogate records provide multiple points of access to the resource through fields such as subjects and classifications, as well as the author’s name and the resource’s title. Indeed, the access points in contemporary surrogate records render the record multidirectional and allow the resource to be identified via several avenues.
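The structure of such a record can be sketched as a simple data type. The field names and sample values below are illustrative assumptions, not a real cataloging format (an actual catalog would use something like MARC), but they show how one record yields several access points.

```python
from dataclasses import dataclass

@dataclass
class SurrogateRecord:
    """Illustrative sketch of a surrogate record with multiple access points."""
    title: str
    author: str            # authorized form of the name
    subjects: list         # controlled subject headings
    classification: str    # e.g., a classification number

    def access_points(self):
        """Every avenue by which the resource can be identified."""
        return [self.author, self.title, *self.subjects, self.classification]

record = SurrogateRecord(
    title="The Lion, the Witch and the Wardrobe",
    author="Lewis, C. S. (Clive Staples), 1898-1963",
    subjects=["Fantasy fiction", "Children's stories, English"],
    classification="PR6023.E926",  # hypothetical class number for illustration
)
```

A search matching any one of the five access points here identifies the same resource, which is the “multidirectionality” described above.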
The crux of this argument lies in the appropriation of controlled vocabulary – a process which heretofore has proven elusive to automatic methods. Controlled vocabulary in a surrogate record includes the normalization of spelling and the assignment of preferred terminology to address homographic and synonymic issues, thereby reducing ambiguity. For example, without some terms being dictated, one would not know whether to look under “C. S. Lewis” or “Clive Staples Lewis” as an author. The task of pursuing both in full-text searches becomes cumbersome without complicated syntax. The application of an authoritative term is really quite valuable.
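The mechanism at work is an authority file: variant forms of a name all map to one preferred heading. The toy mapping below is an assumption for illustration; real authority control relies on maintained files such as the Library of Congress name authorities.

```python
# Toy authority file: every variant form maps to a single preferred heading.
PREFERRED = "Lewis, C. S. (Clive Staples), 1898-1963"
AUTHORITY = {
    "c. s. lewis": PREFERRED,
    "clive staples lewis": PREFERRED,
    "lewis, c. s.": PREFERRED,
}

def authorized_form(name):
    """Normalize a searcher's query to the preferred heading, if one is on file."""
    return AUTHORITY.get(name.strip().lower(), name)
```

Whichever variant the searcher types, the catalog searches under one dictated term – exactly what a raw full-text search cannot guarantee.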
Homographic problems are also illustrative of the usefulness of surrogate records. Does “Mercury” refer to the planet, the metal, the automobile, or the mythological god? Full-text indexing has no way to differentiate them. Controlled vocabularies offer a multiplicity of solutions, and in the case of subject classification as manifested in a catalog’s surrogate record for a bibliographic item, they render resources on each of these possibilities uniquely identifiable.
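One common solution is the qualified heading, where a parenthetical qualifier splits the homograph into distinct terms. The headings and titles below are invented for illustration and are not actual LCSH forms, but they show how a subject search on a qualified term retrieves only the intended sense of “Mercury.”

```python
# Illustrative records, each carrying a qualified subject heading.
records = [
    {"title": "Surface Geology of the Innermost Planet", "subject": "Mercury (Planet)"},
    {"title": "Toxicology of Heavy Metals", "subject": "Mercury (Metal)"},
    {"title": "Myths of the Roman Pantheon", "subject": "Mercury (Roman deity)"},
]

def subject_search(subject):
    """Retrieve titles whose controlled heading matches exactly."""
    return [r["title"] for r in records if r["subject"] == subject]
```

A full-text search for “mercury” would return all three; the qualified search returns exactly one, which is the precision argued for below.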
Such precision is perhaps the strongest benefit of this approach, and it is important enough to outweigh the approach’s potential weaknesses. Admittedly, indexing to produce surrogate records with controlled access points allows for a number of lesser problems. Foremost among these is cost. At present, no automated process is sufficient for the task, so controlled vocabularies must be applied manually – a rather costly endeavor. This cost is offset somewhat by collaborative cataloging, a fact on which I rely in counting it a lesser problem in comparison to the benefit of precision. Inconsistency (both intra- and inter-indexer) will always be a potential whenever human indexers are involved. Additionally, and commonly, searchers choose terms not included by indexers.
These potential problems have prompted many to attempt to bridge the divide between full-text indexing and manual indexing with the use of computer programs. More specifically, projects are underway which endeavor to link the primary terms gleaned automatically through the aforementioned stemming programs and the like with particular controlled subject vocabularies such as the Library of Congress classification scheme. These ongoing projects are exciting developments in the field and hold promise, but they are not yet viable for widespread use.
Surrogacy is a term that instantly brings to mind the idea of a substitute. It may seem counter-intuitive to render a resource more findable by inserting an artificial layer between the resource and the searcher, but such is the case in the modern indexing world. Full-text indexing is gaining in popularity, but it is my judgment that until automated indexing can solve its various problems of inaccuracy by providing clear, accurate, and specific results, someone must do the indexing by hand. The only practical way for this to happen is through the creation of records containing information about the resource that provides the user with multiple points of access for the identification of the resource. As long as physical collections of resources are the locus of consideration, only some system of surrogacy will allow for a collocated organization of the collection. In other words, surrogacy is the way to go – it removes much of the labor!
1 Evan Esar, 20,000 Quips & Quotes (New York: Barnes & Noble Books, 1995), p. 284.
2 Arlene G. Taylor and Daniel N. Joudrey, The Organization of Information, 3rd Edition (Westport, CT: Libraries Unlimited, 2009) p. 473.