IRList Digest            Tuesday, 15 November 1988      Volume 4 : Issue 55

Today's Topics:
   Abstracts - Recent SIGIR Forum

News addresses are
   Internet: fox@vtopus.cs.vt.edu
   BITNET: foxea@vtcc1.bitnet (replaces foxea@vtvax3)

----------------------------------------------------------------------

Date: Fri, 23 Sep 88 09:24:31 CDT
From: "Dr. Raghavan"
Subject: Abstracts in most recent ACM SIGIR Forum

... [Note: I have attempted to strip out all format codes except ones bracketed by "$" for equations. - Ed.]

ABSTRACTS
(Chosen by G. Salton from recent issues of journals in the retrieval area.)

ONLINE TEXT RETRIEVAL VIA BROWSING
J. F. Cove and B. C. Walsh, Department of Computer Science, University of Liverpool, Liverpool L69 3BX, England.
Browsing refers to information retrieval where the initial search criteria are generally quite vague. The fundamentals of browsing are explored as a basis for the creation of an intelligent computer system to assist with the retrieval of online information. Browsing actions via a computer terminal are examined, together with new methods of accessing text and satisfying user queries. Initial tests with a prototype system illustrated the use of different retrieval strategies when accessing online information of varying structure. The results suggest the construction of a more intelligent processing component to provide expanded capabilities for content extraction and navigation within text documents.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 24, No. 1, pp. 31-37, 1988)

AN APPROACH TO THE EVALUATION OF CATALOG SELECTION SYSTEMS
Caroline M. Eastman, Department of Computer Science, University of South Carolina, Columbia, SC 29208.
The similarities between classification systems for catalog selection and information retrieval systems indicate that similar evaluation methodologies might well be appropriate. The characteristics of classification systems and of information retrieval systems are summarized, and two catalog selection systems (GRANT and Grundy) are presented as examples. The contributions of this article are a discussion of the system characteristics that allow the use of measures such as recall and precision in evaluation and a brief overview of related research within the field of information retrieval.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 24, No. 1, pp. 23-30, 1988)

AN IMPROVED ALGORITHM FOR THE CALCULATION OF EXACT TERM DISCRIMINATION VALUES
Abdelmoula El-Hamdouchi and Peter Willett, Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK.
The term discrimination model provides a means of evaluating indexing terms in automatic document retrieval systems. This article describes an efficient algorithm for the calculation of term discrimination values that may be used when the interdocument similarity measure used is the cosine coefficient and when the document representatives have been weighted using one particular term-weighting scheme. The algorithm has an expected running time proportional to $Nn^2$ for a collection of $N$ documents, each of which has been assigned an average of $n$ terms.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 24, No. 1, pp. 17-22, 1988)
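[Note: As a rough illustration of the term discrimination model discussed in the preceding abstract (and in the Crouch abstract later in this issue), the sketch below computes naive "exact" discrimination values, with space density taken as the average pairwise cosine similarity. It does not reproduce the paper's optimized $Nn^2$ algorithm or its particular term-weighting scheme; the tiny collection and plain term-frequency weights are invented for illustration. - Ed.]

    # Naive exact term discrimination values: density is the average pairwise
    # cosine similarity, and a term's discrimination value is the density
    # increase observed when that term is removed from every document.
    import math
    from itertools import combinations

    docs = [
        {"retrieval": 2, "index": 1, "query": 1},
        {"retrieval": 1, "index": 2, "database": 1},
        {"query": 2, "database": 1, "index": 1},
    ]

    def cosine(u, v):
        num = sum(w * v.get(t, 0) for t, w in u.items())
        den = math.sqrt(sum(w * w for w in u.values())) * \
              math.sqrt(sum(w * w for w in v.values()))
        return num / den if den else 0.0

    def density(collection):
        pairs = list(combinations(collection, 2))
        return sum(cosine(a, b) for a, b in pairs) / len(pairs)

    base = density(docs)
    for term in sorted({t for d in docs for t in d}):
        without = [{t: w for t, w in d.items() if t != term} for d in docs]
        print(f"{term:10s} DV = {density(without) - base:+.4f}")  # positive = good discriminator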
PREDICTING DOCUMENT RETRIEVAL SYSTEM PERFORMANCE: AN EXPECTED PRECISION MEASURE
Robert M. Losee, Jr., School of Information and Library Studies, University of North Carolina, Chapel Hill, NC 27514, USA.
Document retrieval systems based on probabilistic or fuzzy logic considerations may order documents for retrieval. Users then examine the ordered documents until deciding to stop, based on the estimate that the highest ranked unretrieved document will be most economically not retrieved. We propose an expected precision measure useful in estimating the performance expected if yet unretrieved documents were to be retrieved, providing information that may result in more economical stopping decisions. An expected precision graph, comparing expected precision versus document rank, may graphically display the relative expected precision of retrieved and unretrieved documents and may be used as a stopping aid for online searching of text data bases. The effectiveness of relevance feedback may be examined as a search progresses. Expected precision values may also be used as a cutoff for systems consistent with probabilistic models operating in batch modes. Techniques are given for computing the best expected precision obtainable and the expected precision of subject neutral documents.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 6, pp. 529-537, 1987)

AN ANALYSIS OF APPROXIMATE VERSUS EXACT DISCRIMINATION VALUES
Carolyn J. Crouch, Computer Science Department, Tulane University, New Orleans, LA 70118, USA.
Term discrimination values have been used to characterize and select potential index terms for use during automatic indexing. Two basic approaches to the calculation of discrimination values have been suggested. These approaches differ in their calculation of space density; one method uses the average document-pair similarity for the collection, and the other constructs an artificial, "average" document, the centroid, and computes the sum of the similarities of each document with the centroid. The former method has been said to produce "exact" discrimination values and the latter "approximate" values. This article investigates the differences between the algorithms associated with these two approaches (as well as several modified versions of the algorithms) in terms of their impact on the discrimination value model by determining the differences that exist between the rankings of the exact and the approximate discrimination values. The experimental results show that the rankings produced by the exact approach and by a centroid-based algorithm suggested by the author are highly compatible. These results indicate that a previously suggested method involving the calculation of exact discrimination values cannot be recommended in view of the excessive cost associated with such an approach: the approximate (i.e., "exact centroid") approach discussed in this article yields a comparable result at a cost that makes its use feasible for any of the experimental document collections currently in use.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 24, No. 1, pp. 5-16, 1988)

AN EXPERT SYSTEM FOR MACHINE-AIDED INDEXING
Clara Martinez, John Lucey, and Elliott Linder, American Petroleum Institute, 156 William St., New York, New York 10038.
The Central Abstracting & Indexing Service of the American Petroleum Institute (API-CAIS) has successfully applied expert system techniques to the job of selecting index terms from abstracts of articles appearing in the technical literature. Using the API Thesaurus as a base, a rule-based system has been created that has been in productive use since February 1985. The index terms selected by computer are reviewed by a human index editor, as are the terms selected by CAIS's human indexers. After editing, the terms are used for printed indexes and for online computer searching.
(JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, Vol. 27, pp. 158-, 1987)
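[Note: The API-CAIS rule base itself is not described in the abstract above; the sketch below is only a toy illustration of thesaurus-driven index-term selection, with invented thesaurus entries and a made-up sample sentence. - Ed.]

    # Toy sketch of thesaurus-based index-term selection: surface forms in the
    # text are mapped to preferred thesaurus terms; proposals would still be
    # reviewed by a human index editor.
    import re

    # Preferred term -> surface forms that should map to it (invented examples).
    thesaurus = {
        "CATALYTIC CRACKING": ["catalytic cracking", "cat cracking"],
        "CRUDE OIL":          ["crude oil", "crude"],
        "DESULFURIZATION":    ["desulfurization", "sulfur removal"],
    }

    def propose_index_terms(abstract_text):
        text = abstract_text.lower()
        proposed = set()
        for preferred, variants in thesaurus.items():
            if any(re.search(r"\b" + re.escape(v) + r"\b", text) for v in variants):
                proposed.add(preferred)
        return sorted(proposed)

    sample = "Sulfur removal from crude oil prior to catalytic cracking is reviewed."
    print(propose_index_terms(sample))
    # ['CATALYTIC CRACKING', 'CRUDE OIL', 'DESULFURIZATION']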
HISTORICAL NOTE: INFORMATION RETRIEVAL AND THE FUTURE OF AN ILLUSION
Don R. Swanson, Graduate Library School, University of Chicago, 1100 East 57th Street, Chicago, IL 60637.
More than thirty years ago there was good evidence to suggest that information retrieval involved conceptual problems of greater subtlety than is generally recognized. The dramatic development and growth of online services since then seems not to have been accompanied by much interest in these conceptual problems, the limits they appear to impose, or the potential for transcending such limits through more creative use of the new services. In this article, I offer a personal perspective on automatic indexing and information retrieval, focusing not necessarily on the mainstream of research but on those events and ideas over a 34-year period that have led to the view stated above, and that have influenced my perception of important directions for future research. Some experimental tests of information systems have yielded good retrieval results and some very poor results. I shall explain why I think that occurred, why I believe that the poor results merit special attention, and why we should reconsider a suggestion that Robert Fairthorne put forward in 1963 to develop postulates of impotence - statements of what cannot be done. By understanding such limits we are led to new goals, metaphors, problems, and perspectives.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39, No. 2, pp. 92-98, 1988)

SEMIAUTOMATIC DETERMINATION OF CITATION RELEVANCY: A PRELIMINARY REPORT
G. David Huffman, College of Science and Technology, University of Southern Mississippi, Hattiesburg, MS 39406.
Technology transfer, research and development, and engineering projects frequently require in-depth literature reviews. These reviews are carried out using computerized bibliographic data bases. The review and/or searching process involves keywords selected from data base thesauri. The search strategy is formulated to provide both breadth and depth of coverage and yields both relevant and nonrelevant citations. Experience indicated that about 10-20% of the citations are relevant. As a consequence, significant amounts of time are required to eliminate the nonrelevant citations. This paper describes statistically based lexical association methods which can be employed to determine citation relevance. In particular, the searcher selects relevant terms from citation-derived indexes, and this information, along with lexical statistics, is used to determine citation relevance. Preliminary results are encouraging, with the techniques providing an effective concentration of relevant citations.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 6, pp. 573-582, 1987)

COMPARING RETRIEVAL PERFORMANCE IN ONLINE DATA BASES
Katherine W. McCain, Howard D. White, and Belver C. Griffith, College of Information Studies, Drexel University, Philadelphia, PA 19104.
This study systematically compares retrievals on 11 topics across five well-known data bases, with MEDLINE's subject indexing as a focus. Each topic was posed by a researcher in the medical behavioral sciences. Each was searched in MEDLINE, EXCERPTA MEDICA, and PSYCINFO, which permit descriptor searches, and in SCISEARCH and SOCIAL SCISEARCH, which express topics through cited references. Searches on each topic were made with (1) descriptors, (2) cited references, and (3) natural language (a capability common to all five data bases). The researchers who posed the topics judged the results. In every case, the set of records judged relevant was used to calculate recall, precision, and novelty ratios. Overall, MEDLINE had the highest recall percentage (37%), followed by SSCI (31%). All searches resulted in high precision ratios; novelty ratios of data bases and searches varied widely. Differences in record format among data bases affected the success of the natural language retrievals. Some 445 documents judged relevant were not retrieved from MEDLINE using its descriptors; they were found in MEDLINE through natural language or in an alternative data base. An analysis was performed to examine possible faults in MEDLINE subject indexing as the reason for their nonretrieval. However, no patterns of indexing failure could be seen in those documents subsequently found in MEDLINE through known-item searches. Documents not found in MEDLINE primarily represent failures of coverage - articles were from nonindexed or selectively indexed journals. Recommendations to MEDLINE managers include expansion of record format and modification of journal and article selection policies.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 6, pp. 539-553, 1987)
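[Note: For readers unfamiliar with the ratios used in the preceding study, the sketch below computes recall, precision, and novelty for a single search. The judgment sets are invented, and "novelty" follows the common definition: the proportion of relevant retrieved items not already known to the requester. - Ed.]

    # Recall, precision, and novelty ratios for one search (invented data).
    relevant = {1, 2, 3, 5, 8, 13, 21, 34}    # all records judged relevant to the topic
    retrieved = {2, 3, 4, 5, 8, 40}           # records returned by this search
    already_known = {2, 8}                    # relevant records the requester already knew

    relevant_retrieved = relevant & retrieved

    recall    = len(relevant_retrieved) / len(relevant)
    precision = len(relevant_retrieved) / len(retrieved)
    novelty   = len(relevant_retrieved - already_known) / len(relevant_retrieved)

    print(f"recall={recall:.2f} precision={precision:.2f} novelty={novelty:.2f}")
    # recall=0.50 precision=0.67 novelty=0.50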
STRATEGIES FOR BUILDING DISTRIBUTED INFORMATION RETRIEVAL SYSTEMS
Ian A. Macleod, T. Patrick Martin, Brent Nordin, and John R. Phillips, Department of Computing and Information Science, Queen's University, Kingston, Ontario, Canada K7L 3N6.
In this article we discuss the need for distributed information retrieval systems. A number of possible configurations are presented. A general approach to the design of such systems is discussed. A prototype implementation is described together with the experiences gained from this implementation.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 6, pp. 511-528, 1987)

OPTIMAL BUCKET SIZE FOR MULTIATTRIBUTE RETRIEVAL IN PARTITIONED FILES
Caroline M. Eastman, Department of Computer Science, University of South Carolina, Columbia, SC 29208, USA.
The problem of optimal bucket size for multiattribute retrieval in partitioned files is considered. The query types considered include exact match queries, range queries, partial match queries, and best match (including nearest neighbor) queries. The similarities among formulas which have been derived in several different contexts are examined.
(INFORMATION SYSTEMS, Vol. 12, No. 4, pp. 375-383, 1987)

FORWARD MULTIDIMENSIONAL SEARCH WITH APPLICATION TO INFORMATION RETRIEVAL SYSTEMS
Charles X. Durand, Computer and Information Sciences, State University of New York, College at Potsdam, Potsdam, NY 13676, USA.
A new architecture for information retrieval systems is presented. If implemented, this architecture would allow the system to process retrieval statements that are equivalent to fuzzily defined queries. The philosophy on which the centerpiece of this system is based - the document search module - is fully explained in this paper. The emphasis is placed on the quick elimination of irrelevant references. A new technique was developed that takes into account the user's knowledge to discriminate between documents before they are actually retrieved from the data base. The search technique uses simple computations to select or eliminate potential candidates for retrieval.
Qualitatively, this technique avoids the shortcomings not only of conventional retrieval techniques but also of retrieval systems that accept relevance feedback from the user in order to refine the search process. No implementation details have been included in this article, and system performance figures are not discussed.
(INFORMATION SYSTEMS, Vol. 12, No. 4, pp. 363-370, 1987)

THE CD-ROM MEDIUM
David H. Davies, Project Manager, Optical Recording Project, 3M Company, 420 North Bernardo Avenue, Mountain View, CA 94043.
This article details the critical elements that make up the CD-ROM optical disc medium. These include the basic laser and drive operational mechanics, the nature of the actual disc itself, the data organization at the channel code level and at the logical file level, and aspects of the error correction and detection methods used. A brief synopsis of disc fabrication is presented. The article concludes with descriptions of advances in the technology currently on the horizon.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39, No. 1, pp. 34-42, 1988)

DESIGN CONSIDERATIONS FOR CD-ROM RETRIEVAL SOFTWARE
Edward M. Cichocki and Susan M. Ziemer, I.S. Grupe, Inc., 948 Springer Drive, Lombard, IL 60148.
The CD-ROM requires a different kind of retrieval system design from systems on magnetic media because the disc's physical characteristics and drive differ from those of magnetic media. Retrieval system designers must be concerned with ways to minimize seeks (access time), transfer large amounts of data following each seek, store data proximally, and maximize CD-ROM performance. Three methods to maximize that performance are described: single key mode, multiple key mode, and inverted file mode. Well-conceived design and well-executed retrieval systems for CD-ROM databases can result in performance that equals that of state-of-the-art online systems.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39, No. 1, pp. 43-46, 1988)
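[Note: The three access modes named above are not specified in detail in the abstract; the sketch below only illustrates the general "minimize seeks, store data proximally" principle with a toy inverted file in which each key's postings are laid out contiguously, so a lookup costs one seek and one contiguous read. File name and data are invented. - Ed.]

    # Toy contiguous-postings layout illustrating seek-minimizing design.
    import struct

    postings = {"hypertext": [3, 17, 42], "cdrom": [5, 9], "retrieval": [1, 3, 7, 42]}

    # Build the data region and an in-memory directory of (offset, count).
    directory, blob = {}, bytearray()
    for key, ids in sorted(postings.items()):
        directory[key] = (len(blob), len(ids))
        for doc_id in ids:
            blob += struct.pack("<I", doc_id)

    with open("postings.dat", "wb") as f:
        f.write(blob)

    def lookup(key):
        if key not in directory:
            return []
        offset, count = directory[key]
        with open("postings.dat", "rb") as f:
            f.seek(offset)                 # a single seek per query term
            data = f.read(4 * count)       # one contiguous transfer
        return list(struct.unpack(f"<{count}I", data))

    print(lookup("retrieval"))             # [1, 3, 7, 42]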
CD-ROM: POTENTIAL MARKETS FOR INFORMATION
Julie B. Schwerin, Info Tech, P.O. Box 633, Pittsfield, VT 05762.
With the availability of CD-ROM, users and producers of information products are confronted with a new information delivery medium having different characteristics from anything else that exists today. As this new medium is being introduced in various markets, we are discovering the difference between CD-ROM as "a new way to look at how we produce and consume information products" and "another variation on a familiar theme." From the beginning, the opportunity for CD-ROM in information markets has been characterized, a few limitations aside, as broad and rich, virtually unlimited in applications. When approached this way, CD-ROM challenges current practices of publishing and integrating information in a fundamental way. As the medium is introduced in markets today, in its very early stages, it is very limited in its application as compared with current products and represents more of a variation than a revolution in information consumption behavior. Yet as users and producers alike experiment and gain confidence in using CD-ROM, its full potential will be realized for both users and producers.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39, No. 1, pp. 54-57, 1988)

HYPERMEDIA: FINALLY HERE
Tekla S. Perry, Field Editor.
Reading a book, listening to music, watching a movie: all these traditional means of obtaining information are linear. Every reader, listener, or viewer starts at the beginning and proceeds down the same path to the predetermined ending. The thought process, however, is not linear. The mind jumps from present to past to future, thoughts linked by associations that bring up images, words, scents, and those haunting melodies that linger in your head for days. In 1965 Ted Nelson, a writer and philosopher, coined the word hypertext, which he defined simply as nonlinear reading and writing. He saw computer networks then being developed as the mechanism for hypertext information storage and access, and soon expanded his vision to embrace hypermedia - ways of conveying information that, besides text, would also incorporate sounds and images. Given the computer technology of the 1960s and 1970s, hypertext was not a workable concept, however. But several recent technological advances have sparked a new wave of interest in hypertext and hypermedia, beyond the theoretical. Today's technology can at last create a practical hypermedia system.
(IEEE SPECTRUM, pp. 38-45, 1987)

PARALLEL TEXT SEARCH METHODS
Gerard Salton and Chris Buckley.
A comparison of recently proposed parallel text search methods to alternative available search strategies that use serial processing machines suggests parallel methods do not provide large-scale gains in either retrieval effectiveness or efficiency.
(COMMUNICATIONS OF THE ACM, Vol. 31, No. 2, pp. 202-215, 1988)

FOLIOPUB: A PUBLICATION MANAGEMENT SYSTEM
Johann H. Schlichter and Leslie Jill Miller, Xerox Corporation.
In contrast to desktop publishing systems, production publishing systems, such as Xyvision, are used for large documents generated by many people. Possible application areas include in-plant publishing of technical manuals and complex reports with many types and sources of content. The tasks of writing, editing, illustrating, and page layout are usually performed by different people. Thus, production publishing requires a sophisticated publication management system that coordinates tasks, manages the data produced by the people performing these tasks, and supports processing operations. The prototype system described in this paper captures and tracks input from 15 authors and graphics specialists and enforces a uniform style on the 200-page quarterly report produced.
(IEEE COMPUTER, Vol. 21, No. 1, pp. 61-69, 1988)

OPTICAL DISKS BECOME ERASABLE
Robert P. Freese, Alphatronix Inc.
Since the computer was invented, storage and retrieval of digital information have been a major challenge. Engineers have continually sought to develop more convenient storage methods to hold more data and make the data easier to access. Today's technologies include paper, microfilm, magnetic tape, floppy disks, CD-ROM, and write-once read-many (WORM) optical disks. With a capacity of roughly 600 Mbytes, WORM disks are the closest thing users so far have to erasable optical storage. But information recorded on a WORM disk can neither be erased nor rerecorded. Although erasable optical recording has been under discussion for several years, magneto-optic technology is about to bring this capability to complete, commercial data-storage systems, including some for desktop computers and workstations. Data storage may never be the same.
(IEEE SPECTRUM, pp. 41-45, 1988)

INTERMEDIA: THE CONCEPT AND THE CONSTRUCTION OF A SEAMLESS INFORMATION ENVIRONMENT
Nicole Yankelovich, Bernard J. Haan, Norman K. Meyrowitz, and Steven M. Drucker, Brown University.
Hypermedia is simply an extension of hypertext that incorporates other media in addition to text. With a hypermedia system, authors can create a linked body of material that includes text, static graphics, animated graphics, video, and sound. A hypermedia system expressly developed for use in a university setting, Intermedia, provides a framework for object-oriented, direct manipulation editors and applications. With it, instructors can construct exploratory environments for their students as well as use applications for day-to-day work, research, and writing. Intermedia is also an environment in which programmers can develop consistent applications, using object-oriented programming techniques and reusable building blocks.
(IEEE COMPUTER, Vol. 21, No. 1, pp. 81-96, 1988)

FINDING FACTS VS. BROWSING KNOWLEDGE IN HYPERTEXT SYSTEMS
Gary Marchionini and Ben Shneiderman, University of Maryland.
For hypertext and electronic information systems to be effective, designers must understand how users find specific facts, locate fragments of text that satisfy information queries, or just browse. Users' success in information retrieval depends on the cognitive representation (mental model) of a system's features, which is largely determined by the conceptual model designers provide through the human-computer interface. Other determinants of successful retrieval include the users' knowledge of the task domain, information-seeking experience, and physical setting. In this article we present a user-centered framework for information-seeking that has been used in evaluating two hypertext systems. We then apply the framework to key design issues related to information retrieval in hypertext systems.
(IEEE COMPUTER, Vol. 21, No. 1, pp. 70-80, 1988)

CREATION AND DISTRIBUTION OF CD-ROM DATABASES FOR THE LIBRARY REFERENCE DESK
Ron J. Rietdyk, Vice President, SilverPlatter Information Services Inc., 37 Walnut Street, Wellesley Hills, MA 02181.
SilverPlatter has been delivering CD-ROM products to the library reference market since August 1986. Before that, the product was tested for about three months at a limited number of libraries. This article summarizes our experiences and gives some first observations on the use of this exciting new technology in libraries. Three important groups are discussed: information providers, librarians, and end-users in the library. All three groups have different interests and concerns. A list of the most significant advantages and objections within each group is given. The article offers ideas about how to overcome the often very real objections of the different players in this marketplace.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39, No. 1, pp. 58-62, 1988)

TOOLS AND METHODS FOR COMPUTATIONAL LEXICOLOGY
Roy J. Byrd, Nicoletta Calzolari, Martin S. Chodorow, Judith L. Klavans, Mary S. Neff, and Omneya A. Rizk, IBM T.J. Watson Research Center, Yorktown Heights, New York 10598.
This paper presents a set of tools and methods for acquiring, manipulating, and analyzing machine-readable dictionaries. We give several detailed examples of the use of these tools and methods for particular analyses. A novel aspect of our work is that it allows the combined processing of multiple machine-readable dictionaries. Our examples describe analyses of data from Webster's Seventh Collegiate Dictionary, the Longman Dictionary of Contemporary English, the Collins bilingual dictionaries, the Collins Thesaurus, and the Zingarelli Italian dictionary. We describe existing facilities and results they have produced as well as planned enhancements to those facilities, particularly in the area of managing associations involving the senses of polysemous words. We show how these enhancements expand the ways in which we can exploit machine-readable dictionaries in the construction of large lexicons for natural language processing systems.
(COMPUTATIONAL LINGUISTICS, Vol. 13, No. 3-4, pp. 219-240, 1987)
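[Note: The authors' tools are not described in enough detail in the abstract above to reproduce; the sketch below is only a toy illustration of what "combined processing of multiple machine-readable dictionaries" can mean, with invented entry formats and data. - Ed.]

    # Toy merge of two machine-readable dictionaries by headword; headwords
    # covered by both sources become candidates for cross-dictionary sense mapping.
    from collections import defaultdict

    dict_a = {   # invented, Webster-like entries
        "bank": {"pos": ["noun", "verb"], "senses": 3},
        "run":  {"pos": ["verb", "noun"], "senses": 7},
    }
    dict_b = {   # invented, learner's-dictionary-like entries with grammar codes
        "bank": {"pos": ["noun"], "grammar_codes": ["C"]},
        "set":  {"pos": ["verb"], "grammar_codes": ["T1", "I"]},
    }

    merged = defaultdict(dict)
    for source_name, source in [("dict_a", dict_a), ("dict_b", dict_b)]:
        for headword, entry in source.items():
            merged[headword][source_name] = entry

    shared = [w for w, views in merged.items() if len(views) == 2]
    print(shared)        # ['bank']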
LARGE LEXICONS FOR NATURAL LANGUAGE PROCESSING: UTILIZING THE GRAMMAR CODING SYSTEM OF LDOCE
Bran Boguraev, University of Cambridge Computer Laboratory, Corn Exchange Street, Cambridge, CB2 3QG, England.
Ted Briscoe, Department of Linguistics, University of Lancaster, Bailrigg, Lancaster LA1 4YT, England.
This article focuses on the derivation of large lexicons for natural language processing. We describe the development of a dictionary support environment linking a restructured version of the Longman Dictionary of Contemporary English to natural language processing systems. The process of restructuring the information in the machine-readable version of the dictionary is discussed. The Longman grammar code system is used to construct "theory neutral" lexical entries. We demonstrate how such lexical entries can be put to practical use by linking up the system described here with the experimental PATR-II grammar development environment. Finally, we offer an evaluation of the utility of the grammar coding system for use by automatic natural language parsing systems.
(COMPUTATIONAL LINGUISTICS, Vol. 13, No. 3-4, pp. 203-218, 1987)

ONE-PASS TEXT COMPRESSION WITH A SUBWORD DICTIONARY
Matti Jakobsson, University of Vaasa, Raastuvankatu 31, SF-65100, Vaasa, Finland.
A new one-pass technique for compressing text files is presented as a modification of the Ziv and Lempel compression scheme. The method replaces parts of words in a text by references to a fixed-size dictionary which contains the subwords of the text already compressed. An essential part of the technique is the concept of reorganization. Its purpose is to drop from the dictionary the parts which are never used. The reorganization principle is based on observations from information theory and structural linguistics. Through reorganization the method can adapt to any text file with no a priori knowledge of the nature of the text.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39, No. 4, pp. 262-269, 1988)

------------------------------

END OF IRList Digest
********************