IRList Digest            Tuesday, 17 May 1988       Volume 4 : Issue 34

Today's Topics:
   Abstract - Selected abstracts appearing in SIGIR FORUM (part 2 of 2)

News addresses are
   Internet or CSNET: fox@vtopus.cs.vt.edu
   BITNET: foxea@vtvax3.bitnet

----------------------------------------------------------------------

Date: Tue, 17 May 88 09:10:51 CDT
From: "Dr. Raghavan"
Subject: Abstracts from SIGIR Forum [Part II of II - Ed.]

[Note: this is the final part, continued from previous issue - Ed.]

                               ABSTRACTS

[Note: continued - Ed.]

ON MODELING OF INFORMATION RETRIEVAL CONCEPTS IN VECTOR SPACES
S.K.M. Wong, W. Ziarko, V.V. Raghavan, and P.C.N. Wong, Department of
Computer Science, University of Regina, Regina, Canada S4S 0A2

The Vector Space Model (VSM) has been adopted in information retrieval
as a means of coping with inexact representation of documents and
queries, and the resulting difficulties in determining the relevance of
a document relative to a given query. The major problem in employing
this approach is that the explicit representation of term vectors is
not known a priori. Consequently, earlier researchers made the
assumption that the vectors corresponding to terms are pairwise
orthogonal. Such an assumption is clearly unrealistic. Although
attempts have been made to compensate for this assumption by some
separate, corrective steps, such methods are ad hoc and, in most cases,
formally inconsistent. In this paper, a generalization of the VSM,
called the GVSM, is advanced. The developments provide a solution not
only for the computation of a measure of similarity (correlation)
between terms, but also for the incorporation of these similarities
into the retrieval process. The major strength of the GVSM derives from
the fact that it is theoretically sound and elegant. Furthermore,
experimental evaluation of the model on several test collections
indicates that its performance is better than that of the VSM.
Experiments have been performed on some variations of the GVSM, and all
these results have also been compared to those of the VSM, based on
inverse document frequency weighting. These results and some ideas for
the efficient implementation of the GVSM are discussed.

(ACM TRANSACTIONS ON DATABASE SYSTEMS, Vol. 12, No. 2, pp. 299-321,
1987)
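[Note: As a rough illustration of the contrast between the VSM and a
generalized model in which term-term correlations enter the similarity
computation, a minimal Python sketch follows. The toy term-document
matrix and the cosine-based correlation estimate are assumptions made
here for illustration; they are not the construction used in the paper
above. - Ed.]

    import numpy as np

    # Rows are terms, columns are documents (toy data for illustration).
    A = np.array([[2.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 2.0, 1.0]])

    # Term-term correlation matrix estimated from co-occurrence; with
    # G = I this collapses to the usual VSM assumption that the term
    # vectors are pairwise orthogonal.
    G = A @ A.T
    G = G / np.sqrt(np.outer(np.diag(G), np.diag(G)))

    d = A[:, 0]                    # a document vector
    q = np.array([0.0, 1.0, 1.0])  # a query vector

    vsm_score = d @ q              # terms treated as orthogonal
    gvsm_score = d @ G @ q         # correlated terms also contribute

    print(vsm_score, gvsm_score)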
TERM CO-OCCURRENCE IN CITED/CITING JOURNAL ARTICLES AS A MEASURE OF
DOCUMENT SIMILARITY
Donna Trivison, 1453 Elbur Avenue, Lakewood, OH 44107

Term co-occurrences were measured in pairs of cited/citing research
articles selected over the period from 1971 to 1983 from a core
literature in the field of information science. A consistent pattern of
term similarity was observed in these article pairs. In contrast,
document similarity was extremely low in randomly paired articles
selected from the same core data base. In 77% of cited/citing articles,
there were more co-occurrences of significant terms than there were in
87% of the same articles paired randomly. The study served to quantify
terminology-relatedness. A comparison of the similarity of cited/citing
literature of various ages resulted in an indication of the amount of
new terminology entering the field. And, because a clear delineation
was achieved between the similarity of cited/citing articles and the
similarity of non-cited/citing articles, the results were extended to
define an expected success rate of a matching procedure in one context
of information retrieval.

(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 3, pp. 183-194,
1987)

KNOWLEDGE-SPARSE AND KNOWLEDGE-RICH LEARNING IN INFORMATION RETRIEVAL
Roy Rada, National Library of Medicine, Bethesda, MD 20894

This paper reviews some aspects of the relationship between the large
and growing fields of machine learning (ML) and information retrieval
(IR). Learning programs are described along several dimensions. One
dimension refers to the degree of dependence of an ML + IR program on
users, thesauri, or documents. This paper emphasizes the role of the
thesaurus in ML + IR work. ML + IR programs are also classified along a
dimension that extends from knowledge-sparse learning at one end to
knowledge-rich learning at the other. Knowledge-sparse learning depends
largely on user yes-no feedback or on word frequencies across documents
to guide adjustments in the IR system. Knowledge-rich learning depends
on more complex sources of feedback, such as the structure within a
document or thesaurus, to direct changes in the knowledge bases on
which an intelligent IR system depends. New advances in computer
hardware make the knowledge-sparse learning programs that depend on
word occurrences in documents more practical. Advances in artificial
intelligence bode well for knowledge-rich learning.

(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 3, pp. 195-210,
1987)
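[Note: As a concrete illustration of the "knowledge-sparse" learning
described in the abstract above (adjusting a query from nothing more
than the user's yes/no relevance judgements), here is a minimal
Rocchio-style sketch in Python. The weighting constants and toy vectors
are assumptions made for illustration; the paper surveys such methods
rather than prescribing this particular formula. - Ed.]

    import numpy as np

    def adjust_query(q, relevant, nonrelevant,
                     alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query toward judged-relevant documents and away
        from judged-nonrelevant ones, using only yes/no feedback."""
        q_new = alpha * q
        if relevant:
            q_new += beta * np.mean(relevant, axis=0)
        if nonrelevant:
            q_new -= gamma * np.mean(nonrelevant, axis=0)
        return np.clip(q_new, 0.0, None)   # keep weights non-negative

    q = np.array([1.0, 0.0, 1.0, 0.0])
    relevant = [np.array([1.0, 1.0, 0.0, 0.0])]
    nonrelevant = [np.array([0.0, 0.0, 1.0, 1.0])]
    print(adjust_query(q, relevant, nonrelevant))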
KNOWLEDGE RESOURCE TOOLS FOR ACCESSING LARGE TEXT FILES
Donald E. Walker, Artificial Intelligence and Information Science
Research, Bell Communications Research, 435 South Street MRE 2A379,
Morristown, NJ 07960

This paper provides an overview of a research program just being
defined at Bellcore. The objective is to develop facilities for working
with large document collections that provide more refined access to the
information contained in these "source" materials than is possible
through current information retrieval procedures. The tools being used
for this purpose are machine-readable dictionaries, encyclopedias, and
related "resources" that provide geographical, biographical, and other
kinds of specialized knowledge. A major feature of the research program
is the exploitation of the reciprocal relationships between sources and
resources. These interactions between texts and tools are intended to
support experts who organize and use information in a workstation
environment. Two systems under development will be described to
illustrate the approach: one providing capabilities for full-text
subject assessment, the other for concept elaboration while reading
text. Progress in the research depends critically on developments in
artificial intelligence, computational linguistics, and information
science to provide a scientific base, and on software engineering,
database management, and distributed systems to provide the technology.

(PROCEEDINGS OF THE FIRST CONFERENCE OF THE UNIVERSITY OF WATERLOO
CENTER FOR THE NEW OXFORD ENGLISH DICTIONARY, Waterloo, Canada, pp.
11-24, November, 1985)

PICTURES OF RELEVANCE: A GEOMETRIC ANALYSIS OF SIMILARITY MEASURES
William P. Jones, Microelectronics and Computer Technology Corporation,
P.O. Box 200195, Austin, Texas 78720, and George W. Furnas, Bell
Communications Research, 435 South Street, Morristown, NJ 07960

We want computer systems that can help us assess the similarity or
relevance of existing objects (e.g., documents, functions, commands,
etc.) to a statement of our current needs (e.g., the query). Towards
this end, a variety of similarity measures have been proposed. However,
the relationship between a measure's formula and its performance is not
always obvious. A geometric analysis is advanced and its utility
demonstrated through its application to six conventional information
retrieval similarity measures and a seventh spreading activation
measure. All seven similarity measures work with a representational
scheme wherein a query and the database objects are represented as
vectors of term weights. A geometric analysis characterizes each
similarity measure by the nature of its iso-similarity contours in an
n-space containing query and object vectors. This analysis reveals
important differences among the similarity measures and suggests
conditions in which these differences will affect retrieval
performance. The cosine coefficient, for example, is shown to be
insensitive to between-document differences in the magnitude of term
weights, while the inner product measure is sometimes overly affected
by such differences. The context-sensitive spreading activation measure
may overcome both of these limitations and deserves further study. The
geometric analysis is intended to complement, and perhaps to guide, the
empirical analysis of similarity measures.

(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 6, pp. 420-442, 1987)
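[Note: The contrast drawn in the abstract above between the cosine
coefficient and the inner product can be checked directly: two
documents with identical term-weight profiles but different overall
magnitudes receive the same cosine score against a query, while their
inner product scores differ. The toy vectors in the following Python
sketch are assumptions made for illustration. - Ed.]

    import numpy as np

    def inner(d, q):
        return float(d @ q)

    def cosine(d, q):
        return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q)))

    q  = np.array([1.0, 1.0, 0.0])
    d1 = np.array([2.0, 1.0, 1.0])
    d2 = 5.0 * d1                  # same profile, five times the weight

    print(inner(d1, q), inner(d2, q))    # 3.0 vs 15.0: magnitude-sensitive
    print(cosine(d1, q), cosine(d2, q))  # identical: magnitude-invariant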
I3R: A NEW APPROACH TO THE DESIGN OF DOCUMENT RETRIEVAL SYSTEMS
W.B. Croft and R.H. Thompson, Department of Computer and Information
Science, University of Massachusetts, Amherst, MA 01003

The most effective method of improving the retrieval performance of a
document retrieval system is to acquire a detailed specification of the
user's information need. The system described in this article, I3R,
provides a number of facilities and search strategies based on this
approach. The system uses a novel architecture to allow more than one
system facility to be used at a given stage of a search session. Users
influence the system actions by stating goals they wish to achieve, by
evaluating system output, and by choosing particular facilities
directly. The other main features of I3R are an emphasis on domain
knowledge used for refining the model of the information need, and the
provision of a browsing mechanism that allows the user to navigate
through the knowledge base.

(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 6, pp. 389-404, 1987)

HYPERTEXT: AN INTRODUCTION AND SURVEY
Jeff Conklin, Microelectronics and Computer Technology Corp., P.O. Box
200195, Austin, TX 78720

As workstations grow cheaper, more powerful, and more available, new
possibilities emerge for extending the traditional notion of "flat"
text files by allowing more complex organizations of the material.
Mechanisms are being devised which allow direct machine-supported
references from one textual chunk to another; new interfaces provide
the user with the ability to interact directly with these chunks and to
establish new relationships between them. These extensions of the
traditional text fall under the general category of hypertext (also
known as nonlinear text). This article is a survey of existing
hypertext systems, their applications, and their design. It is both an
introduction to the world of hypertext and, at a deeper cut, a survey
of some of the most important design issues that go into fashioning a
hypertext environment.

(COMPUTER, Vol. 20, No. 9, pp. 17-42, 1987)

PARALLEL QUERYING OF LARGE DATABASES: A CASE STUDY
Harold S. Stone, IBM T.J. Watson Research Center

Parallelism by itself does not necessarily lead to higher speed. In the
case study presented here, the parallel algorithm was far less
efficient than a good serial algorithm. The study does, however, reveal
how best to use parallelism: run the more efficient serial algorithm in
a parallel manner. The case study extends the work of Stanfill and
Kahle, who presented an algorithm for high-speed querying of a large
database. They demonstrated the use of a parallel program running on a
16,000-processor Connection Machine and obtained estimates for the
running time of the algorithm on a 64K-processor system with queries
made against a very large database of Reuters news releases. Their
results show that the throughput for parallel query analysis is high in
an absolute sense. But they did not provide a performance analysis of
speedup or other aspects of algorithmic behavior that would reveal what
factors of machine and algorithm design contribute most strongly to the
performance. This article provides that analysis.

(COMPUTER, Vol. 20, No. 10, pp. 11-12, 1987)
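[Note: The point made in the abstract above, that the best use of
parallelism here is to run an efficient serial algorithm independently
on partitions of the collection rather than a less efficient inherently
parallel algorithm, can be sketched as follows. The tiny corpus, the
scoring function, and the use of a Python process pool are assumptions
made for illustration; they are not the case study's actual
implementation. - Ed.]

    from collections import defaultdict
    from concurrent.futures import ProcessPoolExecutor

    def score_partition(args):
        """Serial inverted-index scoring of one partition."""
        docs, query_terms = args
        index = defaultdict(list)          # term -> [doc_id, ...]
        for doc_id, text in docs:
            for term in text.split():
                index[term].append(doc_id)
        scores = defaultdict(int)
        for term in query_terms:
            for doc_id in index.get(term, []):
                scores[doc_id] += 1
        return dict(scores)

    if __name__ == "__main__":
        collection = [(0, "parallel query processing"),
                      (1, "database systems survey"),
                      (2, "parallel database machines"),
                      (3, "text retrieval methods")]
        partitions = [collection[0::2], collection[1::2]]
        query = ["parallel", "database"]
        with ProcessPoolExecutor(max_workers=2) as pool:
            partial = pool.map(score_partition,
                               [(p, query) for p in partitions])
        merged = {}
        for part in partial:
            merged.update(part)            # doc ids disjoint by partition
        print(sorted(merged.items()))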
HISTORICAL NOTE: A PERSONALIZED HISTORY OF OCLC
Frederick G. Kilgour, Founder Trustee, OCLC Online Computer Library
Center, Inc., Dublin, Ohio

(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 5, pp. 381-384, 1987)

HISTORICAL NOTE: THE PAST THIRTY YEARS IN INFORMATION RETRIEVAL
Gerard Salton, Department of Computer Science, Cornell University,
Ithaca, New York 14853

The documentation literature of the 1950s is reviewed briefly, and some
early text processing endeavors are discussed. Various predictions made
in 1960 by Mooers about the creative role of computers in information
retrieval are then considered, and an attempt is made to explain why
some of the more exciting predictions have not been fulfilled.
Conclusions are drawn concerning the limits of computer power in text
retrieval applications.

(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 5, pp. 375-380, 1987)

HISTORICAL NOTE: INFORMATION SCIENCE AND TECHNOLOGY: FROM COORDINATE
INDEXING TO THE GLOBAL BRAIN
Cloyd Dake Gull, 8 Pimlico Court, Silver Spring, MD 20906

(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 5, pp. 338-366, 1987)

HISTORICAL NOTE: SHINING PALACES, SHIFTING SANDS: NATIONAL INFORMATION
SYSTEMS
Harold Wooster, Senior Information Scientist (Retired), Lister Hill
National Center for Biomedical Communications, National Library of
Medicine, Department of Health and Human Services, Bethesda, MD 20894

This article discusses post-Sputnik national information systems under
three major headings. Shifting Sands examines the false assumptions
that the Soviets were first in space because of the superiority of
their educational system and of their scientific and technical
information system, VINITI. The Shining Palaces lists as appendixes 31
reports since 1958 which propose various forms of a national
information system, and analyzes 30 national plans; the author does not
presume to favor any of them. In Solid Rock-The Ugly Houses the author
lists in an appendix the involvement of the federal government with
scientific and technical information since the first patent act of
1790, and discusses what he thinks should be done for the users of a
national system, the role of technical documentary reports, project
information systems, and scientific journals. The Summary and
Conclusions starts with three quotations, written 22 years apart, which
show that nothing has changed in over two decades. In a Personal Note
the author summarizes his forty-year career as an information
scientist.

(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 5, pp. 321-335, 1987)

------------------------------

END OF IRList Digest
********************