Date: Sun, 27 Oct 85 19:48 EST To: irdis at vpi Subject: IRList Digest V1 #17 Reply-To: IRList@vpi.csnet IRList Digest Sunday, 27 Oct 1985 Volume 1 : Issue 17 Today's Topics: References - Abstracts of Talks at ACM SIGIR 85 in Montreal ---------------------------------------------------------------------- From: "V.J. Raghavan" Date: Tue, 27 Aug 85 16:02:23 cst Subject: Conf. '85 [This year's Montreal Conference on IR had many interesting papers. A file, in "SMART" form, of the abstracts, was typed in at Univ. of Regina and edited at Virginia Tech. Order info for proceedings is: Proc. of the Eighth Annual Int. ACM SIGIR Conf. on R&D in Inf. Ret., ACM Order No. 606850 -- Ed] .I 1 .T Graphical Information Resources: Maps and Beyond .A MICHAEL LESK .W The rise of computer graphics offers a new challenge for information retrieval: how to search and retrieve information which is partly or wholly graphical. As an example, procedures for handling geographical information, such as street maps and directories, are explained. With this data, it is possible to find routes on maps, retrieve locations and names of people or businesses, and draw maps. But a comparison of these programs with programs for face processing or computer typesetting makes clear how far we are from general-purpose routines. Today, successful graphics routines contain a great deal of local domain knowledge. There is no analog of the simple keyword systems that handle textual documents in any subject area. Just as computational linguists have found that subject matter expertise is necessary to do really sophisticated processing of English, it seems also necessary for sophisticated processing of pictures; the difference is that we don't know how to do unsophisticated processing of graphics. .I 2 .T Implications of Boolean Structure for Probabilistic Retrieval .A ABRAHAM BOOKSTEIN .W The purpose of an information retrieval system is to retrieve documents in response to a request.
However, there are many strategies as to how this can be accomplished. Most popular are the Boolean systems attached to the commercial on-line database systems. Less conspicuous are a number of approaches being developed in university research laboratories, which are the subject of this paper. This paper is divided into two parts: The first will be a more or less nontechnical overview of research in IR over the last ten to fifteen years. The second part, which is a bit more technical and a lot more speculative, will cover some interesting questions that are now being studied. .I 3 .T Generalized Vector Space Model In Information Retrieval .A S. K. M. WONG WOJCIECH ZIARKO PATRICK C. N. WONG .W In information retrieval, it is common to model index terms and documents as vectors in a suitably defined vector space. The main difficulty with this approach is that the explicit representation of term vectors is not known a priori. For this reason, the vector space model adopted by Salton for the SMART system treats the terms as a set of orthogonal vectors. In such a model it is often necessary to adopt a separate, corrective procedure to take into account the correlations between terms. In this paper, we propose a systematic method (the generalized vector space model) to compute term correlations directly from the automatic indexing scheme. We also demonstrate how such correlations can be included with minimal modification in existing vector-based information retrieval systems. The preliminary experimental results obtained from the new model are very encouraging. .I 4 .T Handling Multiple Data Bases in Document Retrieval .A IAN A. MACLEOD .W There is no such thing as a standard document. Bibliographic information comes in a wide variety of formats. Existing retrieval systems handle different document styles either by creating an artificial document type or by providing different and independent data bases. Neither approach seems satisfactory.
In this paper we describe a data model which we feel is more appropriate for document representation and show it can handle the multiple document type problem quite naturally. .I 5 .T Theoretical Measures in P/Q Document Spaces .A ROBERT R. KORFHAGE .W Early work on the use of profiles to aid in the interpretation of queries in an information retrieval system developed the concepts of retrieval subspaces that are either ellipsoids or Cassini ovals. This report extends the theoretical basis of these models, exploring other models and the effect of different metrics on the conceptual retrieval subspace. .I 6 .T Composite Document Extended Retrieval - An Overview .A EDWARD A. FOX .W 'Composite document' is the name given herein to items which possess some complex internal structure and are made up of a variety of types of data, including a significant amount of textual material. Examples are: elaborate bibliographic records, office correspondence or reports, electronic mail, digests from computer bulletin boards, entries in online name servers for network users, books, and parts of some expert system knowledge bases. Investigation of composite documents is being done in connection with research in passage retrieval, database management, office modeling of documents, natural language analysis, and information retrieval. Preliminary results indicate that extending the vector space model leads to better document representation, and that extending standard Boolean logic query capabilities results in more effective searches. Further study with a variety of specially prepared test collections will aid evaluation of the several models and methods proposed. .I 7 .T Automatic Assignment of Soft Boolean Operators .A GERARD SALTON ELLEN VOORHEES .W The conventional bibliographic retrieval systems are based on Boolean query formulations and inverted file implementations. 
Such systems provide rapid responses to search queries, but they are not easy for uninitiated patrons to use. An extended Boolean retrieval strategy has been devised in which the Boolean operators are treated more or less strictly, depending on the setting of a special parameter, known as the p-value. The extended system is much more forgiving than the conventional system, and provides better retrieval effectiveness. In this study various problems associated with the determination of appropriate p-values are discussed, and suggestions are made for an automatic assignment of p-values. Evaluation output is included to illustrate the operations of the suggested procedures. .I 8 .T Output Ranking Methodology for Document-Clustering-Based Boolean Retrieval Systems .A TADEUSZ RADECKI .W During the last two decades or so, several alternative approaches to document clustering have been introduced, each of which constitutes a basis for a number of various clustering methods. Unfortunately, these methods are limited in many ways. In particular, most of the available document clustering procedures have been designed to be applicable in cases where both documents and user queries are represented by sets of weighted or unweighted index terms. However, in commercial document retrieval systems queries are usually characterized by Boolean expressions of variables corresponding to the index terms of interest to the system's users. Therefore, it would be desirable to extend the conventional approaches to document clustering with the aim of implementing them in operational environments. In developing these extended approaches, one of the most important problems to be solved appears to be related to the system's adaptation to individual users.
More specifically, in the course of such an adaptation the system should utilize all the information available about the user's needs, and document and query representations should be modified accordingly, so that the most relevant documents could be retrieved. This paper presents an account of the author's most recent research along these lines. After providing a background for widely known approaches to document clustering, rigorous reasoning for extending these approaches to a conventional Boolean retrieval framework is given. .I 9 .T The New Oxford English Dictionary and its Potential Users: Some Preliminary Comments .A JOHN STUBBS .W On May 15, 1984, the Oxford University Press formally announced its intention to computerize the Oxford English Dictionary. The OED and its Supplement, which comprise approximately 60 million words, will be entered into a computer; a database design will be created for this integrated version of the Dictionary; and, in addition to its printed form, the Dictionary will be available in electronic form. .I 10 .T Processing Free-Text Input to Obtain a Database of Medical Information .A EMILE C. CHI CAROL FRIEDMAN NAOMI SAGER MARGARET S. LYMAN .W The Linguistic String Project of New York University has developed computer programs that convert the information in free-text documents of a technical specialty into a structured form suitable for mapping into a relational database. The processing is based upon the restrictions on the use of language that are characteristic of the subject matter and the document type. These restrictions are summarized in a "sublanguage grammar" that provides a set of word classes and formulas corresponding to the objects and relations of interest in the domain. The programs are independent of the particular sublanguage grammar employed. The application to narrative patient records will be described and the applicability of the methods to other domains discussed.
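[The sublanguage-grammar idea in the abstract above can be illustrated with a small sketch. This is not the Linguistic String Project's grammar or code; the word classes, the sample sentence, and the tuple format below are invented purely for illustration of the general idea: domain word classes plus simple formulas map free text into relational rows. -- Ed]

```python
# Toy sublanguage-grammar sketch (hypothetical, for illustration only):
# assign domain word classes to tokens, then map a free-text clinical
# sentence into (relation, value, polarity) rows suitable for a table.

import re

# Hypothetical word classes for a tiny patient-record sublanguage.
WORD_CLASSES = {
    "fever": "SYMPTOM",
    "cough": "SYMPTOM",
    "aspirin": "MEDICATION",
    "penicillin": "MEDICATION",
    "denies": "NEGATION",
    "reports": "ASSERTION",
}

def parse_sentence(sentence):
    """Map one sentence to (relation, value, polarity) rows via word classes."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    polarity = "present"          # default until a NEGATION word is seen
    rows = []
    for tok in tokens:
        cls = WORD_CLASSES.get(tok)
        if cls == "NEGATION":
            polarity = "absent"
        elif cls == "ASSERTION":
            polarity = "present"
        elif cls in ("SYMPTOM", "MEDICATION"):
            rows.append((cls, tok, polarity))
    return rows

rows = parse_sentence("Patient denies fever but reports cough; taking aspirin.")
# rows == [('SYMPTOM', 'fever', 'absent'), ('SYMPTOM', 'cough', 'present'),
#          ('MEDICATION', 'aspirin', 'present')]
```

Because the mapping logic is separated from the word-class table, the same skeleton works for a different domain by swapping in another table, which mirrors the abstract's claim that the programs are independent of the particular sublanguage grammar employed.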
.I 11 .T An Approach to Multikey Sequencing in an Equiprobable Keyterm Retrieval Situation .A B. G. T. LOWDEN .W We present a simple generalized technique for sequencing a multi-attribute file which can be used in a situation where the query pattern is unknown and the term content equiprobable. The method is based on constructing a short spanning path through the records, thereby minimizing the sum of the Hamming distances between them. Retrieval performance is simulated over a range of query expressions and results suggest a significant reduction in the number of block accesses using this technique as compared with records randomly distributed over the file space. .I 12 .T Optimization of Inverted Vector Searches .A CHRIS BUCKLEY ALAN F. LEWIT .W A simple algorithm is presented for increasing the efficiency of information retrieval searches which are implemented using inverted files. This optimization algorithm employs knowledge about the methods used for weighting document and query terms in order to examine as few inverted lists as possible. An extension to the basic algorithm allows greatly increased performance optimization at a modest cost in retrieval effectiveness. Experimental runs are made examining several different term weighting models and showing the optimization possible with each. .I 13 .T P-Trees: Storage Efficient Multiway Trees .A DAVID M. ARNOW AARON M. TENENBAUM CONNIE WU .W A new variation of high order multiway tree structures, the P-tree, is presented. P-trees have average access costs that are significantly better than those of B-trees and are no worse (and often better) in storage utilization. Unlike compact B-trees, they can be maintained dynamically, and unlike dense multiway trees and B-trees, their associated insertion algorithm, which is also presented, is cheap and involves (at most) a very localized rearrangement of keys. .I 14 .T Efficient Variants of Huffman Codes in High-Level Languages .A Y. CHOUEKA S. T. KLEIN T.
PERL .W Although it is well known that Huffman codes are optimal for text compression in a character-per-character encoding scheme, they are seldom used in practical situations since they require a bit-by-bit decoding algorithm, which has to be written in some assembly language and will perform rather slowly. A number of methods and some slight variants are presented here that avoid these difficulties. The decoding algorithms efficiently process the encoded string on a byte-per-byte basis, and can be programmed in any high-level language. This is achieved at the cost of reducing the net compression savings by a negligible amount, and storing some tables in the internal memory. Experimental results are presented. .I 15 .T Optimization of a Hierarchical File Organization for Spelling Correction .A TETSURO ITO CLEMENT T. YU .W The application of a hierarchically organized file to spelling check and correction problems seems promising, since it can correct more than the common typing mistakes. However, retrieval of records exactly matching the input is rather slow, and the time for computing similarity values is not negligible. Here some techniques to overcome these defects are presented, along with an approximate estimation of the improved file search time. .I 16 .T Designing an Information Retrieval Interface Based on User Characteristics .A CHRISTINE L. BORGMAN DONALD O. CASE CHARLES T. MEADOW .W With the increasing number of information retrieval systems and databases available and the increasing demand from end users to use these systems, a need exists for improved training mechanisms. This paper reports on a project to develop an integrated online instruction and assistance system to be used as a "front end" to the U.S. Department of Energy's RECON retrieval system.
The conceptual framework for the interface design will be based on individual characteristics of its current and prospective users, predominantly scientists conducting energy research. We will build a prototype based on information gained from interviews with scientists using the system (either directly or through a search intermediary) and interviews with search intermediaries. In addition, we will outline a separate interface tailored to search intermediary needs, based on interviews and on known differences between end users and intermediaries in information seeking behavior and conceptual views of information systems. The paper reports on research in progress through the initial interviews. The remaining parts of the project are outlined. .I 17 .T Different levels of expertise for an expert system in information retrieval .A B. DEFUDE .W The aim of this article is to describe the specifications of an Information Retrieval Expert System (IRES). Expert systems are a recent approach that seems well suited to building efficient and user-friendly systems. In fact, the capabilities offered by classical Information Retrieval Systems (IRS) are not adequate. This inadequacy comes from some important restrictions: - a limited indexing vocabulary, very often designed a priori; - the search terms have to belong to this vocabulary, and only some simple semantic relations (such as synonymy) can be used during the search. Consequently, if the terms used in the query are not those used for indexing, we do not obtain a good answer. .I 18 .T One-Time Complete Indexing of Text: Theory and Practice .A RAYMOND J. D'AMORE CLINTON P. MAH .W N-gram indexing provides a significant alternative to keyword indexing and full-text scanning methods. N-grams are sets of overlapping word fragments selected for indexing according to information-theoretic considerations. They provide complete, fixed indexing of all words, numbers, and special terms in text.
The behavior of such an n-gram system can be statistically modeled and validated over a wide range of text. The model provides a descriptive and predictive tool for controlling precision and recall during retrievals and scaling estimates of relevance to an adaptive, reference noise distribution for a given collection. Partial inversion and simple data compression are some of the capabilities which make n-gram systems performance-competitive with other approaches. .I 19 .T Experiments with cited titles for automatic document indexing and similarity measure in a probabilistic context .A K. L. KWOK .W Most people will agree that the ultimate goal of an information retrieval (IR) system would be to incorporate as many artificial intelligence techniques as possible so that the system would be able to reply to queries with specific answers (so-called fact retrieval). Current capabilities for large-scale, general-purpose retrieval, however, are still at the document retrieval level, namely retrieving documents that most probably contain answers to a query. This may not necessarily be a transient stage at all, as it has been pointed out that document retrieval capability may serve as a pre-processing stage to the more detailed fact retrieval, even if the problems associated with the latter are assumed solved. .I 20 .T A Learning Algorithm Applied to Document Redescription .A MICHAEL D. GORDON .W Relevance feedback in document retrieval usually involves an inquirer revising a query to obtain better retrieval effectiveness [Rocchio] and [Salton]. The revised query is adjusted to correspond to the descriptions of known documents which the inquirer has found relevant. Several problems exist with such methods, however, and will be described in this paper. .I 21 .T The Cluster Hypothesis Revisited .A ELLEN M. VOORHEES .W A new means of evaluating the cluster hypothesis is introduced and the results of such an evaluation are presented for four collections.
The results of retrieval experiments comparing a sequential search, a cluster-based search, and a search of the cluster collection in which individual documents are scored against the query are also presented. These results indicate that while the absolute performance of a search on a particular collection is dependent on the pairwise similarity of the relevant documents, the relative effectiveness of clustered retrieval versus sequential retrieval is independent of this factor. However, retrieval of entire clusters in response to a query usually results in poorer performance than retrieval of individual documents from clusters. .I 22 .A C. T. YU Y. T. WANG C. H. CHEN .W It is well known [Salt, LaYu] that the retrieval of documents stored in clusters allows queries to be answered efficiently. Furthermore, clustered retrieval may sometimes gain in retrieval effectiveness over relevance term weighting [Crof]. It has also been shown [VanR] that clustering of terms, when used properly (e.g. in the tree dependence model [HaVa]) will gain significant improvement over retrieval processes based on the independence of terms [RoSp, YuSa]. Although numerous clustering algorithms [Salt, VanR, Ever] have been proposed and tested, the following problems remain: 1) the clustering algorithms are extremely expensive to use if the number of objects (documents or terms) is large; 2) when the user environment changes, it is necessary to perform reclustering; 3) most clustering algorithms cluster objects based on syntactic relationships between the objects only. Information which is extractable from previous users is ignored. .I 23 .T Concepts of the Cover Coefficient-Based Clustering Methodology .A FAZLI CAN ESEN A. OZKARAHAN .W Document clustering has several unresolved problems.
Among them are high time and space complexity, difficulty of determining similarity thresholds, order dependence, nonuniform document distribution in clusters, and arbitrariness in the determination of various cluster initiators. To overcome these problems to some degree, the cover coefficient-based clustering methodology has been introduced. This methodology has led to certain new concepts, relationships, and measures, such as the effect of indexing on clustering. These new concepts are discussed. The results of performance experiments that show the effectiveness of the clustering methodology and the matching function are also included. In these experiments, it has also been observed that the majority of the documents obtained in a search are concentrated in a few clusters containing a low percentage of the documents of the database. .I 24 .T The LIVE Project - Retrieval Experiments Based on Evaluation Viewpoints .A P. BOLLMANN F. JOCHUM U. REINER V. WEISSMANN H. ZUSE .W The LIVE-project (Leistungsbewertung von Information Retrieval VErfahren) at the Technische Universitaet Berlin, West Germany, is concerned with the evaluation of information retrieval systems. Two areas are mainly under investigation. The first is the methodological foundations of retrieval experiments. Results of the LIVE-project can be seen in three different areas: measurement-theoretical criteria for the application of similarity and evaluation measures in retrieval experiments have been considered and developed. Further work has been done in the application of statistical principles of experimental design in information retrieval. In particular, control structures of factors in retrieval tests have been investigated, as well as some aspects of statistical models for experimentation in information retrieval.
The other topic in the LIVE project is the conduct of a retrieval experiment in cooperation with FIZ4, an information service center for mathematics, physics, and energy in Karlsruhe, West Germany. Among the many databases FIZ4 offers, the LIVE-project used the physics database in its retrieval experiment. .I 25 .T On Retrieval Tests with an Inhomogeneous Query Collection .A FRIEDBERT JOCHUM .W Retrieval tests are often performed with an inhomogeneous query collection; that is, certain properties of the queries vary within the collection. This paper investigates the conditions under which this inhomogeneity may lead to a loss of validity of retrieval tests and should be controlled by an experimental design. In particular, the interrelationship between the experimental design of a retrieval test, the inhomogeneity of the query collection used, and the measurement-theoretical properties of the evaluation measure used will be discussed. .I 26 .T The Impact of VLSI Generalized Associative Memory Devices on the Intelligence and Architecture of Computer Systems .A D. R. MCGREGOR J. R. MALONE M. HENNING .W An intelligent system must function at different time-scales: - fast associative recall of previously encountered information; - a more complete but slower search for hypotheses, checks for consistency, and so on. We must not expect automatic systems to do these in seconds. This paper describes a novel database system which uses a new type of associative memory (now being implemented in VLSI) to improve its speed of response. .I 27 .T A Testbed for Information Retrieval Research: The Utah Retrieval System Architecture .A LEE A. HOLLAAR .W The Utah Retrieval System Architecture provides an excellent testbed for the development and testing of new algorithms or techniques for information retrieval.
URSA is a message-based structure capable of running on a variety of system configurations, ranging from a single mainframe processor to a system distributed across a number of dissimilar processors. It can readily support a variety of specialized backend processors, such as high-speed printers. The architecture divides the components of a text retrieval system into two classes: servers and clients. A triple of servers (index, search, and document access) for each database provides the capabilities normally associated with a retrieval system. Possible clients for these servers include a window-based user interface, whose query language can be easily modified, a connection to a mainframe host processor, or AI-based query modification programs that wish to use the database. Any module in the system can be replaced by a new module using a different algorithm as long as the new module complies with the message formats for that function. In fact, with some care this module switch can occur while the system is running, without affecting the users. A monitor program collects statistics on all system messages, giving information regarding query complexity, processing time for each module, queuing times, and bandwidths between every module. This paper discusses the background of URSA and its structure, with particular emphasis on the features that make it a good testbed for information retrieval techniques. .I 28 .T An Integrated Hierarchical File Organization For Data Selection and Retrieval .A SAU-LAN LEE .W This paper presents an approach, the IHF approach, to organizing data of a hierarchical structure into a single file while maintaining adequate representation of the relationships among the data.
The IHF approach requires that each data item be stored in the basic form of a name-value pair and consists in assigning each data item an identifier which provides not only the data item identification but also an implicit identification of the hierarchical structural relationships among groups of data items. Data selection and retrieval operations are effectively implemented with the concept of masking of relevant identifier bits, without resorting to any linking devices. Application of the IHF technique to CAFS is examined. .I 29 .T RUBRIC: An Environment for Full Text Information Retrieval .A RICHARD M. TONG VICTOR N. ASKMAN JAMES F. CUNNINGHAM CARL J. TOLLANDER .W This paper describes an ongoing research project that is concerned with developing a computer-based aid for retrieval from unformatted full-text databases. Unlike other attempts to improve upon Boolean keyword retrieval systems, our research concentrates on providing an easily used rule-based language for expressing retrieval concepts. This language draws upon work both in artificial intelligence and the theories of evidential reasoning to allow the user to construct queries that have better precision and recall than more traditional forms. The paper includes a discussion of our formal model of retrieval, the rule language, and some comparative performance results. .I 30 .T ANNOD - A Navigator of Natural-language Organized (Textual) Data .A ROBERT E. WILLIAMSON .W ANNOD is the name of a system developed at the National Library of Medicine (NLM), which implements a set of linguistic and empirical techniques that permit retrieval of natural language information in response to natural language queries. The system is based on Dr. Gerard Salton's SMART document retrieval system and is presently implemented on a mini-computer as part of an Interactive Text Management System, ITEMS.
Actual experience with retrieval of information from NLM's Hepatitis Knowledge Base (HKB), an encyclopedic, hierarchical, full-text file, is presented. The techniques used in ANNOD include: automatic stemming of words, common word deletion, thesaurus expansion, a complex empirical matching (ranking) algorithm (similarity measure), and techniques expressly designed to permit rapid response in a mini-computer environment. Preliminary testing demonstrates high efficiency in identifying portions of a text which are relevant to users. .I 31 .T The User's Mental Model of an Information Retrieval System .A CHRISTINE L. BORGMAN .W An empirical study was performed to train naive subjects in the use of a prototype Boolean logic-based information retrieval system on a bibliographic database. Subjects were undergraduates with little or no prior computing experience. Subjects trained with a conceptual model of the system performed better than subjects trained with procedural instructions, but only on complex, problem-solving tasks. Performance was equal on simple tasks. Differences in patterns of interaction with the system (based on a stochastic process model) showed parallel results. Most subjects were able to articulate some description of the system operation, but few articulated a model similar to the card catalog analogy provided in training. Eleven of 43 subjects were unable to achieve minimal competency in system use. The failure rate was equal between training conditions and genders; the only differences found between those passing and failing the benchmark test were in academic major and frequency of library use. .I 32 .T A Study of the Relationship Between User Profiles and User Queries .A MICHAEL A.
SHEPHERD .W This research investigates the relationship between a user profile and a user query by examining the degree of similarity of a user profile and a query and the degree of similarity between the sets of documents retrieved by the profile and the query, respectively. This relationship will form the basis for the integration of a profile and a query into a single enhanced query to retrieve a set of documents that reflects the similarity between the sets of documents retrieved by the profile and by the query. Such an integration will permit the development of an expert search intermediary in which the user profile acts as a knowledge base in providing a context for the interpretation of the query to the database. .I 33 .T A Conceptual Model and Experiments on How People Classify and Retrieve Documents .A ALBERTO J. CANAS FRANK SAFAYENI DAVE W. CONRATH .W In other words, a better understanding of the way in which people classify and search for information may be instrumental for designing more effective automated systems. Cognitive psychologists have dealt with the problem of categorization for many years (e.g. Rosch, 1978) but not within the context of storage and retrieval. Their work has been based on the categorization of simple stimuli (e.g. geometric shapes) and has not addressed the problem of retrieval after categorization. The purpose of this paper is twofold: 1) To present a conceptual model of the cognitive processes in the storage and retrieval of documents. 2) To describe ongoing experiments and their potential implications in the design of automated systems. ------------------------------ END OF IRList Digest ********************