From vtisr1!irlistrq Thu Sep 18 18:31:17 1986
Date: Thu, 18 Sep 86 18:31:10 edt
From: vtisr1!irlistrq
To: fox
Subject: IRlist Digest V2 #44
Status: R

IRList Digest           Thursday, 18 September 1986      Volume 2 : Issue 44

Today's Topics:
   Email - Address of surveyor of work on automatic indexing
   Query - Sound-alike matching?
   COGSCI - Alvey Speech Input Workstation and Word Processor
   Abstracts - More from latest issue of ACM SIGIR Forum, Part 2

News addresses are
   ARPANET: fox%vt@csnet-relay.arpa
   BITNET: foxea@vtvax3.bitnet
   CSNET: fox@vt
   UUCPNET: seismo!vtisr1!irlistrq

----------------------------------------------------------------------

From: Mitchell Wyle
Date: Sat, 13 Sep 86 13:10:16 -0200
Subject: paper on research in automatic indexing

...
I just joined Professor H.P. Frei's IR group. I am preparing a
conspectus paper of the fundamentals and current research in automatic
indexing in IR. Thanks in advance.

-M Wyle

------------------------------

Date: Sat, 13 Sep 86 18:37:08 edt
From: sdpage%sevax.prg.oxford.ac.uk%sevax.prg.oxford.ac.uk@CS.UCL.AC.UK
Subject: Code to match sound-alike words?

I have vague memories of a code which will map two words which sound
alike onto each other. The typical application is an airline
reservations system, where a telephone caller could be saying "Smith"
or "Smyth" -- the database query system will match either. Can anyone
give me a reference to a code like this one? - Thanks.

Stephen Page
Programming Research Group -- Oxford
sdpage%prg.oxford.ac.uk@cs.ucl.ac.uk

[Note: some early work is described below:

Davidson, Leon. Retrieval of Misspelled Names in an Airlines Passenger
Record System. Commun. ACM, 5(5): 169-71, May 1962.

Greenfield, R.H. An Experiment to Measure the Performance of Phonetic
Key Compression Retrieval Schemes. Meth. Inform. Med., 16: 230-233,
1977.

Joseph, D.M. and Ruth L. Wong. Correction of Misspellings and
Typographic Errors in a Free-Text Medical English Information Storage
and Retrieval System. Meth. Inform.
Med., 18: 228-234 (sic?), 1979.

Perhaps others will comment on more recent articles. - Ed]

------------------------------

Date: Tue, 9 Sep 86 18:40:45 edt
From: DEJONG%OZ.AI.MIT.EDU@AI.AI.MIT.EDU
Subject: Cognitive Science Calendar

Date: Tuesday, 9 September 1986  11:09-EDT
From: AHAAS at G.BBN.COM

Thursday, 11 September  10:00am
Room: BBN 2nd Floor Large Conference Room, 10 Moulton St.

BBN ARTIFICIAL INTELLIGENCE SEMINAR

Interactive Incremental Speech Input: Interim Report on a
Linguistics/AI Approach to Speech Recognition

Henry Thompson
University of Edinburgh

The Alvey Large Scale Demonstrator Project entitled 'Speech Input
Workstation and Word Processor' is a British effort involving the
Plessey company and three universities, including Edinburgh, in the
construction of a demonstration prototype of a commercially viable
speech input system. Edinburgh is responsible for the speech
processing aspects of the project. In this talk I will try to cover
three things:

1) An overview of the systems architecture and methodology of our
work. We are committed to using explicit knowledge bases at as many
levels of the processing as possible, to employing parsing (active
chart based) in using those knowledge bases, and to supporting only
selective, as opposed to instructional, interaction between levels.

2) A brief report of the performance of our first milestone system,
which came up in June of this year, about 18 months into our five-year
effort.

3) A more detailed exposition of how we are employing parsing at the
segmentation and labelling level.

------------------------------

Date: Wed, 23 Jul 1986 13:06 CST
From: Vijay V. Raghavan
Subject: More SIGIR FORUM Abstracts [Part 2 - Ed]

[Note: Members of ACM SIGIR should have received the spring/summer
Forum, and can find these on pages 24-27. The remaining part will
appear in machine readable form in the next issue of IRList. - Ed]

ABSTRACTS
(Selected from recent issues of journals)

6.
STATISTICS IN INFORMATION RETRIEVAL EXPERIMENTS

V. E. Weissmann
Institut fur Angewandte Informatik, Technische Universitat Berlin
Projekt LIVE

Nearly all people use statistics, but very often in the wrong way. To
give some clues for the proper use of statistics, a framework will be
developed in this paper to help one understand the methodology of
applying statistics in IR experiments. The central idea of this
framework is that one should i) distinguish between two kinds of
models: an expert model and a mathematical-statistical model, and
ii) recognize that these two models are highly interdependent. The
argument for the need for these two models (and the distinction
between them) will follow a meta-scientific approach of J. D.
Sneed[1]. To make the numerous relationships in this framework more
comprehensible a graphical method called Isac is used.

(INFORMATION PROCESSING AND MANAGEMENT, Vol. 22, No. 1, pp. 29-37,
1986).

7.
INFORMATION RETRIEVAL IN AN OFFICE FILING FACILITY AND FUTURE WORK IN
PROJECT MINSTREL

A. F. Smeaton and C. J. Van Rijsbergen
University College Dublin, Department of Computer Science,
Belfield, Dublin 4, Ireland

In this paper we review filing and retrieval mechanisms for
unstructured and mixed media information in an office filing facility.
In particular, we concentrate on methods of filing and retrieval using
the content of the unstructured or free text parts of office objects,
but the state of the art in the handling of voice and image data is
also discussed. Two of the ways of implementing content retrieval of
free text are to search the text itself or to search some text
surrogate. Two of the problems associated with the latter method,
choice of an internal representation form and analysis of text into
this form, are detailed in the paper. Finally, an outline is given of
work to be done as part of Project Minstrel.

(INFORMATION PROCESSING AND MANAGEMENT, Vol. 22, No. 2, pp. 135-149,
1986).

8.
AN INDUCTIVE SEARCH SYSTEM: THEORY, DESIGN, AND IMPLEMENTATION

M. E. Maron and Paul Thompson
School of Library and Information Studies
University of California, Berkeley, CA 94720

Sean Curry
University of California
San Francisco, CA 94143

An automated information system that can accept requests for
information and, in response, selects and ranks by probability of
satisfaction the names of those people who can answer the input
queries is described. This information system (called Helpnet) is
based on new probabilistic design principles, which were previously
proposed (but never implemented) for the document retrieval problem.
Helpnet has now been implemented on an IBM Personal Computer. The
theoretical design principles used for Helpnet and the computer
programs used by this implementation of Helpnet are discussed. Also,
a preliminary sensitivity analysis is presented, which looks at the
question of how input errors influence the rankings at the output.
The probabilistic design principles used in Helpnet can be applied to
a much larger class of similar situations, which we call "inductive
search" situations.

(IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS, Vol. SMC-16,
No. 1, pp. 21-28, January/February 1986)

9.
MULTIPLE GENERATION TEXT FILES USING OVERLAPPING TREE STRUCTURES

F. Warren Burton
Department of Electrical Engineering and Computer Science,
University of Colorado at Denver, Denver, Colorado 80202, USA

Matthew M. Huntbach
Cognitive Studies, University of Sussex, Brighton, U.K.

J. (Yiannis) G. Kollias
Department of Computer Science, National Technical University of
Athens, 9 Heroon Polytechniou Avenue, Zografou Athens (624), Greece

When repeatedly editing a text file, one is often faced with a choice
of keeping previous generations for backup or deleting previous
generations to reduce storage requirements.
Since one generation of a text file is often very similar to the
previous generation, the above conflict can often be resolved by
sharing much of the common information. We propose using a tree
structure to represent a text file. Common subtrees can be shared.
Results of an experiment with one file are reported.

(THE COMPUTER JOURNAL, Vol. 28, No. 4, pp. 414-416, 1985)

10.
STRUCTURAL PROPERTIES OF THE STRING STATISTICS PROBLEM

A. Apostolico
Department of Computer Science, Purdue University,
West Lafayette, Indiana 47907

F. P. Preparata
Coordinated Science Laboratory, University of Illinois at
Urbana-Champaign, Urbana, Illinois 61801

A suitably weighted Index Tree such as a B-tree or a Suffix Tree can
be easily adapted to store, for a given string x and for all
substrings w of x, the number of distinct instances of w along x. The
storage needed is seen to be linear in the length of x: moreover, the
whole statistics can itself be derived in linear time, off-line on a
RAM. If the substring w has nontrivial periods, however, the number of
distinct instances might differ from that of distinct nonoverlapping
occurrences along x. It is shown here that O(n log n) storage units -
n standing for the length of x - are sufficient to organize this
second kind of statistics, in such a way that the maximum number of
nonoverlapping instances for arbitrary w along x can be retrieved in
a number of character comparisons not exceeding the length of w.

(JOURNAL OF COMPUTER AND SYSTEM SCIENCES 31, 394-411, 1985)

11.
A COMPARISON OF A NETWORK STRUCTURE AND A DATABASE SYSTEM USED FOR
DOCUMENT RETRIEVAL

W. Bruce Croft
Thomas J. Parenty
Computer and Information Science Department,
University of Massachusetts, Amherst, MA 01003

Database systems have many advantages for implementing document
retrieval systems. One of the main advantages would be the
integration of data and text handling in a single information system.
However, it has not been clear how much a database implementation
would cost in terms of efficiency. In this paper, we compare a
database implementation and a stand-alone implementation of a
flexible representation of the content of documents and the
associated search strategies. The representation used is a network of
document and index term nodes. The comparison shows that certain
features of a database system can have a significant effect on the
efficiency of the implementation. Despite this, it appears that a
database implementation of a sophisticated document retrieval system
can be competitive with a stand-alone implementation.

(INFORM. SYSTEMS Vol. 10, No. 4, pp. 377-390, 1985)

12.
A NOTE ON NATURAL SELECTION

Wlodzimierz Dobosiewicz
Department of Computing and Information Science,
University of Guelph, Guelph, Ontario N1G 2W1, Canada

Replacement selection is the most popular algorithm used in the
creation of initial runs for a sort/merge external sort. In 1972,
Frazer and Wong suggested a variation, called natural selection,
which uses an auxiliary memory reservoir to increase the performance
of replacement selection. Natural selection produces longer runs than
replacement selection if the auxiliary memory reservoir is
sufficiently large, but it behaves very strangely when the size of
the auxiliary memory is small: while using more memory resources than
replacement selection, it creates shorter runs, thus being less
efficient. As it turns out, this deficiency can be avoided at low
cost. This note presents a variation of natural selection that is
efficient when the auxiliary memory is small.

(INFORMATION PROCESSING LETTERS 21 (1985) 239-243)

------------------------------

END OF IRList Digest
********************