IRList Digest           Friday, 28 November 1986      Volume 2 : Issue 62

Today's Topics:
   Article - Automatic Indexing of Text For IR: A Conspectus (parts)

News addresses are
   ARPANET: fox%vt@csnet-relay.arpa
   BITNET: foxea@vtvax3.bitnet
   CSNET: fox@vt
   UUCPNET: seismo!vtisr1!irlistrq

----------------------------------------------------------------------

Date: Thu, 13 Nov 86 11:49:21 -0100
From: Wyle
Subject: DRAFT of conspectus paper follows ...

The conspectus paper is quickly becoming stale news. We just got the proceedings of the Pisa conference and I would like to add them to this paper, but my neural network model is taking up all of my non-teaching time. Anyway, I think chapter 3 is appropriate for IRlist digest, and I would really appreciate critique, comments, criticisms, etc.

By the time this paper is distributed, the links between chunet and csnet, and chunet and arpa should make my e-mail address:

   wyle@ifi.ethz.chunet.csnet or
   wyle@ifi.ethz.chunet.arpa

but the old addresses:

   ...!decvax!seismo!mcvax!cernvax!ethz!Wyle
   Wyle%ifi.ethz.cernvax.

should still work ok. Without further ado, here is the current DRAFT of my conspectus paper:

[Note: I have included part of section 1, and section 4 (bibliography) as well as an aid to the reader. I have left in ^L (formfeeds) for those who print these out and hope that won't hurt others. - Ed]

Automatic Indexing of Text For Information Retrieval: A Conspectus

M.F. Wyle
Institut fuer Informatik
Swiss Federal Institute of Technology
ETH / SOT
8092 Zuerich, Switzerland

1. Introduction

The problems of indexing text have been in human civilization since the earliest collections of written language. It is not possible to read everything we would like, nor to find the specific things we would like to read. In our current information revolution, it is becoming much easier to store and process larger and larger amounts of text.
As the quantity of text increases, the quality of its indexing must also increase, in order to discriminate between the concepts contained in the different texts. It is not clear that the quality of current indexing methods is adequate to meet the challenge created by our information revolution.

In this paper, we shall try to summarize the current state of indexing methods and research. This first section is a very brief summary of what automatic indexing is, and our perception of current indexing methods. The second section summarizes recent publications, and the final section outlines possible directions of future research.

...

[Note: rest of section 1 and all of section 2 are omitted. - Ed]

3. Current Avenues of Research

The cost of computing power continues to drop with the advent of new technologies. The deficiencies of current text indexing methods, which now plague only a few users, will soon become apparent to everyone. The use of write once, read mostly (WORM) media will soon be common, using compact disc technology. These devices are very dense (500-600 Mbytes) and inexpensive. It will therefore be quite simple to have a world telephone book on one 7 cm disk, the collected publications of the ACM on another, and a large encyclopedia on a third. How can we index this information effectively?

Our conspectus of the current research does not yield any instant solutions. However, there are some promising results which deserve further study and analysis.

3.1 Thesauri

The use of thesauri in text indexing is one important area which has not yet received the attention it deserves. The formal mathematical models proposed by Schaeuble [Schae86] could be used to construct software which will ensure the logical consistency of a thesaurus. A consistent thesaurus could then be used to index text into consistent, specific concept categories, which may in turn produce great performance improvements.
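As an illustration of what software for checking the logical consistency of a thesaurus might look like, here is a minimal sketch in Python. It does not implement the formal model of [Schae86], which is unpublished. The three rules it checks (every referenced term is itself an entry, related-term links are symmetric, and the broader-term hierarchy contains no cycle) are assumptions chosen only for this example, as are the names find_inconsistencies, broader, and related and the sample terms.

```python
from typing import Dict, List, Set


def find_inconsistencies(
    broader: Dict[str, Set[str]],
    related: Dict[str, Set[str]],
) -> List[str]:
    """Return a list of rule violations; an empty list means none were found.

    broader maps a term to its broader terms (BT links), related maps a
    term to its related terms (RT links). Both are hypothetical structures
    used only for this sketch.
    """
    problems: List[str] = []
    terms = set(broader) | set(related)

    # Rule 1: every term referenced by a link must itself be an entry.
    for label, table in (("BT", broader), ("RT", related)):
        for term in sorted(table):
            for target in sorted(table[term]):
                if target not in terms:
                    problems.append(f"{label}: {term} -> {target} (undefined term)")

    # Rule 2: related-term links must be symmetric.
    for term in sorted(related):
        for target in sorted(related[term]):
            if term not in related.get(target, set()):
                problems.append(f"RT: {term} -> {target} (not symmetric)")

    # Rule 3: the broader-term hierarchy must not contain a cycle.
    # Depth-first search; a term is "active" while it is on the search path.
    state: Dict[str, str] = {}

    def visit(term: str) -> None:
        state[term] = "active"
        for parent in sorted(broader.get(term, set())):
            if state.get(parent) == "active":
                problems.append(f"BT: {term} -> {parent} (closes a cycle)")
            elif parent not in state:
                visit(parent)
        state[term] = "done"

    for term in sorted(broader):
        if term not in state:
            visit(term)
    return problems


# A tiny made-up thesaurus containing two deliberate errors.
broader = {
    "information retrieval": {"information science"},
    "automatic indexing": {"information retrieval"},
    "information science": {"automatic indexing"},  # error: closes a loop
}
related = {
    "automatic indexing": {"thesaurus"},
    "thesaurus": set(),  # error: the link back is missing
}
for problem in find_inconsistencies(broader, related):
    print(problem)
```

A checker of this kind could be run whenever an entry is added or changed, so that descriptors are only assigned from a thesaurus that passes all rules. Real thesauri have more relation types (for example use/used-for), which would need rules of their own.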
One major drawback in the use of hierarchical dictionaries and thesauri is inflexibility. Natural language is highly dynamic, and the descriptors used to convey concepts change. In addition, completely new concepts and words are introduced into the language at an accelerating rate. The cost of maintaining a large dictionary or thesaurus is therefore high. However, new tools [Dome86] are making the maintenance of such structures much easier.

Most existing thesauri are inconsistent and under-used. In some of the large bibliographic retrieval systems, indexing includes the manual assignment of thesaurus descriptors. These systems use thesauri only to enhance recall. However, thesauri could also be used to enhance precision through narrower descriptors. Consistent thesauri could be used to maintain concept consistency in indexing descriptors. The feasibility of using thesauri more effectively in automatic indexing is still an open research question.

3.2 Performance Testing

Many indexing systems and techniques were compared using a standard document and query base during experiments in the early 1970's [Salt83]. The AI methods surfacing now have not been evaluated using this standard, and comparison is therefore not possible. Another pressing need in automatic indexing research is therefore the establishment, implementation, and use of a set of standards, so that the performance of the latest indexing systems can be compared with each other and with existing ones.

Performance experiments are also needed to compare the indexing capabilities of the emerging parallel hardware. Experiments must be performed to benchmark new processors and algorithms in terms of their indexing capabilities, not just their arithmetic processing power.

3.3 Artificial Intelligence Applications in Text Indexing

The SMART [Salt83] system attempted unsuccessfully to use syntactic analysis methods to recognize phrases in queries and documents, and to use the phrases as indexing units.
Salton concluded that these syntactic methods do not provide improvements over standard retrieval using a thesaurus. However, recent work by Smeaton [Smea86] shows improved retrieval performance by linguistically parsing query and document text as part of the retrieval strategy. The continuously improving parsers developed in natural language processing can be used to better index documents and queries to improve retrieval performance.

Although more limited in scope, an expert system and semantic network constructed by Shoval [Shov85] uses term relations and search rules to help a user find appropriate search terms for a query. Similarly, a knowledge based system approach to document retrieval is presented by Biswas et al. [Bisw85]. An important research area in AI appears to be the automation of the construction of semantic networks and knowledge bases.

3.4 Associative Networks

An associative network may be used to address the indexing problem in the following way: Construct a large associative network, perhaps using a Boltzmann [Hint84] model. At key nodes, load descriptors and document identifiers from a document base. Then load queries as input to the network, and fix the output at the correct answers to these queries. Allow the network to settle, and examine the resulting connections. These connections will have implemented an algorithm which has perfect performance on the given queries and documents.

The associations between nodes may lead to important insights into unconscious or non-obvious connections between descriptors in a document base. The network will also "discover" indirect connections between descriptors which are not apparent but very useful in indexing. The network itself may be used as an indexing system, or as a method of automatically assigning relevance weights to descriptors in order to enhance an existing indexing strategy.

Descriptors in the network need not be limited to word stems.
They could be syntactic traces, or thesaurus terms, or some combination of descriptor types. If a sufficient set of queries and correct results to a large document collection is made available, an associative network model may give enormous insight into descriptor relations.

4.0 References

[Addi83] Addis, T R and L Johnson, "Knowledge for Machines," in The Fifth Generation Computer Project, State of the Art Report, 1983, Pergamon Infotech, Ltd, Maidenhead, Berkshire, England.
[Bae84] Baertschi, M, "Term Dependence in information retrieval models," PhD Thesis, ETH, Zuerich, 1984.
[Bisw85] Biswas, G, V Subramanian, and J C Bezdek, "A knowledge based system approach to document retrieval," Second Conference on Artificial Intelligence Applications: The Engineering of Knowledge-Based Systems, IEEE Comput. Soc. Press, Washington, DC, 11-13 Dec. 1985.
[Bos85] Bose, P K and M Rajinikanth, "KARMA: knowledge-based assistant to a database system," Second Conference on Artificial Intelligence Applications: The Engineering of Knowledge-Based Systems, IEEE Computer Society Press, Washington, DC, 11-13 Dec. 1985.
[Broo85] Brooks, H M, P J Daniels, and N J Belkin, "Problem descriptions and user models: developing an intelligent interface for document retrieval systems," in Advances in Intelligent Retrieval: INFORMATICS 8. Proceedings of an Aslib/British Computer Society Joint Conference, 16-17 April 1985, Oxford, England, pp. 191-214, Aslib, London.
[Brow85] Brownstein, M, "Managing information intelligently [Quantum's Knowledge Management System]," Hardcopy, vol. 14, no. 11, pp. 139-141, November 1985.
[Chig85] Chignell, M H, A Loewenthal, and P A Hancock, "Intelligent interface design," IEEE 1985 Proceedings of the International Conference on Cybernetics and Society, pp. 620-3, Tucson (12-15 November 1985).
[Chud84] Chudacek, J, "Non-grammatical Language Processing," Preprint Institute TNO for Mathematics, Information Processing, and Statistics, The Hague, Netherlands (1984).
[Damo85] D'Amore, R J, Mah, C P, "One Time Complete Indexing of Text: Theory and Practice," Proceedings of the 8th Annual International ACM SIGIR Conference, Montreal (1985).
[DeHe74] De Heer, T, "The Application of the Concept of Homeosemy to Natural Language Information Retrieval," Information Processing and Management, vol. 18, no. 5 (1982).
[Dien85] Diener, R A V, "Relational knowledge structures: a structural model of information for research and retrieval," in Challenges to an Information Society. Proceedings of the 47th ASIS Annual Meeting, Philadelphia, Pennsylvania, October 1985.
[Dome86] Domenig, M, Shann, P, "Towards a Dedicated Database Management System for Dictionaries," Proc. 11th International Conference on Computational Linguistics, August 25-29 1986, IKP Universitaet Bonn.
[Fren86] Frenkel, K A, "Evaluating Two Massively Parallel Machines," Comm ACM, vol. 29, no. 8 (August 1986).
[Giri84] Girill, T R, "Online Access Aids for Documentation: A Bibliographic Outline," ACM SIGUCCS 12th User Services Conference, Reno, Nevada (12 November 1984).
[Hint84] Hinton, G E, Sejnowski, T J, and Ackley, D H, "Boltzmann Machines: Constraint Satisfaction Networks that Learn," Technical Report CMU-CS-84-119, Carnegie Mellon University (May 1984).
[Huff52] Huffman, D, "A Method for the Construction of Minimum-Redundancy Codes," Proc. IRE, vol. 40, pp. 1098-1101 (September 1952).
[Jona84] Jonak, Z, "Automatic Indexing of Full Texts," Information Processing and Management, vol. 20, no. 5-6, pp. 619-627, 1984.
[Kuhl83] Kuhlen, R, "Natural language research," SIGART Newsletter, no. 83, pp. 20-21, January 1983.
[Kwok84] Kwok, K L, "A document-document similarity measure based on cited titles and probability theory, and its application to relevance feedback retrieval," in Research and Development in Information Retrieval, The British Computer Society Workshop Series, University Press, Cambridge (1984).
[McCu85] McCune, B P, R M Tong, J S Dean, and D G Shapiro, "RUBRIC: a system for rule-based information retrieval," IEEE Trans. Software Eng., vol. SE-11, no. 9, pp. 939-945, Mountain View, CA, September 1985.
[Medl85] Meder, N, "Artificial intelligence as a tool of classification, or: the network of language games as cognitive paradigm," Int. Classif. (Germany), vol. 12, no. 3, pp. 128-132, 1985.
[Mite85] Mitev, N N and S Walker, "Information retrieval aids in an online public access catalogue: automatic intelligent search sequencing," in Advances in Intelligent Retrieval: INFORMATICS 8. Proceedings of an Aslib/British Computer Society Joint Conference, 16-17 April 1985, Oxford, England, pp. 215-26, Aslib, London, 1985.
[Mris86] Morris, D A, "GEFILE, The Electronic File Cabinet," General Electric Company Silicon Systems Technology Department Press Release (August 1986).
[Morr85] Morrissey, J M, "Interactive Querying Techniques for an Office Filing Facility," Information Processing and Management, vol. 22, no. 2, pp. 121-34, 1986.
[Pate84] Patel-Schneider, P, Brachman, R, Levesque, H, "ARGON: Knowledge Representation Meets Information Retrieval," Fairchild Technical Report No 654 (September 1984).
[Salt83] Salton, G and M J McGill, Introduction to modern information retrieval, McGraw Hill International Book Company, Paris, 1983.
[Salt84] Salton, G, "Extended boolean information retrieval - an outline," in National Online Meeting 1984, ed. T H Hogan, pp. 339-346, Learned Information, Inc., Medford (1984).
[Salt86] Salton, G, "Another Look At Automatic Text Retrieval Systems," Comm ACM, vol. 29, no. 7 (July 1986), pp. 648-656.
[Schae86] Schaeuble, P, Frei, H P, "Thesauri in Information Retrieval," unpublished.
[Schn83] Schneider, C, "Syntaktische Relationen in der automatischen Indexierung zur Relationierung von Deskriptoren am Beispiel juristischer Dokumente," PhD Dissertation, Regensburg, 1983.
[Shov85] Shoval, P, "Principles, procedures, and rules in an expert system for information retrieval," Information Processing and Management, vol. 21, no. 6, pp. 475-487, 1985.
[Smea86] Smeaton, A F, "Incorporating Syntactic Information Into a Document Retrieval Strategy: An Investigation," prepublication.
[Teuf86] Teufel, B, Schmidt, S, "Full Text Retrieval Based on Syntactic Similarities," unpublished.
[Wong85] Wong, S K M, and Ziarko, W, "On Generalized Vector Space Model In Information Retrieval," Ann Soc Math Series IV, Fundam Inf, vol. 8, no. 2, pp. 253-267 (1985).
[Zamo81] Zamora, E M, Pollack, J J, Zamora, A, "The Use of Trigram Analysis for Spelling Error Detection," Information Processing and Management, vol. 17, no. 6 (1981).

------------------------------

END OF IRList Digest
********************