IRList Digest           Friday, 28 November 1986      Volume 2 : Issue 62

Today's Topics:
   Article - Automatic Indexing of Text For IR:  A Conspectus (parts)

News addresses are ARPANET: fox%vt@csnet-relay.arpa  BITNET: foxea@vtvax3.bitnet
   CSNET: fox@vt   UUCPNET: seismo!vtisr1!irlistrq

----------------------------------------------------------------------
     
Date: Thu, 13 Nov 86 11:49:21 -0100
From: Wyle <seismo!mcvax!ifi.ethz.chunet!wyle>
Subject: DRAFT of conspectus paper follows

...
The conspectus paper is quickly becoming stale news.  We just got the
proceedings of the Pisa conference and I would like to add them to this
paper, but my neural network model is taking up all of my non-teaching
time.

Anyway, I think chapter 3 is appropriate for IRlist digest, and I would
really appreciate critique, comments, criticisms, etc.

By the time this paper is distributed, the links between chunet and
csnet, and chunet and arpa should make my e-mail address:

wyle@ifi.ethz.chunet.csnet    or
wyle@ifi.ethz.chunet.arpa

but the old addresses:

...!decvax!seismo!mcvax!cernvax!ethz!Wyle
Wyle%ifi.ethz.cernvax.<many domains>

should still work ok.

Without further ado, here is the current DRAFT of my conspectus paper:

[Note: As an aid to the reader, I have also included part of section 1
and section 4 (bibliography).  I have left in ^L (formfeeds) for those
who print these out, and hope that won't hurt others. - Ed]




Automatic Indexing of Text For Information Retrieval:  A Conspectus


M.F. Wyle
Institut fuer Informatik
Swiss Federal Institute of Technology
ETH / SOT
8092 Zuerich, Switzerland


1.  Introduction

The problems of indexing text have been with human civilization since the
earliest collections of written language.  It is not possible to
read everything we would like, nor to find the specific things we would
like to read.  In our current information revolution, it is becoming
much easier to store and process larger and larger amounts of text.  As
the quantity of text increases, the quality of its indexing must also
increase, in order to discriminate between the concepts contained in the
different texts.  It is not clear that the quality of current indexing
methods is adequate to meet the challenge created by our information
revolution.

In this paper, we shall try to summarize the current state of indexing
methods and research.  This first section is a very brief summary of
what automatic indexing is, and our perception of current indexing
methods.  The second section summarizes recent publications, and the
final section outlines possible directions of future research.

... [Note: the rest of section 1 and all of section 2 are omitted. - Ed]




3.  Current Avenues of Research

The cost of computing power continues to drop with the advent of new
technologies.  The deficiencies of today's text indexing methods, which
now plague only a few users, will soon become apparent to everyone.
Write once, read mostly (WORM) media based on compact disc technology
will soon be common.  These devices are very dense (500-600 Mbytes) and
inexpensive.  It will therefore be quite simple to
have a world telephone book on one 7 cm disk, the collected publications
of the ACM on another, and a large encyclopedia on a third.  How can we
index this information effectively?  Our conspectus of the current
research does not yield any instant solutions.  However, there are some
promising results which deserve further study and analysis.

3.1  Thesauri

The use of thesauri in text indexing is one important area which has not
yet received the attention it deserves.  The formal mathematical models
proposed by Schaeuble [Schae86] could be used to construct software
which ensures the logical consistency of a thesaurus.  A consistent
thesaurus could then be used to index text into consistent, specific
concept categories, which may in turn improve retrieval performance
considerably.
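
As a rough illustration of what an automatic consistency check might
look like, the Python sketch below tests two simple properties of a
thesaurus: every broader term must itself be an entry, and the
broader-term hierarchy must contain no cycles.  This is not the formal
model of [Schae86].  The storage layout (a mapping from each term to
its broader terms), the name find_inconsistencies, and the sample terms
are all assumptions made for the example.

```python
def find_inconsistencies(broader):
    """Return (undefined, cyclic): undefined broader terms and terms on a cycle.

    `broader` maps each term to a list of its broader terms.  This layout is
    an assumption made for the sketch, not a standard thesaurus format.
    """
    undefined = sorted({b for bs in broader.values() for b in bs
                        if b not in broader})

    cyclic = set()
    state = {}  # term -> "visiting" or "done"

    def visit(term, path):
        if state.get(term) == "done" or term not in broader:
            return
        if state.get(term) == "visiting":
            # Everything from the first occurrence of `term` on the path
            # lies on a cycle.
            cyclic.update(path[path.index(term):])
            return
        state[term] = "visiting"
        for parent in broader[term]:
            visit(parent, path + [term])
        state[term] = "done"

    for term in broader:
        visit(term, [])
    return undefined, sorted(cyclic)


# Invented sample thesaurus containing one cycle and one dangling reference.
thesaurus = {
    "retrieval": ["information science"],
    "indexing": ["retrieval"],
    "information science": ["indexing"],   # closes a cycle
    "thesaurus": ["vocabulary control"],   # broader term never defined
}
undefined, cyclic = find_inconsistencies(thesaurus)
print(undefined)
print(cyclic)
```

A real thesaurus has further relations (related terms, preferred and
non-preferred terms) whose consistency rules would need similar checks.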

One major drawback in the use of hierarchical dictionaries and thesauri
is inflexibility.  Natural language is highly dynamic, and the
descriptors used to convey concepts change.  In addition, completely new
concepts and words are introduced into the language at an accelerating
rate.  The cost of maintaining a large dictionary or thesaurus is
therefore high.   However, new tools [Dome86] are making the maintenance
of such structures much easier.

Most existing thesauri are inconsistent and under-used.  In some of the
large bibliographic retrieval systems, indexing includes the manual
assignment of thesaurus descriptors.  These systems use thesauri only
to enhance recall.  However, thesauri could also be used to enhance
precision through narrower descriptors.  Consistent thesauri could be
used to maintain concept consistency in indexing descriptors.  The
feasibility of using thesauri more effectively in automatic indexing is
still an open research question.
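
The two uses of a thesaurus contrasted above can be sketched as
follows.  The sketch assumes a toy thesaurus with synonym and
narrower-term relations and a toy index of manually assigned
descriptors.  All terms, document identifiers, and function names are
invented for illustration.

```python
# Toy thesaurus; the terms and relations are invented for illustration.
SYNONYMS = {"car": {"automobile"}}
NARROWER = {"car": {"sedan", "convertible"}}

# Toy index: document id -> set of descriptors assigned to it.
INDEX = {
    "d1": {"car"},
    "d2": {"automobile"},
    "d3": {"sedan"},
    "d4": {"truck"},
}


def search(terms):
    """Return ids of documents indexed with at least one of `terms`."""
    return sorted(doc for doc, descs in INDEX.items() if descs & set(terms))


def expand_for_recall(term):
    """Add synonyms and narrower terms, so that more documents match."""
    return {term} | SYNONYMS.get(term, set()) | NARROWER.get(term, set())


def narrow_for_precision(term, wanted):
    """Replace a broad term by one chosen narrower descriptor, if it exists."""
    return {wanted} if wanted in NARROWER.get(term, set()) else {term}


print(search({"car"}))                               # plain query
print(search(expand_for_recall("car")))              # recall-oriented
print(search(narrow_for_precision("car", "sedan")))  # precision-oriented
```

Expansion enlarges the answer set, while narrowing replaces the broad
descriptor by a more specific one and so shrinks it.  Whether either
step helps in practice depends on the consistency of the thesaurus and
of the indexing.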



3.2  Performance Testing

Many indexing systems and techniques were compared using a standard
document and query base during experiments in the early 1970s
[Salt83].  The AI methods surfacing now have not been evaluated against
this standard, so they cannot be compared with the earlier systems.
Another pressing need in automatic indexing research is therefore the
establishment, implementation, and use of a set of standards for
comparing the performance of the latest indexing systems with each
other and with existing ones.
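
A minimal sketch of such a comparison is given below.  It assumes a
fixed set of queries with relevance judgments and scores each system by
precision and recall averaged over the queries.  The two "systems" are
hard-wired stand-ins, and the function names and data are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(system, queries, judgments):
    """Average precision and recall of `system` over a fixed query base.

    `system` is any callable mapping a query to a list of document ids;
    `judgments` maps each query to the ids judged relevant.
    """
    scores = [precision_recall(system(q), judgments[q]) for q in queries]
    n = len(scores)
    return (sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n)


# Invented test collection: two queries with relevance judgments.
judgments = {"q1": {"d1", "d2"}, "q2": {"d3"}}

# Two stand-in "indexing systems", hard-wired for the example.
system_a = {"q1": ["d1", "d4"], "q2": ["d3"]}.get
system_b = {"q1": ["d1", "d2"], "q2": ["d3", "d4"]}.get

print(evaluate(system_a, ["q1", "q2"], judgments))
print(evaluate(system_b, ["q1", "q2"], judgments))
```

Because both systems are run against the same queries and the same
judgments, their scores are directly comparable, which is the point of
a shared standard.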

Performance experiments are also needed to compare the indexing
capabilities of the emerging parallel hardware.  Experiments must be
performed to benchmark new processors and algorithms in terms of their
indexing capabilities, not just their arithmetic processing power.

3.3  Artificial Intelligence Applications in Text Indexing

The SMART system [Salt83] attempted unsuccessfully to use syntactic
analysis methods to recognize phrases in queries and documents, and to
use the phrases as indexing units.  Salton concluded that these
syntactic methods do not provide improvements over standard retrieval
using a thesaurus.  However, recent work by Smeaton [Smea86] shows
improved retrieval performance when query and document text are parsed
linguistically as part of the retrieval strategy.  The continuously
improving parsers developed in natural language processing may
therefore allow documents and queries to be indexed better, and
retrieval performance to be improved.

Although more limited in scope, an expert system and semantic network
constructed by Shoval [Shov85] uses term relations and search rules to
help a user find appropriate search terms for a query.  Similarly, a
knowledge-based system approach to document retrieval is presented by
Biswas et al. [Bisw85].  An important research area in AI appears to be
the automation of the construction of semantic networks and knowledge
bases.




3.4  Associative Networks

An associative network may be used to address the indexing problem in
the following way:

Construct a large associative network, perhaps using a Boltzmann
machine [Hint84] model.  At key nodes, load descriptors and document
identifiers from a document base.  Then load queries as input to the
network, and fix the output at the correct answers to these queries.
Allow the network to settle, and examine the resulting connections.
Ideally, these connections will have implemented an algorithm which
performs perfectly on the given queries and documents.  The
associations between nodes may lead to important insights into
unconscious or non-obvious connections between descriptors in a
document base.  The network may also "discover" indirect connections
between descriptors which are not apparent but very useful in indexing.
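
The procedure can be illustrated on a very small scale.  The sketch
below is not a Boltzmann machine: to keep it short and deterministic it
substitutes a single layer of weights between descriptor nodes and
document nodes, trained with the simple delta rule.  Queries are
presented as input, the outputs are held at the correct answers, and
the weights are adjusted until the network reproduces them.  The
descriptors, documents, training pairs, and all names are invented.

```python
DESCRIPTORS = ["index", "thesaurus", "parser", "network"]
DOCUMENTS = ["d1", "d2", "d3"]

# Invented training data: (query descriptors, documents judged relevant).
TRAINING = [
    ({"index", "thesaurus"}, {"d1"}),
    ({"parser"}, {"d2"}),
    ({"network"}, {"d3"}),
    ({"index", "parser"}, {"d1", "d2"}),
]


def encode(items, vocabulary):
    """Binary vector with a 1 for each vocabulary entry present in `items`."""
    return [1.0 if v in items else 0.0 for v in vocabulary]


def activate(weights, query):
    """Raw output of every document node for one query."""
    x = encode(query, DESCRIPTORS)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def train(epochs=500, rate=0.2):
    """Fit one weight per (document, descriptor) pair with the delta rule."""
    weights = [[0.0] * len(DESCRIPTORS) for _ in DOCUMENTS]
    for _ in range(epochs):
        for query, relevant in TRAINING:
            x = encode(query, DESCRIPTORS)
            target = encode(relevant, DOCUMENTS)
            output = activate(weights, query)
            for d, row in enumerate(weights):
                error = target[d] - output[d]
                for i, xi in enumerate(x):
                    row[i] += rate * error * xi
    return weights


def retrieve(weights, query, threshold=0.5):
    """Documents whose node activation exceeds `threshold`."""
    out = activate(weights, query)
    return {doc for doc, a in zip(DOCUMENTS, out) if a > threshold}


weights = train()
for query, relevant in TRAINING:
    print(sorted(query), sorted(retrieve(weights, query)), sorted(relevant))

# Inspect the learned descriptor-to-document associations.
for doc, row in zip(DOCUMENTS, weights):
    print(doc, [round(w, 2) for w in row])
```

On this toy data the trained network should return exactly the
relevant documents for each training query.  The weight linking
"thesaurus" to d1 should come out near zero, because "index" alone
accounts for d1 in the training pairs.  Reproducing the training
queries says nothing about queries the network has not seen.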

The network itself may be used as an indexing system, or as a method of
automatically assigning relevance weights to descriptors in order to
enhance an existing indexing strategy.  Descriptors in the network need
not be limited to word stems.  They could be syntactic traces, or
thesaurus terms, or some combination of descriptor types.  If a
sufficient set of queries and correct results for a large document
collection is made available, an associative network model may give
enormous insight into descriptor relations.



4.  References

[Addi83]	Addis, T R and L Johnson, "Knowledge for Machines," in The
		Fifth Generation Computer Project, State of the Art
		Report, 1983, Pergamon Infotech, Ltd, Maidenhead, Berkshire,
		England.

[Bae84]		Baertschi, M, "Term Dependence in information retrieval
		models,"  PhD Thesis, ETH, Zuerich, 1984.

[Bisw85]	Biswas, G, V Subramanian, and J C Bezdek, "A knowledge
		based system approach to document retrieval," Second
		Conference on Artificial Intelligence Applications:
		The Engineering of Knowledge-Based Systems, IEEE Comput.
		Soc. Press, Washington, DC, 11-13 Dec. 1985.

[Bos85] 	Bose, P K and M Rajinikanth, "KARMA: knowledge-based
		assistant to a database system," Second Conference on
		Artificial Intelligence Applications: The Engineering
		of Knowledge-Based Systems, IEEE Computer Society
		Press, Washington, DC, 11-13 Dec. 1985.




[Broo85]	Brooks, H M, P J Daniels, and N J Belkin, "Problem
		descriptions and user models: developing an intelligent
		interface for document retrieval systems," in Advances
		in Intelligent Retrieval: INFORMATICS 8.  Proceedings
		of an Aslib/British Computer Society Joint Conference,
		16-17 April 1985, Oxford, England, pp. 191-214, Aslib,
		London.

[Brow85]	Brownstein, M, "Managing information intelligently
		[Quantum's Knowledge Management System],"
		Hardcopy, vol. 14, no. 11, pp. 139-141, November 1985.

[Chig85]	Chignell, M H, A Loewenthal, and P A Hancock,
		"Intelligent interface design," IEEE 1985 Proceedings
		of the International Conference on Cybernetics and
		Society, pp. 620-3, Tucson, (12-15 November 1985).

[Chud84]	Chudacek, J, "Non-grammatical Language Processing,"
		Preprint Institute TNO for Mathematics, Information
		Processing, and Statistics, The Hague, Netherlands (1984).

[Damo85]	D'Amore, R J, Mah, C P, "One Time Complete Indexing
		of Text:  Theory and Practice," Proceedings of the 8th
		Annual International ACM SIGIR Conference,
		Montreal, (1985).

[DeHe74]	De Heer, T, "The Application of the Concept of
		Homeosemy to Natural Language Information
		Retrieval," Information Processing and Management
		vol 18 no 5 (1982).

[Dien85]	Diener, R A V, "Relational knowledge structures: a
		structural model of information for research and
		retrieval," in Challenges to an Information Society.
		Proceedings of the 47th ASIS Annual Meeting,
		Philadelphia, Pennsylvania, October 1985.

[Dome86]	Domenig, M., Shann, P., "Towards a Dedicated Database
		Management System for Dictionaries," Proc. 11th International
		Conference on Computational Linguistics, August 25-29 1986,
		IKP Universitaet Bonn.

[Fren86]	Frenkel, K A, "Evaluating Two Massively Parallel
		Machines," Comm ACM vol 29 no 8, (August 1986).

[Giri84]	Girill, T R, "Online Access Aids for Documentation:
		A Bibliographic Outline,"  ACM SIGUCCS 12th User
		Services Conference, Reno, Nevada (12 November 1984).




[Hint84]	Hinton, G E, Sejnowski, T J, and Ackley, D H, "Boltzmann
		Machines:  Constraint Satisfaction Networks that Learn,"
		Technical Report CMU-CS-84-119, Carnegie Mellon
		University,  (May 1984).

[Huff52]	Huffman, D, "A Method for the Construction of
		Minimum Redundancy Codes," Proc. IRE, v 40, p 1098-1101
		(September 1952).

[Jona84]	Jonak, Z, "Automatic Indexing of Full Texts,"
		Information Processing and Management, vol. 20, no.
		5-6, pp. 619-627, 1984.

[Kuhl83]	Kuhlen, R, "Natural language research," SIGART
		Newsletter, no. 83, pp. 20-21, January 1983.

[Kwok84]	Kwok, K L, "A document-document similarity measure based
		on cited titles and probability theory, and its application to
		relevance feedback retrieval," in Research and Development
		in Information Retrieval, The British Computer Society
		Workshop Series, University Press, Cambridge, (1984).

[McCu85]	McCune, B P, R M Tong, J S Dean, and D G Shapiro,
		"RUBRIC: a system for rule-based information
		retrieval," IEEE Trans. Software Eng., vol. SE-11, no. 9,
		pp. 939-945, Mountain View, CA, September 1985.

[Mede85]	Meder, N, "Artificial intelligence as a tool of
		classification, or: the network of language games as
		cognitive paradigm," Int. Classif. (Germany), vol. 12,
		no. 3, pp. 128-132, 1985.

[Mite85]	Mitev, N N and S Walker, "Information retrieval aids
		in an online public access catalogue: automatic
		intelligent search sequencing," in Advances in
		Intelligent Retrieval: INFORMATICS 8.  Proceedings of
		an Aslib/British Computer Society Joint Conference,
		16-17 April 1985, Oxford, England, pp. 215-26, Aslib,
		London, 1985.

[Mris86]	Morris, D A, "GEFILE, The Electronic File Cabinet,"
		General Electric Company Silicon Systems Technology
		Department Press Release, (August 1986).

[Morr85]	Morrissey, J M, "Interactive Querying Techniques for
		an Office Filing Facility," Information Processing and
		Management, vol. 22, no. 2, pp. 121-34, 1986.




[Pate84]	Patel-Schneider, P, Brachman, R, Levesque, H, "ARGON:
		Knowledge Representation Meets Information
		Retrieval," Fairchild Technical Report No 654,
		(September 1984).

[Salt83]	Salton, G and M J McGill, Introduction to modern
		information retrieval, McGraw Hill International Book
		Company, Paris, 1983.

[Salt84]	Salton, G, "Extended boolean information retrieval - an
		outline," in National Online Meeting 1984, ed. T H Hogan,
		p 339-346, Learned Information, Inc., Medford (1984).

[Salt86]	Salton, G, "Another Look At Automatic Text Retrieval
		Systems," Comm ACM vol 29 no 7 (July 1986), p 648-656.

[Schae86]	Schaeuble, P, Frei, H P, "Thesauri in Information
		Retrieval,"  unpublished.

[Schn83]	Schneider, C, "Syntaktische Relationen in der
		automatischen Indexierung zur Relationierung von
		Deskriptoren am Beispiel juristischer Dokumente," PhD
		Dissertation, Regensburg, 1983.

[Shov85]	Shoval, P, "Principles, procedures, and rules in an
		expert system for information retrieval," Information
		Processing and Management, vol. 21, no. 6, pp. 475-487,
		1985.

[Smea86]	Smeaton, A F, "Incorporating Syntactic Information
		Into a Document Retrieval Strategy:  An Investigation,"
		prepublication.

[Teuf86]	Teufel, B, Schmidt, S, "Full Text Retrieval Based on
		Syntactic Similarities," unpublished

[Wong85]	Wong, S K M, and Ziarko, W, "On Generalized Vector Space
		Model In Information Retrieval," Ann Soc Math Series IV,
		Fundam Inf, v 8, no 2, p 253-267, (1985).

[Zamo81]	Zamora, E M, Pollack, J J, Zamora, A, "The Use of
		Trigram Analysis for Spelling Error Detection," 
		Information Processing and Management, vol 17 no 6
		(1981).

------------------------------
     
END OF IRList Digest
********************