IRList Digest           Sunday, 28 August 1988      Volume 4 : Issue 47

Today's Topics:
   Discussion - Stemming (recall V4 #42)
   Announcement - SGML Standard for Machine Readable Dictionaries Workshop
   Abstracts - Dissertations selected by S. Humphrey [Part 4 of 5]

News addresses are
   Internet: fox@fox.cs.vt.edu or fox%fox.cs.vt.edu@dcssvx.cc.vt.edu
   BITNET: foxea@vtcc1.bitnet (replaces foxea@vtvax3)

----------------------------------------------------------------------

Date: 08. August 1988, 18:02:02 (CET)
From: XID2FUHR@DDATHD21.BITNET (Norbert Fuhr)
Subject: ... Comment on Stemming

Dear Ed,
...
Reply to Donna Harman's comments on stemming in IRLIST #42:

I do not agree with Donna Harman's comments on the absence of any quality differences between different stemming algorithms. We have been working with two kinds of stemming and have found them very useful:

- The first algorithm reduces nouns to their singular form and verbs to their infinitive form. We call the result the standard form.

- The second algorithm is similar to the one used in the SMART system and reduces all words to their stem, e.g. computation, computing and computer to 'comput'.

Unfortunately the algorithms have been published only in German:
R. Kuhlen: Experimentelle Morphologie in der Informationswissenschaft. Verlag Dokumentation, Muenchen, 1977.

The point is that you have to assign different weights to the terms in the documents according to the stemming algorithm employed: When you have the term 'computers' in your query and you find 'computer' in a document, that is, both terms agree in the standard form, then you should assign a higher weight than in the case where you find 'computation' in a document, so that the terms only have equal word stems.

Now the only problem is to assign proper weights in the different cases. We have described our approach in the paper presented at SIGIR88: "The Automatic Indexing System AIR/PHYS - from Research to Application" (Biebricher et al.).
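
The weighting idea described above can be sketched roughly as follows. This is only a toy illustration, not the AIR/PHYS implementation: the lookup tables stand in for the two stemming algorithms (they are not Kuhlen's procedures), and the names `standard_form`, `stem`, `match_weight` and the two weight values are assumptions made up for the example.

```python
# Toy illustration only: the tables below are hypothetical stand-ins for
# real morphological analysis, and the weights are arbitrary placeholders.

STANDARD_WEIGHT = 1.0  # assumed weight when the standard forms agree
STEM_WEIGHT = 0.5      # assumed (lower) weight when only the stems agree

# Hypothetical "standard form": singular nouns, infinitive verbs.
STANDARD_FORM = {
    "computers": "computer",
    "computer": "computer",
    "computation": "computation",
    "computing": "compute",
}

# Hypothetical word stems, in the style of the 'comput' example.
STEM = {
    "computers": "comput",
    "computer": "comput",
    "computation": "comput",
    "computing": "comput",
}


def standard_form(word):
    """Return the standard form, falling back to the lowercased word."""
    return STANDARD_FORM.get(word.lower(), word.lower())


def stem(word):
    """Return the word stem, falling back to the lowercased word."""
    return STEM.get(word.lower(), word.lower())


def match_weight(query_term, doc_term):
    """Weight a query/document term pair by the level at which they agree."""
    if standard_form(query_term) == standard_form(doc_term):
        return STANDARD_WEIGHT
    if stem(query_term) == stem(doc_term):
        return STEM_WEIGHT
    return 0.0


print(match_weight("computers", "computer"))     # agree in standard form
print(match_weight("computers", "computation"))  # agree only in the stem
```

How the two weights should actually be chosen is exactly the open problem the letter goes on to mention; the constants here carry no empirical meaning.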
A more theoretical description and possible applications are outlined in the forthcoming paper "Models for Retrieval with Probabilistic Indexing" (by N. Fuhr), which will appear in Information Processing and Management.

Kind regards,
Norbert

------------------------------

Date: Fri, 12 Aug 88 19:10:42 EDT
From: Robert A Amsler
Subject: Workshop Announcement

DICTIONARY ENCODING INITIATIVE

A ONE-DAY WORKSHOP ON THE DEVELOPMENT OF AN SGML STANDARD
FOR MACHINE-READABLE DICTIONARIES

Hosted by Robert A. Amsler and Frank Wm. Tompa

Wednesday, October 26, 1988, 10 AM - 5 PM
(the day before the 1988 Waterloo Conference: Information in Text)
Davis Building, University of Waterloo, Ontario, Canada

The development of a text standard for the interchange of machine-readable lexical entries is seen as an essential step toward making such information useful to future generations of computational scientists and scholars. Whereas several ad hoc schemes for encoding dictionary entries exist, and even larger numbers of idiosyncratic typesetting formats exist, there is an increasing number of variants of such formats being propagated through the research community. Without the introduction of some standard formats for the interchange of such information, both the publishing and research communities will suffer.

A preliminary draft of such an interchange standard for encoding machine-readable English monolingual dictionary entries has been developed in Standard Generalized Markup Language (SGML). This workshop will present the contents and rationale for this standard and offer attendees the opportunity to join the Dictionary Encoding Initiative to refine and complete the standard.

We are both inviting your commentary and soliciting your help in attempting to make the resultant standard serve the needs of all researchers.

If you are able to attend the workshop, please reply via email or postal mail to:

   Robert A. Amsler
   Dictionary Encoding Initiative Workshop
   Bellcore, MRE 2D-398
   445 South Street
   P.O. Box 1910
   Morristown, NJ 07960-1910, USA

   email: amsler@flash.bellcore.com
          uunet.uu.net!bellcore!amsler

------------------------------

Date: Wed, 3 Aug 88 13:36:58 EDT
From: "Susanne M. HUMPHREY"
Subject: dissertation abstracts

[Note: Part 4 of 5 - Ed.]

.[
AN University Microfilms Order Number ADG88-04609.
AU FAGAN, JOEL L.
IN Cornell University Ph.D 1988, 278 pages.
TI EXPERIMENTS IN AUTOMATIC PHRASE INDEXING FOR DOCUMENT RETRIEVAL: A COMPARISON OF SYNTACTIC AND NONSYNTACTIC METHODS.
DE Information Science.
AB In order for an automatic information retrieval system to effectively retrieve documents related to a given subject area, the content of each document in the system's database must be represented accurately. This study examines the hypothesis that better representations of document content can be constructed if the content analysis method takes into consideration the syntactic structure of document and query texts. Two methods of automatically generating phrases for use as content indicators have been implemented and tested experimentally. The non-syntactic (or statistical) method is based on simple text characteristics such as word frequency and the proximity of words in text. The syntactic method uses augmented phrase structure rules (production rules) to selectively extract phrases from parse trees generated by an automatic syntactic analyzer. Experimental results show that the effect of non-syntactic phrase indexing is inconsistent. For the five collections tested, increases in average precision ranged from 22.7% to 2.2% over simple, single term indexing. The syntactic phrase indexing method was tested on two collections. Precision figures averaged over all test queries indicate that non-syntactic phrase indexing performs significantly better than syntactic phrase indexing for one collection, but that the difference is insignificant for the other collection.
More detailed analysis of individual queries, however, indicates that the performance of both methods is highly variable, and that there is evidence that syntax-based indexing has certain benefits not available with the non-syntactic approach. Possible improvements of both methods of phrase indexing are considered. It is concluded that the prospects for improving the syntax-based approach to document indexing are better than for the non-syntactic approach. The PLNLP system was used for syntactic analysis of document and query texts, and for implementing the syntax-based phrase construction rules. The SMART information retrieval system was used for retrieval experimentation. This thesis is available as a technical report from the Department of Computer Science, Cornell University.
.]

.[
AN University Microfilms Order Number ADG88-02784.
AU JACOBS, SHEILA MAUREEN.
IN Arizona State University Ph.D 1987, 175 pages.
TI HYPOTHESIS-CONFIRMING INFORMATION SEARCH STRATEGIES AND COMPUTERIZED INFORMATION RETRIEVAL SYSTEMS.
DE Information Science.
AB A recent trend in information retrieval systems technology is the development of on-line information retrieval systems. One objective of these systems has been to attempt to enhance decision effectiveness by allowing users to preferentially seek information, thereby facilitating the reduction or elimination of information overload. These systems do not necessarily lead to more effective decision making, however. Recent research in information search strategy suggests that when users are seeking information subsequent to forming initial beliefs, they may preferentially seek information to confirm these beliefs. Therefore, decision making effectiveness may be dependent on the accuracy of the decision maker's initial hypothesis of causality.
It seems that effective computer-based decision support requires an information retrieval system capable of: (a) retrieving a subset of all available information, in order to reduce information overload, and (b) supporting an information search strategy that considers all relevant information, rather than merely hypothesis-confirming information. An information retrieval system with an expert component (i.e., a knowledge-based DSS) should be able to provide these capabilities. The basic research question is: Will the use of a KBDSS, designed to search for and present both confirming and disconfirming evidence, result in enhanced decision effectiveness? Enhanced decision effectiveness is defined, in this study, as a significant change to the initial attribution of causality for a described problem. To assess the effect of information retrieval system type on decision effectiveness, a laboratory experiment was conducted. Participants were presented with brief work histories describing a job performance problem and suggesting a cause for the problem. They were required to make an initial attribution of causality for the problem, to query either a conventional on-line information retrieval system or a KBDSS for additional information, and then to make a final attribution of causality. The results of this study are not conclusive; there was neither strong confirmatory evidence nor strong disconfirmatory evidence regarding the effectiveness of the KBDSS. Further research on this type of decision aid is needed before definite recommendations can be made regarding the design of computer-based decision aids that support preferred information search strategies.
.]

.[
AN University Microfilms Order Number ADG87-27638.
AU NARA, HIROSHI.
IN University of Kansas Ph.D 1987, 201 pages.
TI MODULAR DENOTATIONAL SEMANTICS IN A ROBUST NATURAL LANGUAGE FRONT-END TO A RELATIONAL DATABASE.
DE Language, Linguistics.
AB This dissertation describes the details of a robust and transportable natural language interface to a relational database. Called the English Database Access and Management System (EDAMS), it differs from many other Natural Language Interfaces (NLIs) in that the parser and the semantic component work in tandem so that, as soon as a denoting expression is parsed, the corresponding semantics is given to it. These two components communicate with each other very closely, until the parse for the entire input string is successfully interpreted. The emphasis of the dissertation is the design and implementation of the semantic component. The semantics of a basic expression is given by first reducing it to a procedure in SQL/DML Emulator, which is executed to compute the referent of the expression. The COMPOSE module assembles the referents of basic expressions and builds the denotation of progressively larger derived expressions, ultimately giving the semantics to the entire input. In the implementation of the semantic component, special attention is paid to the semantic analysis of measure adjectives, noun compounds, and quantifiers. In the analysis of these adjectives, their meanings are procedurally defined, and semantically complex adjectives are decomposed into more elementary attributes found in the database. Noun compounds are given interpretation by way of 'semantic connectedness.' The system works well with a multi-file relational database, responds satisfactorily to syntactically deviant and telegraphic queries for improved robustness, and has the ability to detect denotationally empty expressions early in the parsing process and to use this information to reject unfruitful parses. The dissertation concludes with an evaluation of EDAMS, possible ways to enhance reference and composition algorithms, and possible extensions to the present system. 
EDAMS offers many amenities: an interactive module to register, view, and manipulate compounds, alternate spellings, synonyms, and abbreviations, facilities for both interactive and batch processing of queries, a spelling checker, an ATN compiler, interactive access to domain dependent information, a system access manager for controlled access to EDAMS, a dictionary access manager, facilities for historical databases, and facilities to permit hierarchical data to reside in the relational database.
.]

Date: Wed, 3 Aug 88 15:54:17 EDT
From: humphrey@mcs.nlm.nih.gov (Susanne M. HUMPHREY)
Subject: Re: dissertation abstracts

Ed, I noticed a typo. The line:
   IN University of California, Los Angeles Ph.Do 1987, 219 pages.
should be:
   IN University of California, Los Angeles Ph.D 1987, 219 pages.
--Susanne

Date: Fri, 5 Aug 88 14:03:38 EDT
From: "Susanne M. HUMPHREY"
Subject: a few more

Ed, another bunch. This will probably be it for a while.
--Susanne

.[
AN University Microfilms Order Number ADGD--80478.
AU YOON, CHOON SUP.
IN University of Edinburgh (United Kingdom) Ph.D 1987, 325 pages.
TI A HOUSING INFORMATION SERVICE: A SYSTEMATIC APPROACH TOWARDS THE EFFECTIVE USE OF STRUCTURED BUILDING APPRAISALS IN THE DESIGN OF NEW HOUSING.
DE Architecture.
AB Available from UMI in association with The British Library. Requires signed TDF.
This study is concerned with the search for workable improvements in the design of housing schemes by means of feedback obtained through the appraisal and measurement of performance of existing housing schemes. Feedback information is seldom fully utilised by designers. This is due, on the one hand, to the scattered and disorganised nature of feedback information sources and, on the other, to the general lack of exchange of experience and information between designers. Valuable experience gained from past projects is thereby often wasted, resulting in the tendency to repeat mistakes and to overlook the existence of proven solutions. There is, then, a serious need for access to sources of relevant information, enabling us to find simply and precisely what we want without continual reference to colleagues or written sources. This can only be achieved where there is provision for the structuring of feedback information, ensuring its easy retrieval in a form that can be readily used. To this end, this thesis proposes a computerised housing information service which will process feedback information derived from the analysis and appraisal of existing housing schemes. Furthermore, this thesis explores whether the establishment of such a housing information service on a national scale would be both a desirable and viable proposition. Discussion of the conceptual and technical specifications for the proposed service is followed by the description of a small pilot demonstration system, developed to appraise potential user acceptance. The results of a series of system demonstrations are analysed.
.]

.[
AN University Microfilms Order Number ADG88-02472.
AU BRICKER, ROBERT JAMES.
IN Case Western Reserve University Ph.D 1987, 392 pages.
TI AN EMPIRICAL INVESTIGATION OF THE INTELLECTUAL STRUCTURE OF THE ACCOUNTING DISCIPLINE: A CITATIONAL ANALYSIS OF SELECTED SCHOLARLY JOURNALS, 1983-1986.
DE Business Administration, Accounting.
AB This study empirically investigated the intellectual structure and knowledge accumulation of the scholarly accounting discipline. A model of competition in the research environment, entitled the Research Markets Model, was synthesized from existing literature and used as the basis for the hypothesis formation. It was hypothesized that the accounting discipline could be represented by a model portraying an arrangement of many research areas which recursively nest together to form larger research areas at more general levels of association. This model formed an intellectual structure and consisted of two components--a representational structure which is a syntactic expression of the intellectual structure, and intellectual content which is a semantic expression of the intellectual structure. A representational structure was inferred through the application of cocitation clustering to a sample of published accounting literature. The analysis was based on a data sample consisting of nearly 11,000 citations drawn from the main journal articles of six mainstream scholarly accounting journals between 1983 and early 1986. The resulting structure was validated using Multiple Discriminant Analysis. The intellectual content of this representational structure was established through content analysis and bibliometric methods. The representational structure and intellectual content results supported the intellectual structure hypothesis. The integration of the accounting discipline was tested by examining accounting interdisciplinary citation patterns and the structure of the inferred representational structure. The results showed both a lack of structural integration and a disproportionately large reliance upon interdisciplinary models and theories. This suggests that accounting lacks the level of integration shown by other disciplines. 
The hypothesis that accounting scholars employ a scientific approach to knowledge accumulation was tested by examining accounting citation age patterns. The results suggested that accounting does not accumulate knowledge as scientifically as other social sciences. A systematic bias precluded a firm conclusion. This research is the first attempt to provide an empirical and replicable approach to determining a structure of the accounting discipline. Extensions and innovations to existing methods of analysis were developed during the course of this research. The results demonstrate the existence of numerous individual research areas and their interrelationships, which may help students and scholars understand the accounting discipline.
.]

.[
AN University Microfilms Order Number ADG88-03957.
AU CHANG, PHILIP YEN-TANG.
IN The University of Utah Ph.D 1987, 163 pages.
TI OPTIMIZATION TECHNIQUES FOR RELATIONAL DATABASE SYSTEMS.
DE Computer Science.
AB Efficient implementation of relational database systems has been a difficult problem noted by many researchers and system implementers. In a relational database system, the efficiency-related factors are deliberately hidden from the user. With complete freedom for specifying queries, the users can easily formulate queries that are extremely expensive if implemented directly. It is therefore necessary for a relational database system to include a "query optimizer" in order to improve the efficiency of query execution. This dissertation uses an "automatic programming" approach to develop a "framework" for relational database query optimization. A set of specific techniques is also developed to illustrate how this framework can be applied to different database environments. Three kinds of optimization form the basis of the framework: query transformation, binding and run-time processing. Query transformation rewrites user queries into equivalent queries that are more efficient to implement.
Binding techniques are used to select the best algorithm for each relational operator. Run-time processing techniques include pipelining for parallel execution and information feedback for re-evaluation of earlier implementation decisions. It is shown that by applying these three techniques in different degrees, one can design optimizers to fit different system requirements. It is also shown that the framework is general enough as a basis for the comparison of many optimizers developed by others.
.]

[Note: continued in next issue - Ed]

------------------------------

END OF IRList Digest
********************