Date: Thu, 21 Aug 86 09:00:46 edt
From: vtisr1!irlistrq
To: fox
Subject: IRList Digest V2 #37 (retransmitted after mailer problems)
Status: R

IRList Digest           Wednesday, 13 August 1986      Volume 2 : Issue 37

Today's Topics:
   Discussion - Work at Bellcore on Collins English Dictionary
   Abstracts - Appearing in latest issue of ACM SIGIR Forum, Part 1

----------------------------------------------------------------------

Date: Fri, 8 Aug 86 11:35:23 edt
From: amsler@mouton.bellcore.com (Robert Amsler)
Subject: Re: IRList Digest V2 #33 on Collins Dictionary Work

Collins Dictionary Work:

Work at Bellcore is proceeding with an effort to make a comprehensive
database format for the CED comparable to that prepared by Jim Peterson
for the W7 (Merriam-Webster Seventh Collegiate). The following is an
approximation to the format we intend to convert the data into.

Headword
   H1: H
   H2: Headword
   H3: Homograph Number
   H4: Syllabification (as numeric code)
   H5: Preferred Hyphenation (as numeric code)
   H6: Headword Part of Speech (n, vb, adj, symbol for)
   H7: Alternate Part of Speech

Alternate form of Headword (Inflectional and Variant forms)
   A1: A
   A2: Headword Alternate
   A3: " " Part of Speech (e.g. n.)
   A4: " " Inflection Type (e.g. pl.)
   A5: " "'s Primary Headword
   A6: " "'s Homograph Number
   A7: Type of Alternate Form (e.g. USA, for U.S. Spelling)

Pronunciation
   P1: P
   P2: Pronunciation
   P3: Type of Pronunciation (e.g. USA, for U.S.; French, etc.)

Label
   L1: L
   L4: Sense Number of this Label
   L5: Subsense Number of this label
   L3: Label
   L4: Type of Label (Temporal, Usage, Connotative, Subject, National Regional)

Definition
   D1: D
   D4: Definition Sense
   D5: " Subsense Letter
   D6: " " Part (signified by ;'s in definition text)
   D6: Part of Speech of (Sub)Sense
   D7: Definition label
   D8: " Text

Cross-Reference
   X1: X
   X2: Cross-Referenced Headword
   X3: "-" " Homograph Number
   X3: "-" " Sense
   X4: "-" " Subsense Letter
   X6: "-" Type (e.g. See, See Also, Also Called, etc.)
   X7: "-" Definition Text

Related Expressions (including Run-In and Run-On Entries)
   R1: R
   R2: Type of Related Expression (I = Run-in; O = Run-On)
   R2: Related Expression
   R3: " " Part of Speech (e.g. n.)
   R7: " "'s Primary Headword's SubSense Letter
   R8: " "'s Primary Headword's SubSense Part Number (;'s)

Citation or Example Sentence
   C1: C
   C2: Citation's Local Headword
   C3: Citation's Primary Headword
   C4: " " "'s Homograph Number
   C5: " " "'s Part of Speech
   C6: " " "'s Sense Number
   C7: " " "'s SubSense Letter
   C8: " " "'s SubSense Part Number (;'s)

Etymology
   E1: E
   E2: Primary Headword
   E3: " " Homograph Number
   E4: Century
   E5: Etymology Text

Usage Note Comments
   U1: U
   U2: Usage Primary Headword
   U3: " " " Homograph Number
   U4: Usage Note Text

Here is an entry from the B's in this format...

[Note: the example was deleted 8/21 since it apparently caused kermit
and UUCP to not agree to transfer this file! - Ed]

I would very much like to obtain a list of the decoded special symbols
in the CED, i.e. those represented by the sequential #800 numbers. These
appear to be unique assignments and are nothing but tedium to extract.

[Note: We have already extracted the data into a similar form and will
be sending that to the Oxford Text Archive soon. Since almost a year of
part-time effort, cleaning up data, editing by hand, etc. have been
involved, it might be wiser to wait for that. A MS project report by
R. Wohlwend documents much of the tape analysis effort. - Ed]

------------------------------

Date: Wed, 23 Jul 1986 13:06 CST
From: Vijay V. Raghavan
Subject: SIGIR FORUM Abstracts [Part 1 - Ed]

[Note: Members of ACM SIGIR should soon receive the spring/summer Forum,
and can find these on pages 30-31. The rest will appear in machine
readable form also in later issues of IRList. - Ed]

ABSTRACTS
(Chosen by G. Salton or V. Raghavan from 1984 issues of journals in the
retrieval area)

1. APPLICATION OF MODERN TECHNOLOGIES TO INTERLIBRARY RESOURCE-SHARING
   NETWORKS

J.
Francis Reintjes
Laboratory for Information and Decision Systems, Massachusetts Institute
of Technology, Cambridge, MA 02139

Examined in this article is the hypothesis that it is now
technologically and economically feasible to move the content of
documents electronically among nodes of a library network rather than
the documents themselves or photocopies thereof. Comparisons are made on
the basis of response-to-request time, quality of reproduced copy and
cost factors. The conclusion is reached that electronic interlibrary
resource-sharing networks are ideally suited to situations where there
are high frequency occurrences of internode requests for information
contained in serials, where nodal separation distances do not exceed a
few tens of miles and where copy is in six-point type or larger. A
three-node network is examined in detail. Specifications for each
element of the network are given, with emphasis placed on a highly
critical element, the bound-document scanner. The results of an economic
study of interlibrary electronic networks are also presented.
(JASIS, Vol. 35(1): 45-52; 1984)

2. CO-CITATION ANALYSIS AND THE INVISIBLE COLLEGE

Elliot Noma
CHI Research/Computer Horizons, Inc., 1050 Kings Highway North,
Cherry Hill, NJ 08034

Co-citation analysis is based on the assumption that all citing articles
view the scientific literature from a common point-of-view. When a
co-citation matrix is analyzed, this assumption affects measures of the
dimensionality and clustering of articles. Therefore, before a
co-citation matrix is constructed, the citing articles should be limited
to those written by individuals in an invisible college.
(JASIS, Vol. 35(1): 29-33; 1984)

3. LESS THAN FULL-TEXT INDEXING USING A NON-BOOLEAN SEARCHING MODEL

Donald B. Cleveland
School of Library and Information Sciences, North Texas State
University, Denton TX 76203
Ana D. Cleveland and Olga B.
Wise
Texas Woman's University, Denton TX 76204

The relative effectiveness of indexing using full-text or less than
full-text was tested using a non-Boolean, chaining type of file
structure and searching method. Indexing was done using titles,
abstracts, full-text, references, and various combinations of these
surrogates, and then Goffman's indirect method of information retrieval
was used to structure and search the file. The database consisted of 733
documents and 38 queries were searched. The hypothesis of the study was
that by using a particular non-Boolean method as a file structuring and
searching technique, full-text indexing is not essential to optimum
information retrieval effectiveness. The outcome of the study was
positive.
(JASIS, Vol. 35(1): 19-28; 1984)

4. STATISTICAL RECOGNITION OF CONTENT TERMS IN GENERAL TEXT

Martin Dillon
School of Library Science, University of North Carolina, Chapel Hill, NC
Peggy Federhart
Library, IBM Corporation, Charlotte, NC 28257

This article discusses ways to improve the quality of retrieval systems
that depend on the use of truncated words or quasi-word stems as an
indexing vocabulary. The problems addressed are the generalizability and
stability of discriminant function analysis for selecting good topical
terms from terms of relatively high frequency in a database drawn from
abstracts of Harris Survey press releases. Results confirm that topical
terms can be identified by their statistical properties. Consistently
high recall of topical terms under a variety of different conditions
implies persistent underlying properties strong enough to resist changes
in test environment.
(JASIS, Vol. 35(1): 3-10; 1984)

5. INFORMATION RETRIEVAL FROM CLASSICAL DATABASES FROM A
   SIGNAL-DETECTION STANDPOINT - A REVIEW

M. H. Heine
School of Librarianship & Information Studies, Newcastle upon Tyne
Polytechnic, UK

The retrieval of information from classical (object/attribute) databases
is discussed in the light of signal-detection theory.
The approach is based on the Swetsian schema, although it is expressed
in a more general form.
(Information Technology, Vol. 3, No. 2. 95-112, April 1984)

6. MAXIMUM ENTROPY AND THE OPTIMAL DESIGN OF AUTOMATED INFORMATION
   RETRIEVAL SYSTEMS

Paul B. Kantor
Tantalus Inc., Suite 218, 2140 Lee Road, Cleveland, Ohio 44118

The application of the maximum entropy principle is extended to problems
of information storage and retrieval. The extension includes continuous
or 'fuzzy' relevance valuations, fuzzy descriptors, and prior or
feedback constraints. A decomposition property of the entropy function
is used to express the total entropy in terms of the entropy of
nonoverlapping components. Each component is described by a richness
parameter which is determined by a set of coupled constraint equations
given in closed form. A method is outlined for solving those equations
in real time, and possible grounds for applying the maximum entropy
principle are explored. The relation to term weighting, and the
possibility of constructing rigorous relations between information and
effort, are also discussed.
(Information Technology, Vol. 3 No. 2 88-94 April 1984)

7. INFORMATION SCIENCE RESEARCH: THE SEARCH FOR THE NATURE OF
   INFORMATION

Manfred Kochen
Schools of Medicine and Business Administration, The University of
Michigan, Ann Arbor, MI 48109

High level scientific research in the information sciences is
illustrated by a sample of recent discoveries involving the design of
information-processing algorithms, bibliometric scaling, and flows of
information in biological systems and in countries. It is pointed out
that when the concept of information first assumed an independent
identity, the only known information processing systems were biological;
now, after four decades of vigorous development of electronic
information systems, the search for the essential nature of information
is focussing again on biological systems and on sociotechnological
systems as well.
(JASIS, Vol.
35(3): 194-199; 1984)

8. BRIEF HISTORY OF INFORMATION SCIENCE

Saul Herner
President, Herner and Company, 1700 North Moore Street, Arlington,
VA 22209

Information science is the product of convergences of library science,
computer and punched card science, R & D documentation, abstracting and
indexing, communications science, behavioral science, micro- and
macro-publishing, video and optical science, and various other fields
and disciplines. The role and contribution of each participating segment
is reflected in certain basic and seminal writings, in the work of
"major actors" in the field, and in major events or developments. These
contributing sources are reviewed, analyzed, and related, as a means of
tracing the history of the field, from its pre- and post-World War II
beginnings to the early 1980's, to the near-term future.
(JASIS, Vol. 35(3): 157-163; 1984)

9. A NOTE ON THE USE OF NEAREST NEIGHBORS FOR IMPLEMENTING SINGLE
   LINKAGE DOCUMENT CLASSIFICATIONS

Peter Willett
Department of Information Studies, University of Sheffield, Western
Bank, Sheffield S10 2TN, United Kingdom

Best match search algorithms provide an efficient means of identifying
the sets of nearest neighbors for each of the documents in a collection.
These sets contain much of the important similarity data contained in a
full interdocument similarity matrix and may be used for the generation
of hierarchic document classifications, such as those arising from the
use of the single linkage clustering method. Cluster based retrieval
experiments based upon such classifications are shown to give results
that are comparable in effectiveness with those obtained using the full
similarity matrix.
(JASIS, Vol. 35(3): 149-152; 1984)

------------------------------

END OF IRList Digest
********************