IRList Digest            Thursday, 7 April 1988      Volume 4 : Issue 17

Today's Topics:
   Query - Search engine usable for literary-critical and linguistic study
         - Graphical display of online thesauri
         - System design for vertical format recording
         - Info. on recent events relating to IR
         - IR and AI - work on parsing?
         - NPL collection, commercial retrieval techniques
         - CED Prolog fact base, CD-ROM standards

News addresses are Internet or CSNET: fox@vtopus.cs.vt.edu
                   BITNET: foxea@vtvax3.bitnet

----------------------------------------------------------------------

Date: Sun, 6 Mar 1988 10:50 CST
From: Robin C. Cover
Subject: ACM SIGIR FORUM (posting for textual-studies search engine)

Dear Professor Fox,

Appended below is a revised/corrected version of a posting I sent last night for IRList. I later uncovered one of the general information sheets for IRList and discovered that there is something called SIGIR FORUM, which is connected with the ACM. I believe that the audience I wish to reach with this posting is the group of computational linguists who do textual studies in a sort of "humanities" arena. In any case, if you could post the message on the SIGIR FORUM, I would be most grateful. . . .

[Note: I trust that Bill Frakes and Vijay Raghavan will pick this news item up for the next Forum - they are the current co-editors and will sometime give their addresses and a plea for materials in IRList, I am sure. - Ed.]

Many thanks if you can help.

Professor Robin C. Cover

===========================================================

I'm looking for a search engine which probably does not exist, but I would like advice from those more knowledgeable about text retrieval systems. What I want is a text retrieval system optimized for literary-critical and linguistic study. The major requirements for the search engine are as follows:

(1) Literary texts should be "understood" by the system in terms of the individual document's structure, as indicated by markup elements.
The user should be able to specify within a search argument that proximity values, positional operators, comparative operators and logical operators govern the search argument and the textual units to be searched IN ACCORDANCE WITH THE HIERARCHICAL STRUCTURE OF THE DOCUMENT. That is, if a document is composed of books, chapters, pericopes, verses and words, then expressions within the search argument must be able to refer to these particular textual units. If another document (or the *same* document, viewed under a different hierarchical structure) contains chapters, paragraphs, sub-paragraphs (strophes), sentences and words, then expressions in the search argument should be framed in terms of those textual units. To borrow a definition of "text" from the Brown-Brandeis-Harvard CHUG group: the text retrieval system must be capable of viewing each of its documents or texts as an "ordered hierarchy of content objects" (OHCO).

(2) The database structure must be capable of supporting annotations (or assigned attributes) at the word level and, ideally, at any higher textual level appropriate to the given document. Most record-based retrieval systems cannot accommodate the word-level annotations that textual scholars or linguists would like to assign to "words." More commonly, if such databases can be modified to accommodate annotations at the word level, the record-field structure is thereby contorted in ways that introduce new constraints on searching (the inability to span record boundaries, for example). Preferably, even the definition of "word" ought not to be hard-coded into the system. Hebrew, for instance, contains "words" (graphic units bounded by spaces) which may embody three or four distinct lemmas. Minimally, the database must support annotations at the word level (e.g., to account for the assignment of lemma, gloss, morphological parse, syntactic function, etc.) and these annotations must be accessible to the search engine/argument.
Though not absolutely required, it is desirable that attributes could also be assigned to textual units above "word," and such attributes should be open to specification in the search argument. Linguists studying discourse, for example, might wish to assign attributes/annotations at the sentence or paragraph level.

(3) The search engine should support the full range of logical operators (Boolean AND, OR, NOT, XOR); user-definable proximity values (within the SAME, or within "n", textual units of various levels); user-definable positional operators (precedence relations governing expressions or terms within the search argument); and comparative operators (for numerical values). The search argument should permit nesting of expressions by parentheses within the larger Boolean search argument. Full regular-expression pattern matching (as in grep) should be supported, as well as macro (library/thesaurus) facilities for designating textual corpora to be searched, discontinuous ranges or text-spans within documents, synonym groups, etc. Other standard features of powerful text retrieval systems are assumed (set operations on indices; session histories; statistical packages; etc.).

Most commercial search engines I have evaluated support a subset of the features in (3), but do very poorly in support of (1) and (2). The text retrieval systems which claim to be "full text" systems actually have fairly crude definitions of "text": they attempt to press textual data into rigid record-field formats that do not recognize hierarchical document structures, or are not sufficiently flexible to accommodate a wide range of document types. Three commercial products which attempt to support (1) are WORDCRUNCHER, Fulcrum Technologies' FUL-TEXT and BRS-SEARCH. I know of no systems which intrinsically support requirement (2), though LBASE perhaps deserves a closer look, and a few other OEM products promise this kind of flexibility.
It may be possible to press FUL-TEXT or BRS-SEARCH into service, since both have some facility for language definition. Another promising product is the PAT program being developed at the University of Toronto in connection with the NOED (New Oxford English Dictionary). But I may have overlooked other commercial or academic products which are better suited for textual study, or which could be enhanced/modified in some fashion other than a bubble-gum hack. It is not necessary that a candidate possess all of the above features, but the basic design should be compatible with extending the system to support these functional specs, and the developers should be open to program enhancements. Ideally, such a system would work with CD-ROM, though this is not an absolute requirement. I would like good leads of any kind, but particularly products that could be leased/licensed under an OEM agreement...for microcomputers, I should add.

Thanks in advance to anyone who can suggest names of commercial packages or academic software under development which meet the major requirements outlined above, or which could be *gently* bent to do so. I will be happy to report a summary of responses to this posting if there is general interest.

Professor Robin C. Cover
ZRCC1001@SMUVM1.bitnet
3909 Swiss Avenue
Dallas, TX 75204
(214) 296-1783, 824-3094

[Note: There are many people interested in this type of system. Two groups working on related efforts are the folks at UNC Chapel Hill (including Steve Weiss and John Smith) and the group at the U. of Chicago (including Scott Deerwester in the Graduate Library School). There are also efforts at Bellcore and OCLC. I hope some of the people involved will reply to you and that you can summarize your observations. - Ed.]
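To make the three requirements above concrete, here is a minimal sketch, in Python, of what such a system's core might look like. None of the products named above are being described; every name in the code (Node, and_within, the sample Hebrew data) is invented purely for illustration. It shows a document held as an ordered hierarchy of content objects (requirement 1), words carrying multiple lemmas as annotations (requirement 2), and a Boolean AND scoped to a user-chosen textual unit (requirement 3).

```python
# Illustrative sketch only - not the design of any product mentioned above.
# (1) A document as an "ordered hierarchy of content objects";
# (2) annotations (e.g. lemmas) attached at the word level;
# (3) a Boolean AND whose proximity scope is a named textual unit.

class Node:
    def __init__(self, unit, children=(), text=None, **annotations):
        self.unit = unit                  # "chapter", "verse", "word", ...
        self.children = list(children)
        self.text = text                  # surface form, for word nodes
        self.annotations = annotations    # e.g. lemma=(...), parse="..."

def units(node, unit):
    """Yield all nodes of the given textual unit, in document order."""
    if node.unit == unit:
        yield node
    for child in node.children:
        yield from units(child, unit)

def and_within(doc, term_a, term_b, unit):
    """Return each node of the given unit whose words match both terms,
    where a term matches by surface form or by any assigned lemma."""
    def matches(word, term):
        return word.text == term or term in word.annotations.get("lemma", ())
    return [scope for scope in units(doc, unit)
            if any(matches(w, term_a) for w in units(scope, "word"))
            and any(matches(w, term_b) for w in units(scope, "word"))]

# A Hebrew graphic unit may embody several lemmas (requirement 2).
doc = Node("chapter", [
    Node("verse", [Node("word", text="bereshit", lemma=("b", "reshit")),
                   Node("word", text="bara", lemma=("br'",))]),
    Node("verse", [Node("word", text="bara", lemma=("br'",))]),
])

hits = and_within(doc, "reshit", "bara", "verse")
print(len(hits))  # 1: only the first verse contains both terms
```

Because the hierarchy is data rather than a fixed record layout, the same search can be rescoped simply by naming a different unit ("chapter" instead of "verse"), which is the flexibility the posting asks for.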
------------------------------

Date: Mon, 7 Mar 88 08:45:14 CST
From: Richard Pollard
Subject: IRList Digest

In "Connecting and Evaluating Thesauri: Issues and Cases," International Classification 14(2), 1987, Roy Rada writes: "MeSH has been placed on several computers at NLM for the exploration of graphic interfaces to MeSH.... the user can enter a term and be graphically shown the hierarchy of terms around it or can choose to traverse the thesaurus from top to bottom by mouse-activating terms on the screen."

Two questions:
1) Are there any reports available from NLM about this work?
2) Does anyone know of any other work going on in the area of graphical display of online thesauri?

Richard Pollard
School of Library and Information Science
University of Wisconsin--Milwaukee
P.O. Box 413
Milwaukee, WI 53201

[Note: Roy Rada does not have access to his mail at this time - I hope another at NLM will reply on his behalf. I think Don Crouch may be doing some related work. Anyone else recall relevant projects? - Ed.]

------------------------------

Date: 16 Mar 88 01:20:56 GMT
Subject: Submission for comp-theory-info-retrieval
From: randolph@mfbbs.UUCP (Verle Randolph)
Newsgroups: comp.theory.info-retrieval
Subject: Info wanted: vertical format recording
Sender: marc@mfbbs.UUCP
Reply-To: randolph@mfbbs.UUCP (Verle Randolph)

Does anyone have any information on total system design for vertical format recording? PLEASE reply via mail.
---
Verle Randolph
UUCP: ...!rutgers!pbox!romed!mfbbs!randolph

------------------------------

Date: Fri, 18 Mar 88 17:27 EDT
From: LEWIS@UMass
Subject: more info, anyone?

Dear IRLIST:

There were several things in issue 4.9 I'd be interested in seeing follow-ups posted on, if anyone has the time. To wit:

--If someone attended the Microsoft CD-ROM conference, I'd be interested in hearing about that. In particular, what's up with ACM and database products, and what is the "new Full Text SIG"?

[Note: I attended, as did some 2000 others.
I really liked the talk by Henry Kucera of Brown on "The Texture of Text". We did a demo of Virginia Disc 1, which should be out within the next month. There was a SIGIR session which included presentations by several people from ACM headquarters about ACM and its activities in CD-ROM and electronic publishing. A very nice technical discussion of CD-ROM performance modeling work at the U. of Waterloo was also given by D. Ford. There are lots of CD-ROM products now, though few really innovative retrieval systems. CD-I and DVI are still progressing. There is a big push now for multimedia programming, and tools for more conventional CD-ROM authoring are becoming available. - Ed.]

--Anyone read the Goldmann book "Online Research and Retrieval with Microcomputers" and want to post a review?

--Anyone attend the Salton & McGill discussion on why IR research doesn't get applied, and want to post a summary?

Someday I'll attend something and summarize--I promise!

David D. Lewis
COINS Dept.
University of Massachusetts, Amherst
Amherst, MA 01003
CSNET: lewis@cs.umass.edu
BITNET: lewis@umass

------------------------------

Date: 21 Mar 88 15:16:00 EST
From: James (J.G.) Borynec
Subject: Information Retrieval and AI

Sir,

I am currently investigating the linkage between information retrieval and natural language understanding. On the face of it, it doesn't seem that hard to parse technical documentation (at least to the noun-phrase level), create some logical representation for each phrase, and use it for both indexing and retrieval. This parsing wouldn't have to be very sophisticated for it to be much better than the current Boolean keyword techniques (I hope!). Have you ever heard of any activities related to this idea? Do you know of any good references? I am trying to gather enough information to convince my boss (Dan Zlatin) that this is a worthwhile activity for me to pursue.

Thanks in advance ...
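The idea in the message above - parse to the noun-phrase level, then index the resulting phrases instead of bare keywords - can be caricatured in a few lines of Python. This is a toy illustration only: the part-of-speech tagger is stubbed out with a tiny invented lexicon, and a real system would need genuine tagging and parsing.

```python
# Toy sketch of noun-phrase indexing. The lexicon is an invented stub;
# real noun-phrase extraction requires an actual part-of-speech tagger.

LEXICON = {"the": "DET", "a": "DET", "fast": "ADJ", "fourier": "ADJ",
           "transform": "N", "algorithm": "N", "computes": "V"}

def noun_phrases(tokens):
    """Greedily collect DET? ADJ* N+ token runs as candidate noun phrases."""
    phrases, i = [], 0
    while i < len(tokens):
        j = i
        if LEXICON.get(tokens[j]) == "DET":
            j += 1
        while j < len(tokens) and LEXICON.get(tokens[j]) == "ADJ":
            j += 1
        k = j
        while k < len(tokens) and LEXICON.get(tokens[k]) == "N":
            k += 1
        if k > j:                      # at least one noun: record the phrase
            phrases.append(" ".join(tokens[i:k]))
            i = k
        else:
            i += 1
    return phrases

print(noun_phrases("the fast fourier transform algorithm computes".split()))
# ['the fast fourier transform algorithm']
```

Indexing whole phrases like this, rather than the individual keywords "fast", "transform" and "algorithm", is what would distinguish such a system from Boolean keyword retrieval.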
James Borynec (BORYNEC@BNR.BITNET)

[Note: There has been ongoing interest in this type of approach. One presentation at RIAO about work at MIT was provocative, but I did not see quantitative measures of the results. Sparck Jones & Tait, Smeaton, Fagan, Croft (and several students) and others have all done work in recent years - please forgive me for not mentioning others, and send in some replies about your activities! - Ed.]

------------------------------

Date: Wed, 23 Mar 88 20:12 EDT
From: LEWIS@UMass
To: foxea@vtvax3.bitnet
Subject: queries on NPL collection, commercial retrieval techniques

Dear IRLIST readers:

Query 1: Does anyone out there have access to the original text of the queries and documents in the NPL test collection? Our only copy has had punctuation, numerals, and abbreviations removed--fine for keyword-based retrieval, but not desirable for parsing! Also (and this is probably wishful thinking), has anyone ever written up a layman's description of what the standard queries mean? (What the heck are "multiple digit techniques in font decimal address"?!) In a number of cases it's hard for me even to be sure where attachment in a syntactic parse should be. (If someone's got parse trees for the queries, that would be useful, too.)

Query 2: Does anyone know of a recent work summarizing the retrieval techniques that are actually in use in operational retrieval systems, especially by on-line database companies? The most recent thing I've seen is some material in Salton & McGill's 1983 book.

Many thanks,

David D. Lewis
COINS Dept.
University of Massachusetts, Amherst
Amherst, MA 01003
CSNET: lewis@cs.umass.edu
BITNET: lewis@umass

------------------------------

Date: Tue, 5 Apr 88 10:50 N
From:
Subject: CED Prolog Fact Base

Dear Ed,

As you may remember, I have a copy of the fact base tape you sent to Oxford. It was the first tape you sent them (in case you have sent more since), and I was probably the first to order it, as my order was in Oxford before the tape.
Up until last week we only used the material for incidental checking, as the information we wanted (syntactic properties) was not in a usable form (only the part of speech is explicitly present). Last week, however, having nothing better to do in the last few days before Easter, I tried to transform the data into something more accessible. I aborted the attempt after a while, but will try again. Below is a description of where I went wrong and some inconsistencies I found. You have probably found them yourself by now, but then again you might not. . . .

Those were the consistent glitches I found. I would be interested in newer or reformatted editions of the fact base. Please keep me informed. One more question about the fact base: do you actually have a Prolog system which can contain all this stuff?

[Note: Thanks for your interest and comments! We are aware of many cases where editing and other changes were needed, but because many people requested a copy, we sent out that rough version. We have been working on the files in the meanwhile and will be sure to include the corrections you suggest, though we have caught quite a number already. We have access routines written for some of the information and are continuing that effort to integrate the data into a lexicon for the CODER system. We are trying to revise NU-Prolog, available from Melbourne University, to include all the changes we made to MU-Prolog so that CODER will work. It appears that NU-Prolog can handle such a large Prolog fact base. In the meanwhile, we have loaded some of the data into a B-tree using the C-tree (TM) software. - Ed.]

Then, as I am writing to you anyhow, something else. You have been mentioning a CD-ROM to be brought out by your group. How compatible is the CD-ROM scene at the moment? Will CDs formatted for, let's say, an IBM PC be readable on a Mac, on an ATARI ST or on a VAX? What system will your CD be for?
[Note: There is an ISO standard for CD-ROM file formats, so though we are aiming our CD-ROM at IBM PC or similar systems, it should be usable on any "conforming" CD-ROM drive and computer. - Ed.]

Best wishes,

Hans van Halteren
COR_HVH @ HNYKUN52

------------------------------

END OF IRList Digest
********************