IRList Digest           Thursday, 6 August 1987      Volume 3 : Issue 21

Today's Topics:
   Email - Problems, plans for IRList
   Address - Dr. M.B. Koll, Personal Library Software
   Query - Contact for obtaining SMART
         - Suggestions for providing online access to Canadian Tax Act
   Seminar - Responsa system demonstration
           - Short-context disambiguation in large text databases

News addresses are ARPANET: fox@vtopus.cs.vt.edu BITNET: foxea@vtvax3.bitnet
   CSNET: fox@vt  UUCPNET: seismo!vtisr1!irlistrq

----------------------------------------------------------------------

Date: Thu, 6 Aug 87 15:50:33 edt
From: fox (Ed Fox)
Subject: electronic mail problems and plans relating to IRList

1. Recent problems

Two weeks ago we had lightning hits that caused around $40K of damage
to our departmental computers. The machine that IRList is usually
composed on was down for that period, so it has been difficult to get
news out. I will attempt to catch up on this in the next week. If you
sent in news and it does not appear soon, please send your
communication in again, since some messages were lost. I apologize for
any inconvenience.

2. Disappearance of seismo as UUCP connection

By 1 September, the machine called "seismo" that is at the Center for
Seismic Studies will stop serving as a polling center for UUCP mail.
Please stop using seismo!vtisr1!fox as a UUCP address to reach me. We
will have our machine "vtopus" connected to several other UUCP
machines, so fox@vtopus.uucp or an address with the appropriate route
should work as a replacement. I do not encourage UUCP traffic, but if
it is necessary, use vtopus!fox rather than vtisr1!fox since vtisr1 is
becoming more isolated than before.

3. Connection to the ARPANET

By early September there will be some changes, hopefully improvements,
to IRList mail handling. The main point is that our machine "vtopus"
will eventually become the central point for all IRList business.
Virginia Tech is now part of SURANET, which is part of NSFNET, and so
we are on the DARPA Internet. When we get all the addressing and other
software issues corrected, vtopus will be accessible for FTP and other
services. I will post information when it is available and when we
have finished testing. At that time, people who want access to back
issues in quantity will be able to get direct access; until then I
will honor requests for small numbers of back issues. Later, vtopus
will also be on BITNET, so UUCP, ARPANET, and BITNET mail will all be
handled from one place.

4. Interim situation

Meanwhile, please try to send mail to my BITNET address,
   foxea@vtvax3.bitnet,
which will always remain as an option for reaching me. ARPANET and
CSNET members can reach that with address
   foxea%vtvax3.bitnet@wiscvm.wisc.edu
and BITNET members can reach it directly. The address for vtopus is
now and will continue to be
   fox@vtopus.cs.vt.edu
but I prefer it not be used a great deal till our ARPANET connection
is perfected.

5. Help with address changes

Please notify me in advance if you change address or wish to drop your
subscription, unless you are handling these matters with someone who
maintains a local redistribution. Please try to give complete
addresses, and if it is not obvious, indicate if your address is
relative to BITNET or ARPANET or UUCPNET since it is sometimes hard to
reach people. If you stop receiving IRList, be sure to let me know and
we can try to see what happened - I drop people when mailers tell me
messages are not getting through.

Thanks for your patience! - Ed

------------------------------

Date: Thu, 6 Aug 87 15:58:43 edt
From: fox (Ed Fox)
Subject: Announcement from Dr. Matthew B. Koll

Dr. Matthew B.
Koll has asked me to announce his new address:

   Personal Library Software
   15215 Shady Grove Road
   Rockville MD 20850
   (301) 926-1402

He is no longer with George Mason University, and has shifted efforts
from his former company, KNM Inc., which marketed SIRE, to devote full
time to Personal Library Software. They have a package which is an
enhanced version of SIRE. Dr. Koll does not now have an ARPANET
address, so should be contacted directly at the address above. He may
have openings for experienced C programmers who are knowledgeable
about information retrieval, and have some background in UNIX.

------------------------------

Date: Fri, 24 Jul 87 16:28:09 PDT
From: George Cross
Subject: SMART

Hi,
Do you have a contact for getting a copy of SMART from Cornell? I
remember seeing a license agreement posted some time ago and Don Kraft
ordered one for LSU.
Thanks.
---- George

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
George R. Cross               cross@cs1.wsu.edu
Computer Science Department   ...!ucbvax!ucdavis!egg-id!ui3!wsucshp!cs1!cross
Washington State University   faccross@wsuvm1.BITNET
Pullman, WA 99164-1210        Phone: 509-335-6319 or 509-335-6636

[Note: contact chrisb@cornell.arpa by electronic mail, or write to
Professor Gerard Salton at Cornell. - Ed]

------------------------------

Date: Fri, 10 Jul 87 17:06:49 EDT
From: seismo!mnetor!lsuc!dave
Subject: Indexing of a complex statute for on-line retrieval

We at the Law Society of Upper Canada are responsible for post-law
school legal education in Ontario, both for call to the Bar (the Law
Society governs the legal profession in the province and admits new
members through the Bar Admission Course) and for continuing legal
education. We've been using CAI for several years, particularly to
teach Canadian income tax law. Our tax courses are taken by over 1,000
students a year plus a number of lawyers and others, and we're
developing more advanced courses for lawyers' use.
We have the opportunity to acquire an on-line version of the
(Canadian) Income Tax Act, a rather massive statute. In its published
version, along with history of changes, regulations and various minor
annotations, it's over 1400 pages. I'm told the raw on-line data is
something like 5-10Mb.

The publisher is interested in us putting the Act up on our system so
they can gain experience in the "electronic publishing" field, and
learn how it might be used and how it can best be organized for
retrieval. They are therefore willing to let us have it for free.

My interest is in making this tremendously useful information
available to people who are on our system anyway for studying tax
through CAI. If the experiment is successful, we might look to putting
other primary and secondary tax sources on-line in the future.

Ours is a UNIX system, a Perkin-Elmer 3220 (roughly the power of a
VAX-11/750) running UNIX version 7. We're educational source-licensed
for UNIX and can upgrade the license to System V if necessary.

My question is: how should I go about putting the data up on-line?
(We'll be getting the data in raw ASCII form from a different system.)
We don't have a lot of time to devote to this, as we're very busy with
other projects. Are there existing tools I can make use of?

At the most primitive level, I imagine I would just stick the data
into a UNIX file and give people existing tools like "grep" and "more"
for searching and browsing through it. I can imagine indexing the
section and subsection numbers too, perhaps by location in the file so
the user could seek to the right provision quickly.

I'm a real novice in the field of information retrieval, however. I'd
appreciate any suggestions as to (1) quick solutions or existing tools
which will make the data more usable; (2) references to literature on
storage/retrieval of complex statutes; and (3) specific ideas of more
complex indexing or retrieval mechanisms that we might implement down
the road.

Many thanks.
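
The idea above of indexing section numbers "by location in the file"
can be sketched roughly as follows. This is only an illustration in
Python, not a tool mentioned in the message: the heading pattern
SECTION_RE, the function names, and the sample text are all
assumptions, since the actual layout of the publisher's ASCII data is
not described.

```python
import io
import re

# ASSUMPTION: a provision starts on a line that begins with a label such
# as "12", "12.1" or "12(3)" followed by whitespace. The real data may
# use a different layout, and body lines that happen to start with a
# number would be picked up as false headings.
SECTION_RE = re.compile(rb"^(\d+(?:\.\d+)?(?:\(\d+\))?)\s")


def build_index(f):
    """Map each section label to the byte offset of the line that starts it."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()          # offset of the line about to be read
        line = f.readline()
        if not line:
            break
        match = SECTION_RE.match(line)
        if match:
            # Keep the first occurrence of a label if it appears twice.
            index.setdefault(match.group(1).decode("ascii"), offset)
    return index


def read_heading(f, index, label):
    """Seek straight to a provision and return its first line."""
    f.seek(index[label])
    return f.readline().decode("ascii").rstrip("\n")


# Made-up placeholder text standing in for the statute file.
sample = io.BytesIO(
    b"1 First provision\n"
    b"text of the first provision\n"
    b"2(1) Second provision\n"
    b"more text\n"
    b"2(2) Third provision\n"
)
section_index = build_index(sample)
print(read_heading(sample, section_index, "2(1)"))  # 2(1) Second provision
```

With a real file opened in binary mode, the index could be built once
and saved, so that a lookup costs one seek instead of a scan of the
whole file with "grep".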
David Sherman
Computer Education Facility
The Law Society of Upper Canada
Osgoode Hall
Toronto, Canada  M5H 2N6
dave@lsuc.uucp    +1 416 947 3466
{ seismo!mnetor  pyramid!utai  decvax!utcsri  ihnp4!utzoo } !lsuc!dave

[Note: There are various retrieval packages that might work. The SMART
system is available from Cornell for a nominal charge, but may not run
on your hardware/software. The Personal Librarian would probably work
and Matt Koll could tell you. See other msgs in this digest for
contact information about these two systems. There are many others
around, and many people working on legal information retrieval - I
hope some will contact you with details and you will let us know what
you decide. - Ed]

------------------------------

Date: Thu, 6 Aug 87 16:49:24 edt
From: fox (Ed Fox)
Subject: Demonstration of RESPONSA System

                       YOU ARE INVITED TO
         AN ONLINE DEMONSTRATION OF THE RESPONSA SYSTEM

              An advanced full-text retrieval system
                 (with morphological processing)
             for 2000 years of Rabbinical Literature

                               by
                         Yaacov Choueka
                  Bell Communications Research
                     Morristown, New Jersey

 (on sabbatical leave from the Department of Mathematics and Computer
          Science, Bar-Ilan University, Ramat-Gan, ISRAEL)

WHEN:  Wednesday, August 12, from 1:30 - 3pm, and 7:30 - 9pm

WHERE: Newman Library, 6th floor board room

WHAT:  Come and stop by if you would like to see

* An interesting full-text retrieval system with a remarkably fast
  response time (despite some "hostile" parameters such as the size of
  the database, the complexity of the search, the long and
  not-so-reliable telephone communications lines to Israel, and the
  1200-baud transmission rate).

* An automatically lemmatized (in a context-free sense) 50-million-word
  corpus (probably the only lemmatized one of this size in any
  language).

* A complete morphological component embedded in an operational
  retrieval system.

* An online module for accurate and complete morphological analysis of
  any word in the language.
* Some beginnings of applications of a short-context approach (how
  many different "following neighbors" are there for a given ambiguous
  word with 200,000 occurrences? How many of these neighbors occur
  more than 1000 times, and which are they? Do they disambiguate the
  given word? How can this information be used in on-line retrieval or
  dictionary building contexts?).

WHO:   Dr. Choueka has almost twenty years of experience in teaching
and research in computer science, some of it (in the early years) in
finite automata and formal language theory, but most of it in
information retrieval, computational linguistics and text processing.
He was part of the team that initiated RESPONSA in 1966, and has
served as its Director and Principal Investigator since 1975.

------------------------------

Date: Thu, 6 Aug 87 16:50:05 edt
From: fox (Ed Fox)
Subject: Seminar on Disambiguation

                    COMPUTER SCIENCE SEMINAR
                    McBryde Hall Room 201
             Wednesday, August 12, 10:15 - 11:30AM

                      Short Is Beautiful:
   Short-context disambiguation in large textual databases

                               by
                         Yaacov Choueka
                  Bell Communications Research
                     Morristown, New Jersey

 (on sabbatical leave from the Department of Mathematics and Computer
          Science, Bar-Ilan University, Ramat-Gan, ISRAEL)

ABSTRACT:

Morphological disambiguation (i.e., finding the intended "correct"
meaning of an ambiguous word in a specific context) is an
intellectually challenging and practically important issue in
automatic text processing. One of the suggested pragmatic approaches,
especially viable for large textual databases, the short-context
method, proposes to use the (very) short context of an ambiguous word
as an adequate vehicle for its disambiguation. An experiment carefully
designed to test this idea and its validity was developed and applied
to a small French corpus some time ago, and the results were recently
reported elsewhere.
Based on the clearly positive outcome of this test, an online
short-context disambiguation program was incorporated as an
operational component in the Responsa full-text retrieval system
(Hebrew, 50 million words), and is now being tested on a large scale.
Using this program, the user can submit a word W to the system, which
will respond by instantly displaying a list of all the different right
(left) neighbors of W in the database, together with each neighbor's
"local" frequency (its frequency as a neighbor of W), ranked by the
local frequencies. Preliminary findings show that more often than not
such a short context of the word is enough to correctly disambiguate
its appropriate occurrences. If needed, however, a further expansion
of a right neighbor into the set of its own right neighbors can again
be displayed, giving the set of all the different two-word right
contexts of the word under examination.

It was found that, in general, no more than a few minutes are required
for a casual user to decide on the intended meaning of an ambiguous W
in its most frequent contexts, thus resulting in the immediate
disambiguation of thousands of occurrences of W in the text. When
automatically recorded, the user's decisions can greatly help in
achieving a "context-sensitive" lemmatization of the corpus, once its
"context-free" one has been completed.

The method is also very useful in information retrieval contexts,
where it gives the user an efficient tool for specifying, in a query
with an ambiguous word, which of the word's contexts should be
retrieved, thus greatly enhancing the precision of the retrieval.

Finally, it is expected that by gradually accumulating these
disambiguation decisions in the appropriate word-entry of the
available automatic dictionary of the language, "local expert systems"
for many ambiguous words will develop, that can greatly facilitate
ambiguity resolution in practical situations.
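
The neighbor listing described in the abstract can be illustrated with
a small sketch. It is written in Python and is not the Responsa
implementation: the function names, the whitespace tokenization, and
the English sample sentence are all assumptions chosen for
readability (the real system works on a lemmatized Hebrew corpus).

```python
from collections import Counter


def ranked(counts):
    """Sort (neighbor, local frequency) pairs, most frequent first."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def right_neighbors(tokens, word):
    """Local frequency of every token that immediately follows `word`."""
    counts = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word
    )
    return ranked(counts)


def two_word_right_contexts(tokens, word, neighbor):
    """Expand one right neighbor into the tokens that follow the pair."""
    counts = Counter(
        third
        for first, second, third in zip(tokens, tokens[1:], tokens[2:])
        if first == word and second == neighbor
    )
    return ranked(counts)


# Illustrative text only; "bank" plays the role of the ambiguous word W.
text = "the bank of the river and the bank of england near the bank account"
tokens = text.split()

print(right_neighbors(tokens, "bank"))
# [('of', 2), ('account', 1)]
print(two_word_right_contexts(tokens, "bank", "of"))
# [('england', 1), ('the', 1)]
```

Left neighbors could be listed the same way by pairing each token with
the one before it. A user's decision about each frequent neighbor
(for example, that "bank account" is the financial sense) would then
apply to every occurrence of that two-word context at once.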
------------------------------ END OF IRList Digest ********************