IRList Digest           Thursday, 28 July 1988      Volume 4 : Issue 42

Today's Topics:
   Email - Address for Charles Meadow
   Query - Definition of hypertext/hypermedia
   Reply - Suffixing, stemming
   Discussion - Metamorph, stemming, online search style
              - Online search style
              - Metamorph
   Announcement - Forum on small-systems database products
                - NTIS demo on Japanese research
                - Thesis defense on comparing extended Boolean schemes

News addresses are
   Internet or CSNET: fox@vtopus.cs.vt.edu or fox@fox.cs.vt.edu
   BITNET: foxea@vtvax3.bitnet (soon will be foxea@vtcc1)

----------------------------------------------------------------------

Date: Wed, 27 Jul 88 08:58:06 CST
From: Jeff Huestis
Subject: Address for Charles T. Meadow

Ed: do you have an email address for Charles T. Meadow? ...

--Jeff

------------------------------

Date: Wed, 27 Jul 88 14:47 EDT
From: VENTURA%21514%atc.bendix.com@RELAY.CS.NET
Subject: What exactly IS "hypertext"/"hypermedia"?

Does anyone have a good (succinct) definition of what hypertext/-media
is? I am trying to figure out whether or not an application I am
working on qualifies.

CA Ventura

------------------------------

Date: Fri, 22 Jul 88 16:42:38 EDT
From: Donna Harman
Subject: reply to stemming query in IRDIGEST

[Note: to send mail to Donna, do not use the above address (at least I
could not get it to work) - instead try
   harman%icst-nav@icst-osi.arpa
Be careful in later correspondence since "Reply" may use the one you
see above under "From" rather than what I have given. - Ed.]

I don't know how to reply to the IRDIGEST, so I am trying it this way.

[Note: You did fine - use addresses in the header of each IRList or as
explained in the Welcome message. - Ed.]

Reply to the query on suffixing:

In the interest of answering the actual question, I am supplying four
references--my paper on stemming performance, and three papers on
actual algorithms.

Harman D., "A Failure Analysis on the Limitation of Suffixing in an
   Online Environment", Proceedings of the Tenth Annual International
   Conference on Research and Development in Information Retrieval,
   New Orleans, 1987.

Lovins J.B., "Development of a Stemming Algorithm", Mechanical
   Translation and Computational Linguistics 11, March 1968.
   (this is the description of the Lovins stemming algorithm which has
   been extended for use as the SMART stemmer)

Porter M.F., "An Algorithm for Suffix Stripping", Program, Vol 14,
   July 1980.
   (this is a newer algorithm, removing fewer suffixes)

Ulmschneider J. and Doszkocs T., "A Practical Stemming Algorithm for
   Online Search Assistance", Online Review 7(4), 1983.
   (this is a description of how to tailor-build a stemming algorithm
   for a given collection)

In the interest of rabid discussion on stemming, I will put forth the
following strawman for debate.

Stemming is not an improvement on full word retrieval except in two
situations:

   1) storage is a problem--stems store in less space, although the
      inverted file is not smaller (same number of postings, just
      organized under a smaller number of terms)

   2) the number of documents is small and/or recall is much more
      important than precision.

Fire away!

[Note: Since there are conflicting results regarding the value of
stemming, and the outcome seems to depend on the stemming algorithm
and the collection being used for the tests, why not just try to
figure out what combination of cases is best rather than make such a
categorical statement as you have done above? - Ed.]

------------------------------

Date: Mon, 25 Jul 88 19:14:54 EDT
From: MARCUS@Lids.mit.edu (Richard Marcus)
Subject: Metamorph Stemming Search Costs and Style

Ed, I have comments on three subjects in recent IRList Digests which
seem to be interrelated in various ways:

(1) Metamorph -- Ed, I admire your restraint in attempting to report
on this effort, which has received so much hype and provided so few
technical details by which to judge it.
I don't have any more details on Metamorph as such, but there was an
interesting article in BYTE (May, 1988; p 297ff) by Roy E. Kimbrell
which describes an apparently related "N-Gram" method attributed to
Raymond D'Amore and Clinton Mah of PAR Government Systems Corp
(McLean, VA).

[Note: full address is 1840 Michael Faraday Dr., Suite 300, Reston, VA
22090-5341 and switchboard is 703/478-9690 - Ed.]

This N-Gram approach uses many of the Salton SMART techniques
(weighted vectors, cosine matching, clustering, stemming, etc.) but
applied to letter strings, or n-grams, WITHIN words. Although I would
argue against statistical, non-word methods as techniques of CHOICE,
at least the methods are reasonably well explained and some indication
of experiments with a test corpus is given (but no details or
comparison with other methods).

(2) Ed, your pointers to Aalbersberg [IRLD:4(38)] on stemming were
good starters. Coincidentally, a stemming (conflation) algorithm in
the C programming language is given by Kimbrell in the above-mentioned
BYTE article. Let me also add that Julie Lovins, a linguist, developed
a nice stemming algorithm under our Intrex Project (Lovins, Mechanical
Translation 11:22-31[1968]) which has been used to good effect by us
and a number of other organizations. A useful evaluation of the
algorithm was reported by Julie in the Journal of ASIS [22(1)28-40;
January, 1971].

[Note: the Lovins method is the basis for what is used in SMART - Ed.]

One interesting point is how drastically the evaluation depends on the
context. Salton has, I believe, reported small but significant
effectiveness gains from simple stemmers in SMART. Donna Harman has
reported (Proceedings 1988 RIAO Conference, pp. 839-848), on the basis
of experiments with the NLM IRX system, that stemming doesn't help at
all. Harman suggests that the batch-oriented IRX context might be the
reason for the non-utility of stemming and that an interactive context
would probably yield different results.
Our own research supports the latter; experiments with our highly
interactive CONIT system (see, e.g., Marcus, Journal ASIS
34(6):381-404; Nov., 1983) have demonstrated the critical importance
of stemming in that context.

(3) Costs Affecting Search Styles -- Bill Joel (supported by Jeff
Huestis) is right on! Cost is a critical component of context. The
Telebase Easynet front end system owes a large part of its success to
techniques for holding down online costs. We have reported (see, e.g.,
Marcus, Proceedings ASIS 85; 22:289-292) how cost factors markedly
influence search behavior online. Despite exponential increases in
benefit/cost factors, we have not yet reached the point where online
users can derive anything like the full effectiveness of the
interactive capabilities of computers (although we're working toward
that goal with our 'smart Boolean' approach).

---Dick Marcus, MIT Lab for Information and Decision Systems...

------------------------------

Date: 25 Jul 88 17:03:00 EDT
From: Nahum (N.) Goldmann
Subject: Please post. Thanks. (re:Do online costs affect search styl

In response to Dr. Joel's request on IRLIST, the key factor in
negotiating a search online under cost pressure is the KNOWLEDGE OF
THE SEARCH SUBJECT. I discussed this in detail in Chapters 2 and 10 of
my book (ONLINE RESEARCH AND RETRIEVAL, TAB Professional and Reference
Books). This knowledge is generally associated with the END-USER of
information, as opposed to the INTERMEDIARY (information broker).

Your analogy with the library is entirely correct, except that a sane
specialist would never ask a librarian to search the stacks on
his/her behalf (precisely because it has to be interactive). I believe
that it is better for some (the end-user) to negotiate online, but
necessary for others (the intermediary) to define the search
beforehand.

Nahum Goldmann
acoust@bnr
Tel.
(613)763-2329

------------------------------

Date: Sun, 24 Jul 88 10:10:09 EDT
From: Tung-Ying Chang
Subject: Metamorph

Dear Professor Fox,

I have received Volume 4, Issues 36-40, and have tried to review the
comments and materials you mentioned in Issue 40. I read the article
"Word ladders and a tower of Babel lead to computational heights
defying assault" in Scientific American, Aug. 1987. I believe this is
the article that Defense Science mentioned in regard to Bell Labs'
research. It gives no technical details, only a general description.

I agree with you that we don't need to discuss commercial systems
unless there is something new. I suspect most of the "new things" are
covered by commercial secrecy. Anyway, I am interested in web
structure and morpheme retrieval.

Thank you very much. Good luck.

Tung-Ying Chang

------------------------------

Date: Mon, 25 Jul 88 15:04:27 EST
From: "James S. Cowie"
Subject: PCDBMS-L at YALEVM

Greetings, IRLIST people...

Just a brief note to inform you that, due to a great positive response
to initial inquiries, there now exists a Listserv forum for discussion
of small-systems database products in academic or library contexts.
All are welcome. The new list is PCDBMS-L at YALEVM. Products to be
discussed include Paradox, NotaBene, Quattro, Dbase, Rbase, DataEase,
Reflex, Revelation, etc.

yours truly,
James Cowie
Yale University Library Systems Office

------------------------------

Date: Wed, 27 Jul 88 13:07:59 EDT
From: Edward A. Fox
Subject: NTIS demonstration

On Friday July 29 at 2pm (in the Idea Salon in CPAP, at 104 Draper
Road, Blacksburg, VA) there will be a demonstration by Tim Feinstein
of NTIS of their system to access Japanese research work. All are
invited. For more information, contact John Dickey, Center for Public
Administration and Policy, VPI&SU (703) 961-5133/5830.

------------------------------

Date: Wed, 27 Jul 88 13:04:49 EDT
From: Edward A. Fox
Subject: defense

Whay C. Lee will have his MS thesis defense on Friday, July 29 at 10am
in McBryde room 558. The title of his thesis is "Experimental
Comparison of Schemes for Interpreting Boolean Queries". All are
invited.

- Ed Fox

------------------------------

END OF IRList Digest
********************