IRList Digest Thursday, 16 October 1986 Volume 2 : Issue 55 Today's Topics: Query - Addresses of Robertson and Croft - List of outstanding problems in information retrieval Discussion - SOUNDEX: why not consider Davidson's algorithm ? - SOUNDEX: what about canonical form for given names etc.? - SOUNDEX: 92 lines of code included Article - Abstract on Geometric Analysis of Similarity Measures News addresses are ARPANET: fox%vt@csnet-relay.arpa BITNET: foxea@vtvax3.bitnet CSNET: fox@vt UUCPNET: seismo!vtisr1!irlistrq ---------------------------------------------------------------------- Message-Id: <8610130807.AA04199@cernvax.UUCP> From: Hans-Peter Giger Date: Mon, 13 Oct 86 08:07:21 -0100 Subject: email adress unknown Hi does anybody know the email adresses of the following persons: S.E Robertson at City University in London W.B Croft at University of Massachusetts We are looking for algorithms to help us converting weighted search request into series of boolean search request. We are using Elementary Logic Conjuncts as presented by Jamieson and Heine. Thanks a lot Adress: mcvax!cernvax!ethz!giger Hanspeter Giger [Note: The address for Bruce is croft%umass-cs.csnet@csnet-relay.arpa You might contact Chris Buckley (or he might respond to this) about some of the work at Cornell that may relate. I am not sure if S.E. Robertson receives this digest; if he does not and is interested please let him know. - Ed] ------------------------------ Date: Fri, 10 Oct 86 09:45:15 edt From: choueka@thunder.bellcore.com (Yaacov Choueka) Message-Id: <8610101345.AA15205@thunder.bellcore.com> Subject: problems in inf. ret. Help I was referred to you by M. Lesk for the following inf.: A few years ago Salton published a list of outstanding problems in inf. ret. (probably in Forum but I am not sure).Do you have any ref. on that? (May be a copy?). Thanks. Y. Choueka (tel. (201) 8295175) [Note: In JASIS, July 1985, 36(4):268-271, G. Salton talks about some of the progress that has been made and some of the areas where more work is needed. In SIGIR Forum 14(4), Summer 83 (also is Proc. 6th Int. ACM SIGIR Conf. on R&D in IR), M.E. Maron had article "Open Problems in Information Retrieval." Perhaps other readers have other references, or would like to send in comments on problems. There was a good discussion at the ASIS Annual Meeting sessions run by ACM SIGIR; perhaps one of the panelist there or a participant would like to summarize the comments made. - Ed] ------------------------------ Date: Sat, 11 Oct 1986 15:37 CST From: Robert H Greenfield Subject: Davidson Code In IRList Vol 2 No 50, R Daniel Johnson comments on Soundex Code in reply to a request for information from Stephen Page (IRList Vol 2 No 44). As a part of my dissertation, I investigated Soundex Code and compared it to Davidson's code. To put it quickly, the Davidson Code is easier to calculate and for most purposes superior to the Soundex. Soundex brought too many non-similar sounding names together. In some cases, Davidson's code didn't bring the desired names together. This study looked for duplicate records in an approximately 25,000 name radiology data base. I don't understand why Davidson's code is ignored as much as it is. Davidson L. Retrieval of Misspelled Names in an Airlines Passenger Record System. Comm ACM; 1962; 5:169-71. Greenfield RH. An Experiment to Measure the Performance of Phonetic Key Compression Schemes. Meth Inform Med; 1977 Oct; 16(4):230-3. R H Greenfield, University of Regina. PS: Where is Vijay Raghavan hiding these days? Acknowledge-To: Robert H Greenfield [Note: Vijay's new network address in Southern Louisiana is raghavan%usl.csnet@csnet-relay.arpa I agree with you that people should look at other algorithms besides Soundex. I had some students compare various methods, and incorporated them in our version of the SMART system a few years ago. We did some retrieval tests and found Soundex to give lower precision but higher recall than either Davidson's 1962 method, or Davidson's method with "m" and "n" conflated. See article in Proceedings of ACM SIGIR Conf. in 1985 for a few more details. - Ed] ------------------------------ Message-Id: Date: Mon, 13 Oct 86 15:00:29 edt From: cfe@andrew.cmu.edu (Craig F. Everhart) Subject: Soundex improvements? When Knuth introduces Soundex, he says that it works pretty well for surnames. Are there other, similar, name-compression techniques that are designed to work well for given names, or for any names? Naively, I could ask about simple, cheap techniques to map ``Bill'' and ``William'' to the same canonical representation--a different canonical representation, of course, from the one to which it maps ``Liz'' and ``Elizabeth'' and ``Libby.'' ------------------------------ Date: Thu, 09 Oct 86 09:38:15 O From: Henry Nussbacher Subject: SOUNDEX I have been using a SOUNDEX-like program in VM for the past year to phonetically find user's last names. There is alot of optimizing that can be done in this area. The program is currently optimized for American last names and not for European names. Example: A name like Finkel would be reduced to 2736 but Finkle would become 27361 since the trailing vowel is deemed to be significant in American names. An Israeli name like Mizrachi would prefer to have the trailing vowel dropped. I leave the optimization up to all computer scientists and hackers. For me it finds a match 98% of the time. The code is written in Rexx and should be fairly easy to understand for those that know PL1 or Pascal. > /* > This exec will take a single character string that is a person's > last name and change it into a numeric index. This index can then > be used to search. > > Syntax: > ====== > > PHONETIC name > The EXEC will stack a number when it leaves. It the responsibility > of the caller to handle the stacked value. > > Phonetic rules: > ============== > > 1) 'W' or 'H' at beginning of word is kept, 'W' or 'H' anywhere else > is thrown out > 2) Vowel at beginning or end of word is unique, anywhere else it > is thrown out > > > Logic: > ===== > > 1) Uppercase the name > 2) Change the following letters: > S,Z to 9 > R to 8 > M,N to 7 > L to 6 > W,H to Z > D,T to 4 > C,G,J,K,Q,X to 3 > B,F,P,V to 2 > anything else to Y > 3) Insert an 'X' at the beginning and the end > 3) Change any 'YX' to '1' > any 'XY' to '1' > any 'XZ' to '5' > 4) Change all single occurrences of 'X', or 'Z' to null > 5) Reduce all double and triple occurrences of the numbers - 2,3, > 4,6,7,8,9 to be single digit in length. > 6) Change all occurrences of 'Y' to null > > Example: > ======= > > A search of the Weizmann NAMES file for MUSPACHER would find > NUSSBACHER. > > > Author: Henry Nussbacher > Date : July 1985 > */ > Arg name /* Pick up the name */ > Upper name /* Upper case it */ > nuname = Translate(name,'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY222233333344ZZ677899', > '|@#$%^&*()_-+=!c|\"}{?/`~AEIOUYBFPVCGJKQXDTHWLMNRSZ') > nuname = 'X'nuname'X' /* Surround the word */ > nuname = Change(nuname,'YX','1') > nuname = Change(nuname,'XY','1') > nuname = Change(nuname,'XZ','5') > nuname = Change(nuname,'Z','') > nuname = Change(nuname,'X','') > nuname = Change(nuname,'222','2') > nuname = Change(nuname,'333','3') > nuname = Change(nuname,'444','4') > nuname = Change(nuname,'666','6') > nuname = Change(nuname,'777','7') > nuname = Change(nuname,'888','8') > nuname = Change(nuname,'999','9') > nuname = Change(nuname,'22','2') > nuname = Change(nuname,'33','3') > nuname = Change(nuname,'44','4') > nuname = Change(nuname,'66','6') > nuname = Change(nuname,'77','7') > nuname = Change(nuname,'88','8') > nuname = Change(nuname,'99','9') > nuname = Change(nuname,'Y','') > Queue nuname > Exit > > > Change: Procedure > Parse Arg string, old, new > If old == '' Then Return new||string > out = '' > Do While Pos(old,string) ^= 0 > Parse Var string prepart (old) string > out = out||prepart||new > End > Return out||string ------------------------------ Date: Mon, 8 Sep 86 10:15:56 cdt From: Bill Jones Message-Id: <8609081515.AA09912@hulk> Subject: abstract submission to IRList Dear Dr. Fox, Enclosed please find an abstract of a manuscript "Pictures of Relevance: A Geometric Analysis Of Similarity Measures" by myself and George Furnas. This manuscript will shortly appear as an MCC tech report and it has also been submitted for possible publication in JASIS. Copies can be obtained by contacting me (William P. Jones) in any of three ways: 1. email - send to "mccabe@mcc.com" ("mccabe" is a family name more distinctive than "jones".) 2. phone - 512-338-3326 3. or by writing me at the address below. I thought it appropriate to give you readers advance notice of this manuscript. Thank you. PICTURES OF RELEVANCE: A GEOMETRIC ANALYSIS OF SIMILARITY MEASURES William P. Jones, Microelectronics and Computer Technology Corporation P.O. Box 200195, Austin, Texas 78720 George W. Furnas, Bell Communications Research, 435 South St. Morristown, N. J. 07960 We want computer systems that can help us assess the similarity or relevance of existing objects (e.g. documents, functions, commands, etc.) to a statement of our current needs (e.g. the query). Towards this end, a variety of similarity measures have been proposed. However, the relationship between a measure's formula and its performance is not always obvious. A geometric analysis is advanced and its utility demonstrated through its application to six conventional information retrieval similarity measures and a seventh spreading activation measure. All seven similarity measures work with a representational scheme wherein a query and the database objects are represented as vectors of term weights. A geometric analysis characterizes each similarity measure by the nature of its iso-similarity contours in an n-space containing query and object vectors. This analysis reveals important differences among the similarity measures under consideration and suggests conditions under which these differences will have an empirically verifiable impact upon retrieval performance. The cosine coefficient, for example, is shown be insensitive to between- object differences in term relevance while the inner product measure is shown to be subject to an unbounded influence from individual query terms. The context-sensitive spreading activation measure may overcome both of these limitations and deserves further study. The geometric analysis is intended to complement, and perhaps to guide, the empirical analysis of similarity measures. [Note: I have read this and found it quite interesting. They use equi-similarity contours similar to those we used in the p-norm extended Boolean query processing studies dating back some 3-4 years, but have applied that to vector forms. They have added in a taxonomy of characteristics of measures that sharpens the effects and differences between measures. And the spreading activation approach has been receiving a lot of attention in the AI community. - Ed] ------------------------------ END OF IRList Digest ********************