IRList Digest            Sunday, 28 February 1988      Volume 4 : Issue 14

Today's Topics:
   Source - Discussion and UNIX code for co-term term relations

News addresses are
   Internet or CSNET: fox@vtopus.cs.vt.edu
   BITNET: foxea@vtvax3.bitnet

----------------------------------------------------------------------

Date: 8 Feb 88 10:10 +0100
From: wyle%solaris.uucp@relay.cs.net
Subject: co-term term relations code attached

Hi Ed!

Sorry I haven't contributed anything to your digest for such a long time.
Care and feeding of the machines here has kept me too busy for many
things...

Attached is a Unix shell script which calculates the cosine co-term
relationship defined in "Co-Word Search: A System for Information
Retrieval," Bertrand Michelet and W. A. Turner, Journal of Information
Science, vol. 11 (1985), pp. 173-181.  The relationship is

   Aij = (Cij * Cij) / (Ci * Cj)

where Cij is the number of sentences in the collection in which words i
and j co-occur (counted at most once per sentence), Ci is the number of
sentences containing word i (again at most once per sentence), and Cj is
the same count for word j.

A second shell script, which filters out the 150 most common words in
English, is also in the shell archive.  Running these scripts against the
CACM document test collection (2.2 megabytes) takes about half an hour on
a normally loaded Sun-3/280 and produces a 6-megabyte report file.  The
goal was a quick-and-dirty solution, not an efficient one.

I helped teach a course in IR this semester and have some other
interesting "example solutions" to assignments in statistical text
analysis, all written in Unix shell script or Modula-2.  If anyone is
interested, send me e-mail (wyle%ifi.ethz.ch@relay.cs.net).  Please let me
know if other IR courses taught on Unix machines exist, and let's share
experiences about the exercises.

I am curious about others who are using Unix shell scripts and commands
for IR text analysis.  There is a wealth of text-filtering utilities
built into Unix, including awk, sed, and tr, as well as obscure (?)
concordance tools like egrep and ptx.  There are also public-domain
packages such as the humanities (hum) system by Dr. Tuthill, SMART by
Chris Buckley & Dr. Salton, etc.

[Note: When I started the project to redo SMART for UNIX at Cornell in
1980, we began with a few programs and a bunch of UNIX tools all pieced
together.  Gradually, to make things more efficient, the UNIX tools and
the use of Ingres were eliminated.  But for teaching purposes, it is a
good approach.  By the way, there is an effort at U. Chicago, with Scott
Deerwester and others, using UNIX to build tools for text processing and
to help scholars working with text collections. - Ed.]

* * *

Ed, you really should wail on your readers to contribute more.  I am a
member of a different mailing list whose moderator periodically sends out
"Contribute, you leeches!!  You know who you are!" messages whenever the
quality and volume drop.

[Note: Thanks for the encouragement - readers take heed! - Ed.]

* * *

Enjoy the co-word cosine code!  If anyone hacks up improvements,
variations, etc., please let me know!
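[Note: A quick worked example of the measure, using only the definitions
above - the numbers are invented for illustration and are not taken from
the CACM collection: if word i occurs in 10 sentences, word j in 20, and
the two co-occur in 5 sentences, then Aij = (5 * 5) / (10 * 20) = 0.125.
The same arithmetic as an awk one-liner, reading "Cij Ci Cj" from its
input:

   echo "5 10 20" | awk '{ printf("Aij = %g\n", ($1 * $1) / ($2 * $3)) }'

which prints "Aij = 0.125".  The report script below actually prints the
closely related cosine coefficient Cjk / (sqrt(Cj) * sqrt(Ck)), i.e. the
square root of Aij; it ranks word pairs in the same order. - Ed.]
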
--------------------- cut here ---------------------------------------
#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create:
#       p1
#       sl
# This archive created: Mon Feb 8 09:25:08 1988
export PATH; PATH=/bin:/usr/bin:$PATH
if test -f 'p1'
then
	echo shar: "will not over-write existing file 'p1'"
else
cat << \SHAR_EOF > 'p1'
#!/bin/sh
#
# Term dependency calculation script          M F Wyle   21.01.88
#
#
# Define some constants.  Be sure to change these per system!
#
# debug
set -x

FILE=/usr/frei/ir/ir/data/testcoll/cacm/docs    # document collection
SRTD=/usr/frei/ir/wyle                          # scratch directory for sort
STOPLIST=./sl                                   # stop-list filter script
TEMPFILE=./tmp.fil
REPORTFILE=./report.cacm
COMBOFILE=./word.combos
COMBOFIL1=./word.combo.j1
FREQFILE=./word.freq

# First calculate the occurrence of each individual word in all sentences:

cat $FILE |
tr -cs '.?!A-Za-z' ' ' |        # delete all unwanted chars
tr '.?!' '\012' |               # separate sentences onto new lines
tr A-Z a-z |                    # convert upper to lower case
$STOPLIST > $TEMPFILE           # eliminate words in stop-list

cat $TEMPFILE |
# Eliminate duplicate words from each sentence and print each word:
awk '
{
	for (i = 1; i <= NF; i++)
		dup[i] = 0
	for (i = 1; i < NF; i++) {
		for (j = i+1; j <= NF; j++) {
			if ($i == $j)
				dup[i] = 1
		}
	}
	for (i = 1; i <= NF; i++)
		if (dup[i] == 0)
			print $i
}' |
sort |                          # sort words in alphabetical order
uniq -c |                       # count occurrence frequency
awk '                           # print fields in reverse order
{
	print $2 " " $1         # print word, then occurrence count
}' > $FREQFILE                  # save in a file

########################################################################
#
# Now calculate all word combinations:
#
########################################################################

cat $TEMPFILE |
awk '
# Remove all duplicate words from each sentence,
# then print all word pairs from the sentence:
{
	for (i = 1; i <= NF; i++)
		dup[i] = 0
	out = ""
	for (i = 1; i < NF; i++) {
		for (j = i+1; j <= NF; j++) {
			if ($i == $j)
				dup[i] = 1
		}
	}
	for (i = 1; i < NF; i++) {
		for (j = i+1; j <= NF; j++)
			if ((dup[i] == 0) && (dup[j] == 0))
				print $i " " $j
	}
}' |
sort | uniq -c > $COMBOFILE

#########################################################################
#
# Join the files by 1st word, get:
#	FirstWord FirstWordFreq FreqTogether SecondWord
#
#########################################################################

join -j2 2 $FREQFILE $COMBOFILE | sort +3 > $COMBOFIL1   # sort by 2nd word

#########################################################################
#
# Join the files by 2nd word, get:
#	SecondWord SecondWordFreq FirstWord FirstWordFreq FreqTogether
#
#########################################################################

join -j2 4 $FREQFILE $COMBOFIL1 |
awk '
{
	# Ajk = Cjk / (sqrt(Cj) * sqrt(Ck)) -- cosine coefficient of the pair
	Ajk = $5 / ( sqrt($2) * sqrt($4) )
	printf("%-35s %-35s %g\n", $3, $1, Ajk)
}' |
sort -T $SRTD +2 -rn > $REPORTFILE      # sort pairs by descending Ajk

#
# print only word pairs whose Ajk values are between 1 and 0.1
#

/bin/rm -f $COMBOFILE           # remove work files
/bin/rm -f $COMBOFIL1
/bin/rm -f $FREQFILE
/bin/rm -f $TEMPFILE
SHAR_EOF
chmod +x 'p1'
fi
if test -f 'sl'
then
	echo shar: "will not over-write existing file 'sl'"
else
cat << \SHAR_EOF > 'sl'
#!/bin/sh
awk '
BEGIN {
	# Stop-list: single letters plus the most common English words.
	sl["a"] = 1; sl["b"] = 1; sl["c"] = 1; sl["d"] = 1; sl["e"] = 1;
	sl["f"] = 1; sl["g"] = 1; sl["h"] = 1; sl["i"] = 1; sl["j"] = 1;
	sl["k"] = 1; sl["l"] = 1; sl["m"] = 1; sl["n"] = 1; sl["o"] = 1;
	sl["p"] = 1; sl["q"] = 1; sl["r"] = 1; sl["s"] = 1; sl["t"] = 1;
	sl["u"] = 1; sl["v"] = 1; sl["w"] = 1; sl["x"] = 1; sl["y"] = 1;
	sl["z"] = 1;
	sl["about"] = 1; sl["after"] = 1; sl["against"] = 1; sl["all"] = 1;
	sl["also"] = 1; sl["an"] = 1; sl["and"] = 1; sl["another"] = 1;
	sl["any"] = 1; sl["are"] = 1; sl["as"] = 1; sl["at"] = 1;
	sl["back"] = 1; sl["be"] = 1; sl["because"] = 1; sl["been"] = 1;
sl["before"] = 1; sl["being"] = 1; sl["between"] = 1; sl["both"] = 1; sl["but"] = 1; sl["by"] = 1; sl["came"] = 1; sl["can"] = 1; sl["come"] = 1; sl["could"] = 1; sl["day"] = 1; sl["did"] = 1; sl["do"] = 1; sl["down"] = 1; sl["each"] = 1; sl["even"] = 1; sl["first"] = 1; sl["for"] = 1; sl["from"] = 1; sl["get"] = 1; sl["go"] = 1; sl["good"] = 1; sl["great"] = 1; sl["had"] = 1; sl["has"] = 1; sl["have"] = 1; sl["he"] = 1; sl["her"] = 1; sl["here"] = 1; sl["him"] = 1; sl["his"] = 1; sl["how"] = 1; sl["i"] = 1; sl["if"] = 1; sl["in"] = 1; sl["into"] = 1; sl["is"] = 1; sl["it"] = 1; sl["its"] = 1; sl["just"] = 1; sl["know"] = 1; sl["last"] = 1; sl["life"] = 1; sl["like"] = 1; sl["little"] = 1; sl["long"] = 1; sl["made"] = 1; sl["make"] = 1; sl["man"] = 1; sl["many"] = 1; sl["may"] = 1; sl["me"] = 1; sl["men"] = 1; sl["might"] = 1; sl["more"] = 1; sl["most"] = 1; sl["mr"] = 1; sl["much"] = 1; sl["must"] = 1; sl["my"] = 1; sl["never"] = 1; sl["new"] = 1; sl["no"] = 1; sl["not"] = 1; sl["now"] = 1; sl["of"] = 1; sl["off"] = 1; sl["old"] = 1; sl["on"] = 1; sl["one"] = 1; sl["only"] = 1; sl["or"] = 1; sl["other"] = 1; sl["our"] = 1; sl["out"] = 1; sl["over"] = 1; sl["own"] = 1; sl["people"] = 1; sl["right"] = 1; sl["said"] = 1; sl["same"] = 1; sl["see"] = 1; sl["she"] = 1; sl["should"] = 1; sl["since"] = 1; sl["so"] = 1; sl["some"] = 1; sl["state"] = 1; sl["still"] = 1; sl["such"] = 1; sl["take"] = 1; sl["than"] = 1; sl["that"] = 1; sl["the"] = 1; sl["their"] = 1; sl["them"] = 1; sl["then"] = 1; sl["there"] = 1; sl["these"] = 1; sl["they"] = 1; sl["this"] = 1; sl["those"] = 1; sl["three"] = 1; sl["through"] = 1; sl["time"] = 1; sl["to"] = 1; sl["too"] = 1; sl["two"] = 1; sl["under"] = 1; sl["up"] = 1; sl["us"] = 1; sl["used"] = 1; sl["very"] = 1; sl["was"] = 1; sl["way"] = 1; sl["we"] = 1; sl["well"] = 1; sl["were"] = 1; sl["what"] = 1; sl["when"] = 1; sl["where"] = 1; sl["which"] = 1; sl["while"] = 1; sl["who"] = 1; sl["will"] = 1; sl["with"] = 1; sl["work"] = 1; sl["world"] = 1; sl["would"] = 1; sl["year"] = 1; sl["years"] = 1; sl["you"] = 1; sl["your"] = 1; } out = "" for (i = 1 ; i <= NF ; i++) { if (sl[$i] != 1) out = out " " $i } print out }' SHAR_EOF chmod +x 'sl' fi exit 0 # End of shell archive -Mitchell F. Wyle wyle@ethz.uucp Institut fuer Informatik wyle%ifi.ethz.ch@relay.cs.net ETH Zentrum 8092 Zuerich, Switzerland +41 1 256-5237 ------------------------------ END OF IRList Digest ********************