IRList Digest            Thursday, 7 April 1988      Volume 4 : Issue 17

Today's Topics:
   Query - Search engine usable for literary-critical and linguistic study
         - Graphical display of online thesauri
         - System design for vertical format recording
         - Info. on recent events relating to IR
         - IR and AI - work on parsing?
         - NPL collection, commercial retrieval techniques
         - CED Prolog fact base, CD-ROM standards

News addresses are Internet or CSNET: fox@vtopus.cs.vt.edu
                   BITNET: foxea@vtvax3.bitnet

----------------------------------------------------------------------

Date: Sun, 6 Mar 1988 10:50 CST
From: Robin C. Cover
Subject: ACM SIGIR FORUM (posting for textual-studies search engine)

Dear Professor Fox,

Appended below is a revised/corrected version of a posting I sent last night for IRList. I later uncovered one of the general information sheets for IRList and discovered that there is something called SIGIR FORUM, which is connected with the ACM. I believe that the audience I wish to reach with this posting is the group of computational linguists who do textual studies in a sort of "humanities" arena. In any case, if you could post the message on the SIGIR FORUM, I would be most grateful. . . .

[Note: I trust that Bill Frakes and Vijay Raghavan will pick this news item up for the next Forum - they are the current co-editors and will sometime give their addresses and a plea for materials in IRList, I am sure. - Ed.]

Many thanks if you can help.

Professor Robin C. Cover

===========================================================

I'm looking for a search engine which probably does not exist, but I would like advice from those more knowledgeable about text retrieval systems. What I want is a text retrieval system optimized for literary-critical and linguistic study. The major requirements for the search engine are as follows:

(1) Literary texts should be "understood" by the system in terms of the individual document's structure, as indicated by markup elements.
The user should be able to specify within a search argument that proximity values, positional operators, comparative operators and logical operators govern the search argument and the textual units to be searched IN ACCORDANCE WITH THE HIERARCHICAL STRUCTURE OF THE DOCUMENT. That is, if a document is composed of books, chapters, pericopes, verses and words, then expressions within the search argument must be able to refer to these particular textual units. If another document (or the *same* document, viewed under a different hierarchical structure) contains chapters, paragraphs, sub-paragraphs (strophes), sentences and words, then expressions in the search argument should be framed in terms of those textual units. To borrow a definition of "text" from the Brown-Brandeis-Harvard CHUG group: the text retrieval system must be capable of viewing each of its documents or texts as an "ordered hierarchy of content objects" (OHCO).

(2) The database structure must be capable of supporting annotations (or assigned attributes) at the word level and, ideally, at any higher textual level appropriate to the given document. Most record-based retrieval systems cannot accommodate the word-level annotations that textual scholars or linguists would like to assign to "words." More commonly, if such databases can be modified to accommodate annotations at the word level, the record-field structure is thereby contorted in ways that introduce new constraints on searching (the inability to span record boundaries, for example). Preferably, even the definition of "word" ought not to be hard-coded into the system. Hebrew, for instance, contains "words" (graphic units bounded by spaces) which may embody three or four distinct lemmas. Minimally, the database must support annotations at the word level (e.g., to account for the assignment of lemma, gloss, morphological parse, syntactic function, etc.) and these annotations must be accessible to the search engine/argument.
Though not absolutely required, it is desirable that attributes could also be assigned to textual units above "word," and such attributes should be open to specification in the search argument. Linguists studying discourse, for example, might wish to assign attributes/annotations at the sentence or paragraph level.

(3) The search engine should support the full range of logical operators (Boolean AND, OR, NOT, XOR); user-definable proximity values (within the SAME, or within "n", textual units of various levels); user-definable positional operators (precedence relations governing expressions or terms within the search argument); and comparative operators (for numerical values). The search argument should permit nesting of expressions by parentheses within the larger Boolean search argument. Full regular-expression pattern matching (as in grep) should be supported, as well as macro (library/thesaurus) facilities for designating textual corpora to be searched, discontinuous ranges or text-spans within documents, synonym groups, etc. Other standard features of powerful text retrieval systems are assumed (set operations on indices; session histories; statistical packages; etc.).

Most commercial search engines I have evaluated support a subset of the features in (3), but do very poorly in support of (1) and (2). The text retrieval systems which claim to be "full text" systems actually have fairly crude definitions of "text": they attempt to press textual data into rigid record-field formats that do not recognize hierarchical document structures, or are not sufficiently flexible to accommodate a wide range of document types. Three commercial products which attempt to support (1) are WORDCRUNCHER, Fulcrum Technologies' FUL-TEXT and BRS-SEARCH. I know of no systems which intrinsically support requirement (2), though LBASE perhaps deserves a closer look, and a few other OEM products promise this kind of flexibility.
It may be possible to press FUL-TEXT or BRS-SEARCH into service, since both have some facility for language definition. Another promising product is the PAT program being developed at the University of Toronto in connection with the NOED (New Oxford English Dictionary). But I may have overlooked other commercial or academic products which are better suited for textual study, or which could be enhanced/modified in some fashion other than a bubble-gum hack. It is not necessary that a candidate possess all of the above features, but the basic design should be compatible with extending the system to support these functional specs, and the developers should be open to program enhancements. Ideally, such a system would work with CD-ROM, though this is not an absolute requirement. I would like good leads of any kind, but particularly products that could be leased/licensed under an OEM agreement...for microcomputers, I should add.

Thanks in advance to anyone who can suggest names of commercial packages or academic software under development which meet the major requirements outlined above, or which could be *gently* bent to do so. I will be happy to report a summary of responses to this posting if there is general interest.

Professor Robin C. Cover
ZRCC1001@SMUVM1.bitnet
3909 Swiss Avenue
Dallas, TX 75204
(214) 296-1783, 824-3094

[Note: There are many people interested in this type of system. Two groups working on related efforts are the folks at UNC Chapel Hill (including Steve Weiss and John Smith) and the group at the U. of Chicago (including Scott Deerwester in the Graduate Library School). There are also efforts at Bellcore and OCLC. I hope some of the people involved will reply to you and that you can summarize your observations. - Ed.]
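To make the three requirements above concrete, here is a minimal sketch, in Python, of what such a system's core might look like. None of the products named above are being described; every name in the code (Node, and_within, the sample Hebrew data) is invented purely for illustration. It shows a document held as an ordered hierarchy of content objects (requirement 1), words carrying multiple lemmas as annotations (requirement 2), and a Boolean AND scoped to a user-chosen textual unit (requirement 3).

```python
# Illustrative sketch only - not the design of any product mentioned above.
# (1) A document as an "ordered hierarchy of content objects";
# (2) annotations (e.g. lemmas) attached at the word level;
# (3) a Boolean AND whose proximity scope is a named textual unit.

class Node:
    def __init__(self, unit, children=(), text=None, **annotations):
        self.unit = unit                  # "chapter", "verse", "word", ...
        self.children = list(children)
        self.text = text                  # surface form, for word nodes
        self.annotations = annotations    # e.g. lemma=(...), parse="..."

def units(node, unit):
    """Yield all nodes of the given textual unit, in document order."""
    if node.unit == unit:
        yield node
    for child in node.children:
        yield from units(child, unit)

def and_within(doc, term_a, term_b, unit):
    """Return each node of the given unit whose words match both terms,
    where a term matches by surface form or by any assigned lemma."""
    def matches(word, term):
        return word.text == term or term in word.annotations.get("lemma", ())
    return [scope for scope in units(doc, unit)
            if any(matches(w, term_a) for w in units(scope, "word"))
            and any(matches(w, term_b) for w in units(scope, "word"))]

# A Hebrew graphic unit may embody several lemmas (requirement 2).
doc = Node("chapter", [
    Node("verse", [Node("word", text="bereshit", lemma=("b", "reshit")),
                   Node("word", text="bara", lemma=("br'",))]),
    Node("verse", [Node("word", text="bara", lemma=("br'",))]),
])

hits = and_within(doc, "reshit", "bara", "verse")
print(len(hits))  # 1: only the first verse contains both terms
```

Because the hierarchy is data rather than a fixed record layout, the same search can be rescoped simply by naming a different unit ("chapter" instead of "verse"), which is the flexibility the posting asks for.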
------------------------------

Date: Mon, 7 Mar 88 08:45:14 CST
From: Richard Pollard
Subject: IRList Digest

In "Connecting and Evaluating Thesauri: Issues and Cases," International Classification 14(2), 1987, Roy Rada writes: "MeSH has been placed on several computers at NLM for the exploration of graphic interfaces to MeSH.... the user can enter a term and be graphically shown the hierarchy of terms around it or can choose to traverse the thesaurus from top to bottom by mouse-activating terms on the screen."

Two questions:
1) Are there any reports available from NLM about this work?
2) Does anyone know of any other work going on in the area of graphical display of online thesauri?

Richard Pollard
School of Library and Information Science
University of Wisconsin--Milwaukee
P.O. Box 413
Milwaukee, WI 53201

[Note: Roy Rada does not have access to his mail at this time - I hope another at NLM will reply on his behalf. I think Don Crouch may be doing some related work. Anyone else recall relevant projects? - Ed.]

------------------------------

Date: 16 Mar 88 01:20:56 GMT
Subject: Submission for comp-theory-info-retrieval
From: randolph@mfbbs.UUCP (Verle Randolph)
Newsgroups: comp.theory.info-retrieval
Subject: Info wanted: vertical format recording
Sender: marc@mfbbs.UUCP
Reply-To: randolph@mfbbs.UUCP (Verle Randolph)

Does anyone have any information on total system design for vertical format recording? PLEASE reply via mail.
---
Verle Randolph
UUCP: ...!rutgers!pbox!romed!mfbbs!randolph

------------------------------

Date: Fri, 18 Mar 88 17:27 EDT
From: LEWIS@UMass
Subject: more info, anyone?

Dear IRLIST:

There were several things in issue 4.9 I'd be interested in seeing follow-ups posted on, if anyone has the time. To wit:

--If someone attended the Microsoft CD-ROM conference, I'd be interested in hearing about that. In particular, what's up with ACM and database products, and what is the "new Full Text SIG"?

[Note: I attended, as did some 2000 others.
I really liked the talk by Henry Kucera of Brown on "The Texture of Text". We did a demo of Virginia Disc 1, which should be out within the next month. There was a SIGIR session which included presentations by several people from ACM headquarters about ACM and its activities in CD-ROM and electronic publishing. A very nice technical discussion of CD-ROM performance modeling work at the U. of Waterloo was also given by D. Ford. There are lots of CD-ROM products now, though few really innovative retrieval systems. CD-I and DVI are still progressing. There is a big push now for multimedia programming, and tools for more conventional CD-ROM authoring are becoming available. - Ed.]

--Anyone read the Goldmann book "Online Research and Retrieval with Microcomputers" and want to post a review?

--Anyone attend the Salton & McGill discussion on why IR research doesn't get applied, and want to post a summary?

Someday I'll attend something and summarize--I promise!

David D. Lewis
COINS Dept.
University of Massachusetts, Amherst
Amherst, MA 01003
CSNET: lewis@cs.umass.edu
BITNET: lewis@umass

------------------------------

Date: 21 Mar 88 15:16:00 EST
From: James (J.G.) Borynec
Subject: Information Retrieval and AI

Sir,

I am currently investigating the linkage between information retrieval and natural language understanding. On the face of it, it doesn't seem that hard to parse technical documentation (at least to the noun-phrase level), create some logical representation for each phrase, and use it for both indexing and retrieval. This parsing wouldn't have to be very sophisticated for it to be much better than the current Boolean keyword techniques (I hope!). Have you ever heard of any activities related to this idea? Do you know of any good references? I am trying to gather enough information to convince my boss (Dan Zlatin) that this is a worthwhile activity for me to pursue.

Thanks in advance ...
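The idea in the message above - parse to the noun-phrase level, then index the resulting phrases instead of bare keywords - can be caricatured in a few lines of Python. This is a toy illustration only: the part-of-speech tagger is stubbed out with a tiny invented lexicon, and a real system would need genuine tagging and parsing.

```python
# Toy sketch of noun-phrase indexing. The lexicon is an invented stub;
# real noun-phrase extraction requires an actual part-of-speech tagger.

LEXICON = {"the": "DET", "a": "DET", "fast": "ADJ", "fourier": "ADJ",
           "transform": "N", "algorithm": "N", "computes": "V"}

def noun_phrases(tokens):
    """Greedily collect DET? ADJ* N+ token runs as candidate noun phrases."""
    phrases, i = [], 0
    while i < len(tokens):
        j = i
        if LEXICON.get(tokens[j]) == "DET":
            j += 1
        while j < len(tokens) and LEXICON.get(tokens[j]) == "ADJ":
            j += 1
        k = j
        while k < len(tokens) and LEXICON.get(tokens[k]) == "N":
            k += 1
        if k > j:                      # at least one noun: record the phrase
            phrases.append(" ".join(tokens[i:k]))
            i = k
        else:
            i += 1
    return phrases

print(noun_phrases("the fast fourier transform algorithm computes".split()))
# ['the fast fourier transform algorithm']
```

Indexing whole phrases like this, rather than the individual keywords "fast", "transform" and "algorithm", is what would distinguish such a system from Boolean keyword retrieval.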
James Borynec (BORYNEC@BNR.BITNET)

[Note: There has been ongoing interest in this type of approach. One presentation at RIAO about work at MIT was provocative, but I did not see quantitative measures of the results. Sparck Jones & Tait, Smeaton, Fagan, Croft (and several students) and others have all done work in recent years - please forgive me for not mentioning others, and send in some replies about your activities! - Ed.]

------------------------------

Date: Wed, 23 Mar 88 20:12 EDT
From: LEWIS@UMass
To: foxea@vtvax3.bitnet
Subject: queries on NPL collection, commercial retrieval techniques

Dear IRLIST readers:

Query 1: Does anyone out there have access to the original text of the queries and documents in the NPL test collection? Our only copy has had punctuation, numerals, and abbreviations removed--fine for keyword-based retrieval, but not desirable for parsing! Also (and this is probably wishful thinking), has anyone ever written up a layman's description of what the standard queries mean? (What the heck are "multiple digit techniques in font decimal address"?!) In a number of cases it's hard for me even to be sure where attachment in a syntactic parse should be. (If someone's got parse trees for the queries, that would be useful, too.)

Query 2: Does anyone know of a recent work summarizing the retrieval techniques that are actually in use in operational retrieval systems, especially by on-line database companies? The most recent thing I've seen is some material in Salton & McGill's 1983 book.

Many thanks,

David D. Lewis
COINS Dept.
University of Massachusetts, Amherst
Amherst, MA 01003
CSNET: lewis@cs.umass.edu
BITNET: lewis@umass

------------------------------

Date: Tue, 5 Apr 88 10:50 N
From:
Subject: CED Prolog Fact Base

Dear Ed,

As you may remember, I have a copy of the fact base tape you sent to Oxford. It was the first tape you sent them (in case you have sent more since), and I was probably the first to order it, as my order was in Oxford before the tape.
Up until last week we only used the material for incidental checking, as the information we wanted (syntactic properties) was not in a usable form (only the part of speech is explicitly present). Last week, however, having nothing better to do in the last few days before Easter, I tried to transform the data into something more accessible. I aborted the attempt after a while, but will try again. Below is a description of where I went wrong and some inconsistencies I found. You have probably found them yourself by now, but then again you might not. . . .

Those were the consistent glitches I found. I would be interested in newer or reformatted editions of the fact base. Please keep me informed. One more question about the fact base: do you actually have a Prolog system which can contain all this stuff?

[Note: Thanks for your interest and comments! We are aware of many cases where editing and other changes were needed, but because many people requested a copy, we sent out that rough version. We have been working on the files in the meanwhile and will be sure to include the corrections you suggest, though we have caught quite a number already. We have access routines written for some of the information and are continuing that effort to integrate the data into a lexicon for the CODER system. We are trying to revise NU-Prolog, available from Melbourne University, to include all the changes we made to MU-Prolog so that CODER will work. It appears that NU-Prolog can handle such a large Prolog fact base. In the meanwhile, we have loaded some of the data into a B-tree using the C-tree (TM) software. - Ed.]

Then, as I am writing to you anyhow, something else. You have been mentioning a CD-ROM to be brought out by your group. How compatible is the CD-ROM scene at the moment? Will CDs formatted for, let's say, an IBM PC be readable on a Mac, on an ATARI ST or on a VAX? What system will your CD be for?
[Note: There is an ISO standard for CD-ROM file formats, so though we are aiming our CD-ROM at IBM PC or similar systems, it should be usable on any "conforming" CD-ROM drive and computer. - Ed.]

Best wishes,

Hans van Halteren
COR_HVH @ HNYKUN52

------------------------------

END OF IRList Digest
********************