Date: Mon, 4 Nov 85 17:36 EST
To: irdis at vpi
Subject: IRList Digest V1 #18
Reply-To: IRList%vpi@csnet-relay.arpa

IRList Digest           Monday, 4 Nov 1985      Volume 1 : Issue 18

Today's Topics:
     Query - NSF Travel Grant for 1986 R&D in IR Conf. - Pisa?
     Interactive index-in-context queries
     Call for Papers - ACL 86
     Article - Museums on Disc
     Announcement - IRList now on Bitnic's database server

----------------------------------------------------------------------

From: Don
Date: Fri, 1 Nov 85 09:17:10 cst
Subject: pisa travel grant announcement

when will IRlist print my Pisa travel grant announcement?

[Note: On Oct. 19 issues 15 and 16 of IRlist were both sent out.
Unfortunately, they were dated Sep instead of Oct, so people may have
been confused, or may have missed one of the two. In any case in
Issue 16 the second entry published is classified as
    Call for Papers - Applications for NSF funds to Pisa Conf.
which is a message from Don Kraft of LSU to "Dear World:"
Please note that papers are due Jan 15, 1986. - Ed]

------------------------------

From: Mark Zimmermann
Date: Wed Oct 30 04:26:42 1985
Subject: interactive index-in-context queries?

Summary: I want a fast, simple, interactive browsing tool, and
something like an extended index-in-context (where the full text of
the document is available instantly when requested from an index
item) seems to have a lot of potential.

Questions:
  --what are the difficulties with this approach?
  --do products exist to do this (esp. on Macintosh or Sun)?
  --are there good places to read about this in the literature?

Appended below is part of a message I sent out to a friend recently,
describing in more detail the background of the above task. If
anybody can help, please send mail to me here ("zimmer@lll-tis") or
at "zim@mitre". Tnx! ^z

**************

Here's my current project -- I want to create a fast, interactive,
index-in-context (maybe this is called a "KWIC" = "Key Word in Context"?)
to handle multi-megabyte data files (e.g., my collection of 3000 msgs
from the past year). I'll describe what I fantasize having on the Mac
(which will be limited to 200K or so files) -- and maybe you can
comment or help, if you like. I think that it's a pretty trivial
project on a Mac or Sun (maybe a few days work, plus time to improve
the user interface?). Very tough to do the user interface on a
non-bit-mapped-screen such as our standard dumb terminals hooked up
to the mainframe at work ....

What it looks like: system has a window with scroll bars that shows a
chunk of the index-in-context -- the alphabetized (ignoring case)
words that are indexed are all lined up in the middle, like:

...azilians have domesticated the aardvark and are using it to ...
...common to be using a pine wood abacus for rapid calculations...
...have domesticated the aardvark and are using it to perform a...
... domesticated the aardvark and are using it to perform a var...
etc.  (index words are right here ^^^, of course)

User scrolls around the index, and when something looks potentially
interesting/relevant, clicks on that item and another window opens up
showing a big chunk of text (several dozen lines, at least) around
that point, also in a scrollable window. Everything happens
instantly....

One might also be able to edit, in a very simple way, the index --
besides having a predetermined list of words to ignore (a, an,
the...) one might let the user click on an index entry and then hit
backspace, or "cut", to delete it....

Implementation: I would preprocess the document to remove tabs,
replace s with spaces, etc. Then scan through the document and build
up a list (linear array, really) of pointers to the first letter of
each word to be indexed (an address, maybe relative to the first
entry in the document). Then sort that list so that the 10 (or so)
letters after the pointed-to locations come in alphabetical order
(ignore case).
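
The scan-and-sort step just described can be sketched roughly as
follows. This is only an illustration in Python, not the MacFORTH
version mentioned later in the message. The names build_index,
STOP_WORDS and KEY_LENGTH are made up for the example, the stop-word
list is a placeholder, and the sketch assumes that tabs and line
breaks are both turned into single spaces so that offsets stay valid.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # placeholder "words to ignore" list (assumption)
KEY_LENGTH = 10                  # sort on the 10 (or so) letters after each pointer


def build_index(raw, stop_words=STOP_WORDS, key_length=KEY_LENGTH):
    """Return (text, pointers): the cleaned text and the sorted word offsets."""
    # Assumption: tabs and line breaks each become one space, so the
    # cleaned text has the same length as the raw text.
    text = re.sub(r"[\t\r\n]", " ", raw)

    # One pointer (character offset) per word that is not on the ignore list.
    pointers = [m.start() for m in re.finditer(r"[A-Za-z]+", text)
                if m.group().lower() not in stop_words]

    # Order the pointers by the characters that follow them, ignoring case.
    # Word boundaries are ignored: the key simply runs on into the next word.
    pointers.sort(key=lambda p: text[p:p + key_length].lower())
    return text, pointers


if __name__ == "__main__":
    text, pointers = build_index("The aardvark and the abacus\tare here")
    for p in pointers:
        print(p, text[p:p + KEY_LENGTH])
```

Python's sort is stable, so entries whose 10-character keys are equal
stay in document order.
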
NOTE: we ignore word boundaries to save time and simplify. We ignore
"record" boundaries, to save time and simplify. We probably put a
bunch of spaces at the beginning and end of the file, to
simplify/eliminate end effects.

So, we now have our document (minus s) and our sorted index of
pointers into that document. As the user scrolls around in the index,
we fetch each line to be displayed by backing up 40 or so characters
from the pointed-to character, then typing out 80 characters or so.
Show result in a window. Hold everything in memory at once ... if
that's not possible, we have to have some pre-fetching (paging?) set
up to get all the text surrounding the areas being viewed in the
index into fast memory before user clicks on an item to be fetched.
If an item is selected, open another window and print out the pointer
location contents +-1000 or so bytes.

DQW suggests perhaps showing in the index only the index word plus a
count of how many occurrences it has, i.e.:

     aardvark      1
     abacus        2
     and       12345
     Akhiezer     17
     etc.,

and then with a click expanding a chosen word into its full list of
index items. Might be useful as an option, but I'd delay it until we
see how unwieldy the full way is (with scroll bars, you can skip over
dull zones easily).

CS suggests that we could use the above idea to get into sub-indices
-- that is, if one clicked on "Akhiezer" above, one might get into an
index which was sorted by all the words within 100 (or so?)
characters of the occurrence of Akhiezer -- a rather different idea,
that might require more pre-processing or auxiliary index files than
I want to tackle right now.

The index should only take up 1/2 or so the size of the whole
document (4 bytes/pointer, and the average indexed word is probably
at least 4 letters long, so there should be little chance of the
index exceeding the size of the document). A left parenthesis is a
good delimiter to use, in addition to a space.

So, any comments?
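
The display side of the scheme above (one index line per pointer, a
larger chunk when an item is picked, and the word-plus-count view)
might look something like this. Again this is a Python sketch only,
with everything held in memory. The function names, the 40-space
padding and the letters-only definition of a "word" are assumptions
made for the example.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")  # assumed definition of a "word"
PAD = " " * 40                   # spaces at both ends to remove end effects


def prepare(raw):
    """Pad the text and return (text, pointers sorted on the next 10 characters)."""
    text = PAD + re.sub(r"[\t\r\n]", " ", raw) + PAD
    pointers = sorted((m.start() for m in WORD.finditer(text)),
                      key=lambda p: text[p:p + 10].lower())
    return text, pointers


def index_line(text, p, before=40, width=80):
    """One line of the index window; the indexed word starts at column `before`."""
    start = max(0, p - before)
    return text[start:start + width]


def context(text, p, radius=1000):
    """The chunk for the second window: about `radius` bytes on either side."""
    return text[max(0, p - radius):p + radius]


def word_counts(text, pointers):
    """The condensed view: each indexed word with its number of occurrences."""
    return Counter(WORD.match(text, p).group().lower() for p in pointers)


if __name__ == "__main__":
    text, pointers = prepare("the aardvark and the abacus")
    for p in pointers:
        print(index_line(text, p))
    print(word_counts(text, pointers).most_common())
```

Because of the padding, every index line comes out the same width and
the indexed words line up in one column without any special handling
at the start or end of the file.
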
If this already exists commercially, please give me a pointer to it,
so I don't reinvent the whole thing ... I'm told that a KWIC index is
a "standard student project", but that it tends not to be too useful
... but with the speed and scrollability that I envision, I think it
would be a tremendous tool for me to have a hand in browsing through
my mountains of files.

I'm going to try to do something quick and dirty in MacFORTH to prove
the concept out for this index-context thing ... have discussed it
with a variety of friends, and gotten various responses.... Will
forward an edited excerpt from this note to IRList and info-mac, and
see if there is any help there ....

(zim@mitre or zimmer@lll-tis)

------------------------------

From: Don Walker
Date: Thu, 31 Oct 85 16:47:40 est
Subject: CALL FOR PAPERS; ACL 1986 Annual Meeting

                         CALL FOR PAPERS

24th Annual Meeting of the Association for Computational Linguistics
   10-13 June 1986, Columbia University, New York, NY, USA

SCOPE: Papers are invited on all aspects of computational
linguistics, including, but not limited to, pragmatics, discourse,
semantics, and syntax; understanding and generating spoken and
written language; linguistic, mathematical, and psychological models
of language; phonetics and phonology; speech analysis, synthesis, and
recognition; translation and translation aids; natural language
interfaces; and theoretical and applications papers of every kind.

REQUIREMENTS: Papers should describe unique work that has not been
submitted elsewhere; they should emphasize completed work rather than
intended work; and they should indicate clearly the state of
completion of the reported results. Authors should send eight copies
of an extended abstract up to eight pages long (single-spaced if
desired) to:

     Alan W. Biermann
     ACL86 Program Chair
     Department of Computer Science
     Duke University
     Durham, NC 27706, USA
     [919:684-3048; awb%duke@csnet-relay]

SCHEDULE: Papers are due by 6 January 1986.
Authors will be notified of acceptance by 25 February. Camera-ready
copies of final papers prepared on model paper must be received by
18 April along with a signed copyright release statement.

OTHER ACTIVITIES: The meeting will include a program of tutorials and
a variety of exhibits and demonstrations. Anyone wishing to arrange
an exhibit or present a demonstration should send a brief description
to Alan Biermann along with a specification of physical requirements:
space, power, telephone connections, tables, etc.

CONFERENCE INFORMATION: Local arrangements are being handled by Kathy
McKeown and Cecile Paris, Department of Computer Science, Columbia
University, New York, NY 10027; 212:280-8194 and 8125; mckeown and
cecile @columbia-20.arpa. For other information on the conference and
on the ACL more generally, contact Don Walker (ACL), Bell
Communications Research, 445 South Street, MRE 2A379, Morristown, NJ
07960; 201:829-4312; walker@mouton.arpa or walker%mouton@csnet-relay
or bellcore!walker@berkeley.

Program Committee:
     Alan W. Biermann, Duke University
     Kenneth W. Church, AT&T Bell Laboratories
     Michael Dyer, University of California at Los Angeles
     Carole D. Hafner, Northeastern University
     George E. Heidorn, IBM T.J. Watson Research Center
     David D. McDonald, University of Massachusetts
     Fernando C.N. Pereira, SRI International
     Candace L. Sidner, BBN Laboratories
     John S. White, Siemens Communication Systems

LSA SUMMER LINGUISTIC INSTITUTE: ACL-86 is scheduled just before the
53rd LSA Institute, which will be held at the Graduate School and
University Center of the City University of New York from 23 June to
31 July. The 1986 Institute is the first to focus on computational
linguistics. During the intervening week, a number of special courses
will be held that should be of particular interest to computational
linguists. For further information contact D. Terence Langendoen,
CUNY Graduate Center, 33 W.
42nd Street, New York, NY 10036; 212:921-9061;
tergc%cunyvm@wiscvm.arpa.

------------------------------

From: Werner Uhrig
Date: Sat 2 Nov 85 14:13:34-CST
Subject: COMPUTERISED ARCHIVES - MUSEUMS ON DISC

[ from "The Economist", Oct 26, 85. page 100 ]

COMPUTERISED ARCHIVES - MUSEUMS ON DISC

Museums and libraries face a dilemma. They wish to preserve their
treasures but they must allow the public access to them. The two jobs
are often incompatible. ... Microfilm provides one answer, but it is
inadequate for several reasons. The film itself is as perishable as
paper, while the film-reading machines are bulky and expensive.

The Smithsonian Institution's Air and Space Museum in Washington (the
world's busiest museum, with 12m visitors last year) thinks it has a
better way. Mr. Hernan Otano, head of the Smithsonian's "advanced
projects" division, has developed a way of recording digitised images
of documents that makes them easy to record and retrieve. It is
having a test run on more than 50,000 documents that make up the
archives of Wernher von Braun, the rocket pioneer. By the middle of
next year, you will be able to walk into the Smithsonian and buy one
video disc on which are copies of all von Braun's papers.

The Air and Space Museum has already recorded its collection of
nearly 1m photographs on video discs using a simple analog system,
where the photograph is filmed by a video camera and the image
transferred to a video disc .... .... The handling of originals
decreased by 50% in a year after the first disc was made available.
The discs sell for $30 each.

But the Otano project is more ambitious. By turning the images into
digital information, you can transmit them over telephone lines and
simultaneously index the text.
It begins with a digitising video camera, which automatically
focuses, adjusts for things like light and paper colour, zooms in or
out to capture the whole document and then makes a black-and-white
image consisting of 4m spots, or pixels, digitally stored. Unlike
microfilm the image being recorded is displayed on a screen so it can
be checked. It is then compressed by a personal computer into 50
kilobytes of information per image by ignoring large uniform areas.

Mr. Otano's group added software that, in the case of printed
documents, can turn any text into an ASCII text string - meaning that
a computer can then recognise the words (it can read 2,000 different
typefaces). By looking for words, it can then make an automatic index
of the text in the documents themselves. ...

And, to give an idea of the scale, Encyclopaedia Britannica could go
on a double-sided disc, as could the contents of 33 filing cabinets.
With a video-disc player and a printer, a museum can have, for a few
thousand dollars, a system that can produce indexed, word-perfect and
effectively indestructible copies of its collection to which anybody
can have access merely by buying a disc. ...

The idea is so simple that museum directors the world over will be
kicking themselves that they did not think of it (the Smithsonian has
applied for a patent). And eventually, enthuses Mr. Otano, colour
photographs and paintings as well as three-dimensional objects
(fossils, coins) could be photographed and digitally stored in the
same way. The gargantuan collections of the great museums - the
Smithsonian alone holds 100m items - will then be safe.

------------------------------

From: Henry Nussbacher
Date: Mon, 21 Oct 85 10:49 EDT
Subject: Ir-List now abstracted into Internetwork Database server

As per a recent announcement in Ir-List, the Ir-List digest has now
been added to the Database server at node Bitnic in Bitnet.
Refer back to Issue #15 (listed as September 19th - but should have
been October 19th) for further details.

Hank

------------------------------

END OF IRList Digest
********************