Date: Tue, 31 Dec 85 19:09 EST
To: irdis at vpi
Subject: IRList Digest V1 #28

IRList Digest           Tuesday, 31 Dec 1985      Volume 1 : Issue 28

Today's Topics:
      Email - Up and Down
      Article (long) - Report on Waterloo Conference on OED

----------------------------------------------------------------------

From: FOX 31-DEC-1985 19:02
Subj: E-Mail, Machines, and New Year

Our machines are up again and so CSNET mail should begin flowing. Please resend any messages that were returned to you over the period since 19 December. I will process mail sent during that time as soon as possible. This machine may go down again on Saturday, but we hope to have an alternate CSNET link up by then.

Happy New Year! - Ed Fox

------------------------------

From: "Michael Lesk at petrus.UUCP"
Date: Wed, 18 Dec 85 21:41:34 est
Message-Id: <8512190241.AA01618@petrus.UUCP>

Ed, here is the writeup of the Waterloo conference for the SIGIR newsletter ... Mike

[Note: the original had all sorts of fine use of troff and -ms macro calls, which were removed for readability here. The typeset form should appear in an upcoming issue of ACM SIGIR Forum. - Ed]

Information in Data: Using the Oxford English Dictionary on a Computer

November 13, 1985

The recorded history of the English language, the OED, is going to be available in machine-readable form in a few years. How should it be arranged, and what will it be used for? The University of Waterloo's Centre for the New OED is studying these questions, and the Centre ran its first conference on Nov. 7-8, 1985, entitled ``Information in Data.'' The conference was attended by about 95 people, split between linguists, English literature experts, and computer types.

Major issues are: should the dictionary be delivered as a text stream, with some associated indexing, or as a database, in which some of the fields are character strings? What are the major applications for which the computer form of the OED will be used?
How will it be used by computer people, and how by linguists?

Understandably, but perhaps regrettably, the conference contained a large number of talks in which people described what else they would like to have in a dictionary. The OED is not intended as the only reference work a library needs to acquire, and the most rewarding talks were those that described the richness of the OED and the complexity of this project, rather than those that asked for still more data. The OED describes the history and the potential of the English language, but leaves the statistics of current use and the ephemera of current names to others. There should be quite enough in it for many interesting projects.

The conference opened Thursday evening with a welcome from Tom Brzustowski of the University of Waterloo, which has strongly supported this project, thanks to the personal interest of the President, Douglas Wright. John Simpson (New Words editor of the OED) gave the main introductory talk. He described the goals of the Supplement: to broaden coverage both geographically and by types of source work, as well as to bring the dictionary up to the 1980s. As an example of the broader coverage of the new OED, the final volume of the Supplement will contain ``wysiwyg'' (non-hackers may not recognize this as a kind of text formatting program, an acronym for ``what you see is what you get'').

Once finished, the Supplement is to be merged with the old volumes, producing a 15-20 volume cumulated work. To accomplish this, the entire dictionary is being converted to machine-readable form, and it is intended to update the machine-readable version continuously. At present Waterloo has on line only the letter M, which was done first. The remainder of the keystroking, which is being done in alphabetical order, is now up to the letter I, with proofreading and corrections done through the letter E. All input will be finished in mid 1986, and the merging process completed later that year.
The size of the OED project can best be appreciated from the phrase used by John Simpson to describe recent activity: ``the headlong rush to complete the Supplement over the last ten years.''

Many have asked why the dictionary was not scanned, rather than rekeyed. Although it is now possible to scan the OED, it is not simple, and in addition it was desirable to label the information in the dictionary at least with typesetting codes. Adding these codes could not be done mechanically, and the work of doing this during OCR scanning makes total rekeying as simple and cheap as scanning. The full dictionary will contain about 350M characters, and with format information added the whole computer file will be about 500 Mbyte. OUP intends to make the dictionary available in machine-readable form at a quite reasonable price for researchers, and is contemplating CD-ROM as the medium.

John Stubbs described Waterloo's role: they have a Centre, and offer visiting scholar positions plus part time student help. Prof. Stubbs (of the History department) and Prof. Tompa (of CS) are co-directors of the Centre. Waterloo is designing a database structure for the OED, collecting information on the kinds of questions linguists ask of the OED, and helping Oxford format the typesetting tapes for assistance in printing the merged new edition of the dictionary. J. Howard Johnson of the Waterloo CS department wrote a new translator from regular grammars to finite state machines, which has been used to analyze the typesetting format produced by the keystrokers and translate it into a more readable and more easily processed representation. Gaston Gonnet (also a Waterloo CS professor) recently provided a new query language, GOEDEL, which can be used to ask questions of the dictionary file. The Centre is accumulating queries linguists present to the OED, to see what other software tools are needed to answer them.

Friday morning, the first session dealt with uses of on-line dictionaries.
The speakers generally wanted more information in dictionaries: semantic markers and frequency data. For modern texts, some of this can be accumulated automatically; but it is not clear how an historical dictionary such as the OED is to determine the relative frequencies of word sense in earlier centuries.

Prof. Henry Kucera of Brown University reminded us of the flood of spelling checkers, including one that accepted ``unbearing'' having stripped off both the -ing and the un-. He also discussed the use of dictionaries in parsing, and the need to do more than local disambiguation of noun/verb ambiguities, quoting some familiar garden-path sentences (``Can covers thrown overboard from Russian submarines have been recovered''). Most of his talk, however, urged that dictionaries give subentries of a given word in order of frequency of use. Of the 5000 most frequent verbs in English, 3000 can also be nouns. Data from the Brown corpus, for example, show that ``spring'' is more often a noun (the season) than a verb; yet most dictionaries list the verbal sense first. Other words often given in the wrong order include: address, aid, attempt (noun more common) and affect, bet (verb more common).

Donald Walker of Bell Communications Research followed, reviewing the use of reference works to help analyze the text of news stories. A comparison of the New York Times with the Merriam-Webster 7th New Collegiate shows that 2/3 of the words in the dictionary are not used in the Times, probably not surprising, but also that about 2/3 of the word forms in the Times are not in the dictionary! Many of those are inflected forms, but a large number are proper nouns, hyphenated forms, and incorrect spellings. Work is proceeding on finding a list of proper nouns; 42,000 isolated from the Times were compared with 46,000 isolated from the World Almanac and 4,000 were in common.
In addition, a list of 260,000 noun phrases has been accumulated from the Times, and they are looking to see which should be in the dictionary. Dr. Walker also reminded us of the FORCE4 program, which disambiguates word senses by assuming that the sense coming from the sublanguage domain that predominates in the article is probably the right one (work with Bob Amsler). This program required a list of semantic markers on each word sense indicating the subject area with which that sense is used; such a list is in Longman's machine-readable dictionary.

Prof. George Miller of Princeton University, who is studying how children learn vocabulary, explained some traditional mistakes children make using words from a dictionary, e.g. the child who wrote ``My family erodes a lot'' having seen ``erode'' defined in a dictionary as ``eat out, eat away, wear away.'' Children are good at learning words (they pick up 22 a day from 6 years to 8 years of age), but dictionaries don't help them. Prof. Miller is now building a matrix of words vs. senses; effectively, definitions are listed down the left margin, and words across the top. To get a dictionary, read the matrix columnwise; to get a thesaurus, read it rowwise. He is building synonym sets as well, in which related words such as {abandon, forsake, desert, abdicate} are grouped, and would like a machine-readable source of such data. Other modes of access to words he is exploring are psychological sense relations, such as contrasts, similars, class inclusion, case relations, and part-whole. He has found that the dictionary browsing programs that result from this work are sufficiently enjoyable even to qualify as games, as well as their expected uses in word processing and education.

The second session of the conference featured several linguists, and focussed more on the OED itself.
These speakers also wanted additional data in the dictionary, but this time they wanted linguistic data about word origins and word uses rather than semantic data.

The session started with Gisele Losier's talk on studying loan words in the OED. Most of her talk dealt with the development of words that changed from imperative verbs to nouns with similar meanings, e.g. cease-fire. This has happened in other languages (Latin, Greek and French examples were given) and sometimes the word has come into English already converted, e.g. permit or quibble. Other, more obvious instances are farewell, encore, go-between, and has-been. There is no easy way to find such words automatically, unfortunately.

Prof. Christopher Dean, of the University of Saskatchewan, then described the problems of compiling an historical dictionary of local dialects. Basically, this is an almost impossible problem. For his special interest, the dialect of Yorkshire, there are no adequate texts before the 17th century. Worse yet, what texts there are tend to represent educated usage, and thus standard English. As a result, an historical dialect dictionary can not be made, and the only hope for tracing the history of a dialect is to work backwards from present day speech, using phonological rules. If the variations among national versions of English (e.g. Canadian, Australian, etc.) ever reach the status of separate dialects, perhaps the OED could be used to track the history of these; although that much divergence seems unlikely in our era of world-wide communications.

The last linguistic speaker was Neil Hultin, who gave an interesting but long description of Murray's attitude towards word development in the context of Victorian thought. Murray assumed that words developed from concrete meanings to abstract meanings, and from simple use to complex use, whether or not the historical evidence supported this.
For example, ``ardor'' is given first as meaning ``heat'' and only later as ``passion'' even though the latter meaning is three hundred years earlier. When studied, primitive languages have not turned out to consist of monosyllables representing only concrete objects, as the Victorians believed. The OED, despite its protestations of experimental basis, is following a 19th-century view of progress when it arranges words in a ``logical'' order that goes from concrete and simple to abstract and complex.

As an aside of intellectual history, the reverse assumption is now common: spider webs were commonly assumed by modern scientists to have evolved from the messy and irregular forms to the neat and elegant orb webs with radial and concentric strands. Now evidence is showing that the simple and elegant forms are the older, and the irregular forms are the newer, which confounds our twentieth century expectations.

The last session was devoted to artificial intelligence and knowledge bases. Randy Goebel began with a tutorial talk on knowledge representation and the design of expert systems. The talk was rather vague. John Sowa followed with a list of puzzling phenomena that a lexicon should be able to explain, e.g. recognizing an instrument even when the word is used as an object, so that a pair of utterances such as ``The janitor opened the door with an old key.'' and ``The janitor opened the door. He used an old key.'' are treated similarly. Other curiosities included the need to understand that although ``musician'' is a subclass of ``man'' (and of ``woman''), ``bad musician'' does not imply ``bad man.'' And why can one say ``former pet'' but not ``former cat''? Sowa's conceptual graphs, which look like Fillmore case grammar extended, can provide solutions to some of these problems, but not to everything. Exactly what would need to be added to the OED to permit computer disambiguation of such phrases is still a research project.

Frank Tompa ended the conference, alluding to the use of the dictionary as a knowledge repository but not really describing how such a database might be built. Most of the discussion revolved around the use of the dictionary, e.g. the possibility of gathering statistics on users and what they did (R. R. K. Hartmann). But the most common suggestion was to arrange the dictionary in some kind of importance hierarchy, so that simpler questions could be answered with an excerpted dictionary. Those more familiar with the dictionary were disturbed by these suggestions; some of the computer scientists in the audience did not seem to realize that Oxford publishes several smaller dictionaries, including the Concise, Shorter, Little and Pocket dictionaries, not to mention the line of Learner's dictionaries. These are edited separately; OUP rejects the idea that you can convert an unabridged dictionary to a collegiate dictionary by crossing out the last half of each definition. Availability of the OED for research is actually dependent on the special nature of the OED; Oxford considers it unlikely that others will abuse the availability of the OED by trying to publish a similar dictionary, and is being more circumspect about the availability of the Concise dictionary. (The generosity of OUP in making the Learner's dictionary available today, and the OED in the future, however, should be praised by all working computer linguists).

A more serious problem with the complexity of the OED is that there are many closely related senses for some words, and for some computer applications it might not be interesting to distinguish these while the more contrasting senses should still be separated. No one addressed this question, however, despite the general interest in automatic simplification. One hopes at least that the meeting taught the computer people a better appreciation of the size and complexity of the OED.
Once ``to crack a nut with a steam-hammer'' was a standard metaphor for overkill; this conference provided ``to use the OED to spell cat'' (Diana Patterson).

Perhaps the most interesting discussion of all took place outside the conference, in an attempt to decide what should be in an electronic database of the OED. Should it be a text stream, with some added indexes, or should it be a data structure, in which some fields happen to be character strings? The text stream is simplest to build, and makes fewer decisions in advance, thus foreclosing the fewest options. A data structure, however, offers the ability to connect (with some kind of pointer structure) citations from the same author, derivations from the same foreign language, and other relationships of interest to the users. Frank Tompa has done considerable work designing such a possible data structure, although it's not implemented yet. There are some places where additional knowledge could profitably be inserted in the OED; for example, work is underway in Edinburgh on a program to convert OED pronunciations to IPA.

This discussion may be moot, however, since Oxford needs to be able to regenerate the exact page images from whichever representation is chosen. For example, within the quotations even the end-of-line hyphenation matters, since it may reflect hyphenation practices of an earlier century. Thus any data structure must have a procedure for regenerating the text stream; and it will undoubtedly be made mechanically from the text as it comes from the pages. Either representation could then be chosen for distribution, and sent out along with the programs to convert to the other one. The danger, to me, is that in the process of constructing the data base some information would either be discarded or made more difficult to access.
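
The requirement that any data structure be able to regenerate the text stream exactly can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the angle-bracket tags (hw, au, q), the sample entry, and the function names are all hypothetical, and are not the OED keying format or the Waterloo design.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical markup: every piece of text sits inside <tag>...</tag>.
FIELD = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


@dataclass
class Field:
    tag: str   # assumed labels: "hw" headword, "au" author, "q" quotation
    text: str  # exact characters, including any end-of-line hyphenation


def to_structure(stream: str) -> List[Field]:
    """Build the structured form, refusing to drop any character."""
    fields: List[Field] = []
    pos = 0
    for match in FIELD.finditer(stream):
        if match.start() != pos:
            raise ValueError(f"untagged text at offset {pos}")
        fields.append(Field(match.group(1), match.group(2)))
        pos = match.end()
    if pos != len(stream):
        raise ValueError(f"untagged text at offset {pos}")
    return fields


def to_stream(fields: List[Field]) -> str:
    """Regenerate the text stream from the structured form."""
    return "".join(f"<{f.tag}>{f.text}</{f.tag}>" for f in fields)


def citations_by_author(fields: List[Field]) -> Dict[str, List[str]]:
    """Connect quotations to the most recent preceding author field."""
    index: Dict[str, List[str]] = {}
    author = None
    for f in fields:
        if f.tag == "au":
            author = f.text
        elif f.tag == "q" and author is not None:
            index.setdefault(author, []).append(f.text)
    return index


# Made-up sample entry; the line break after "ex-" stands for
# end-of-line hyphenation that has to survive the round trip.
SAMPLE = (
    "<hw>word</hw>"
    "<au>Author A</au><q>an ex-\nample line</q>"
    "<au>Author B</au><q>another line</q>"
    "<au>Author A</au><q>a third line</q>"
)

fields = to_structure(SAMPLE)
assert to_stream(fields) == SAMPLE   # nothing discarded
by_author = citations_by_author(fields)
```

A round-trip check of this kind addresses the danger just mentioned: if regeneration ever fails to reproduce the input, some information was lost in building the structure. The author index shows the sort of cross-connection that the text stream alone does not give directly.
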
The computer types kept thinking that they could rationalize some of the inconsistencies in the OED, and the lexicographers kept pointing out that these were not inconsistencies, but carefully thought-out choices (e.g. whether an etymological note appears before the sense definitions or between two of them). One Waterloo staffer finally said that ``there are no inconsistencies, only rules we haven't yet discovered.''

I was also left with a strong wish that further progress could be made on text archives. To study the English language, we would like not only the OED in machine-readable form, but also the standard works of English literature. Although some efforts in this direction have been made (e.g. the Oxford Text Archive), much more is needed. Part of the problem is the disagreement between the many scholars who would want a first edition of each work, and the computer types who might be satisfied with modern editing (i.e. twentieth century spelling, grammar and punctuation, not to mention page layout). It is substantially more expensive to get an early edition into machine-readable form, since it probably can not be read by an OCR scanner. One can imagine the OED citations containing cross references to a standard archive; and it would certainly be undesirable to define an OED database in such a way that it could not be extended to reference other lexicographic data files, and also bibliographical, textual, geographical, and biographical files (consider, for example, connecting OED author references to the DNB entries for those writers).

To summarize, why is this project so interesting and exciting? For a computer scientist, the availability of the OED is of particular importance because of the rise of lexicon-based parsers and natural language understanding programs.
Twice recently I have had major computational linguistics projects explain to me that their grammars have only a few (fewer than 5) rules, with nearly all grammatical information stored with the words rather than by using syntactic rules. Yet, comparing even an ordinary desk dictionary to most computer lexicons, it is amazing how much more information there is in a dictionary. The OED, as the largest source of information about words in English, offers the possibility of greatly augmenting our computer lexicons, exactly when such lexicons are believed to be essential for developing programs that understand language.

The conference was generally successful. It brought together specialists in English and specialists in computers, without a violent clash of the two cultures, and with benefits to both sides. I look forward to the availability of the full machine-readable OED, and I think it will be of great benefit to both the computer and the linguistic community.

- Michael Lesk, Bellcore (bellcore!lesk, lesk@bellcore.csnet)

PS: Further information about the Centre for the New OED is available from the administrative director, Gayle Johannesen, 519-885-1211, ext. 6200. Her address is: UW Centre for the New Oxford English Dictionary, 105 Dana Porter Library, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada.

------------------------------

END OF IRList Digest
********************