Date: Fri, 8 Aug 86 18:35:21 edt
From: vtisr1!irlistrq
To: fox
Subject: IRList Digest V2 #33
Status: R

IRList Digest           Thursday, 7 August 1986      Volume 2 : Issue 33

Today's Topics:
   Discussion - Machine Readable Collins Dict., Job at Leeds Univ.

----------------------------------------------------------------------

Date: 24-JUL-1986 23:09:36
From: RAHTZ%UK.AC.OXFORD.VAX1@AC.UK
Subject: The Machine-Readable Collins English Dictionary, Job at Leeds

The Machine-Readable Collins English Dictionary
Summary of work in progress

Sebastian Rahtz
Department of Computer Studies
University of Southampton

1. Introduction:

This short document summarizes the responses I had to a letter sent out in June 1986 to all the people who have ordered a tape of Collins English Dictionary from the Oxford Text Archive (my thanks to Lou Burnard for the list of names and addresses). I am grateful to all those who replied to my request for information about how they had decoded the text, and what they were doing with it; since it was apparent that quite a lot of work had been done, and that some were much further on than others, it seemed sensible to send out a summary of the replies. I have either included sections of electronic mail directly, or summarized paper mail.

...

2. Philip Taylor, University of London

Date: 1-JUL-1986 11:23:07
From: CHAA006@UK.AC.RHBNC.VAXA

I carried out some work on transliterating the dictionary from photo-typesetting codes to a more useable form some years ago, when I first received the tape. I had two objectives:- (1) to provide an online English-language HELP system, using VMS help, for all entries in the dictionary, and (2) to integrate the dictionary into the Dennison spelling checker (which also runs on the VAX). Neither of these projects was 100% successful, but the intermediate results may be of some use to you. (As part of (1), I also implemented the core of the IPA on a Mellordate DT80/1 (VT-100 look-alike), with reasonable success).
I should be happy to pass on all the work I have done, provided only that any publications resulting from this work acknowledge the various contributors, and that any further work which you carry out should be equally freely available among the Academic community.

Philip Taylor (RHBNC, Univ. of London) [CHAA006@UK.AC.RHBNC.VAXB]

2.1 Pascal programs

Here are the more useful files from my work on the Collins English Dictionary; they are written in Pascal and Macro-32. The programs TYPESET, DECRYPT and PARSE are good starting points. TYPESET, as is, will produce quite acceptable output even on unmodified VT100s; if you have any DT80/1s, I can copy the IPA ROM for you, and the output will then be as close to Collins typeset form as I was able to achieve within the time available. If you have no DT80s, I could let you have the IPA in 8*8 dot-matrix form, and you could burn it into ROMs for whatever devices you do have.

3. Ian Ellis, University of New England

From: ian%oz.neumann@oz.munnari 3-JUL-1986 03:41
Date: Thu, 3 Jul 86 11:30:10 est

Thank you for your letter regarding CED. As yet no one on this Campus has tried to use CED other than as a list of words. We did try to figure out some of the symbols and produce a database, but lack of user pressure has allowed us to put it on the back burner.

Ian Ellis, Director, Computer Centre, University of New England

4. Edward Fox, Virginia Tech

From: vtisr1!fox@gov.css.seismo 3-JUL-1986 08:06

You have hit the jackpot! I have worked with several students during the last year on the Collins English Dictionary. One completed his M.S. project specifically on this. We are almost done with production of a database that can be used from Prolog or from any relational database system, and probably modified for other systems. I hope to be sending a tape to Oxford Text Archive by the end of August.
Ed Fox (BITNET[cheapest]:foxea@vtvax3 or foxea%vtvax3.bitnet@wiscvm.arpa; CSNET:fox@vt;Internet:fox%vtisr1.uucp@seismo.css.gov;UUCP:seismo!vtisr1!fox)
Dr. Edward A. Fox; Dept. of Computer Science; 562 McBryde Hall
Virginia Tech, Blacksburg VA 24061; (703) 961-5113 or 6931

We have done everything EXCEPT for the phonetic and etymology information - I hope you don't need them! All I have so far is the MS report - ...

5. David Eckersley, University of Salford

Date: MON, 07 JUL 86 13:57:38 GMT
From: D_ECK@UK.AC.SALFORD.SYSC

University of Salford Computing Services: Dr J B Slater, Director

I reply to your letter of June 24th concerning the Collins English Dictionary from the Oxford Text Archive. I'm afraid we have for the time being shelved our plans for using this data. The person who was to do the work left us, and I have not taken it up. We did not manage to attach any consistent meanings to the embedded codes in the text.

D Eckersley (Secretary, IUSC)

6. Eric Atwell, University of Leeds

From: E S Atwell [eric@uk.ac.leeds.ai]
Date: Tue, 15 Jul 86 13:26:11 bst

I'm afraid I haven't done anything of use to you with the CED tape: I got it mainly to evaluate it and compare it to the machine-readable versions of two other dictionaries, the Oxford Advanced Learner's Dictionary (OALD) and the Longman Dictionary of Contemporary English (LDOCE). I am researching into aspects of parsing and grammatical analysis of unrestricted `raw' text, for which a large non-`toy' dictionary is required. Each word in the dictionary needs detailed grammatical information; and the grammatical codes used in OALD and LDOCE are far more refined and detailed than those of CED, so I have concentrated work on the other two.
In fact, LDOCE has already been converted into a database-type format, and this form is available for general (including commercial) research, though at a price - at the Alvey workshop on linguistic theory and computer applications at UMIST last September, a figure of pounds 30,000 was mentioned! As an alternative, I have a copy of the OALD tape, and last year I got one of our undergraduates to attempt a reformatting of this as a Third Year Project. Unfortunately, he did not get as far as a form worthy of general distribution, but after graduating he stayed on here over the summer to finish parsing the original file; the end result is exemplified by the sample at the end of this letter. I am currently trying to get some funding from OUP to carry this work further (in collaboration with Prof Sampson of the linguistics dept. and Tony Cowie from our English dept.)

However, if you are committed to using the CED, I suggest you get in touch with the Speech research group at IBM Scientific Centre in Winchester; they have extracted a quarter-million wordlist from CED I believe, with grammatical part-of-speech and phonetic transcription codes (but with other fields ignored); the CED phonetic transcriptions are, they say, better than those of OALD or LDOCE, which is why they are 'out on a limb' in the sense that most other researchers I know of are using OALD or LDOCE.

Eric Steven Atwell
Artificial Intelligence Group
Department of Computer Studies      phone: +44 532 431751 ext 6307/6119
Leeds University                    JANET: eric@uk.ac.leeds.ai
Leeds LS2 9JT                       UUCP: ...!seismo!mcvax!ai.leeds.ac.uk!eric
England                             EARN/BITNET/ARPA: eric%uk.ac.leeds.ai@rl.earn

EXAMPLE OF PARSED REFORMATTED OALD FILE:

headword :B
alternative spelling of headword :b
pronunciation :bi
+++++++start of pieces+++++++
conjugation or plural label :pl
conjugation or plural spelling :B's
conjugation or plural spelling :b's
pronunciation :biz
__________definition__________
text :the second letter of the English alphabet.
**********end of entry**********

headword :baa
pronunciation :bq
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :cry of a sheep or lamb.
***change in part of speech***
word class label :vi
text :(baaing, baaed or baa'd /bqd/) make this cry; bleat.
====subentry====
derivative :%@-lamb
word class label :n
---subentry definition---
text :child's word for a sheep or lamb.
**********end of entry**********

headword :baas
pronunciation :bqs
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :(S Africa) boss.
**********end of entry**********

headword :babble
pronunciation :%babl
+++++++start of pieces+++++++
word class label :vi
word class label :vt
__________definition__________
verb pattern :2A
verb pattern :2B
verb pattern :2C
text :talk in a way that is difficult to understand; make sounds like a b
__________definition__________
verb pattern :6A
verb pattern :15B
====subentry====
idiom :@ (out)
text :, repeat foolishly; tell (a secret): @ (out) nonsense/secrets.
***change in part of speech***
word class label :n
nountype :U
text :childish or foolish talk; confused talk not clearly to be understoo
__________definition__________
text :gentle sound of water flowing over stones, etc.
====subentry====
derivative :bab.bler
pronunciation :%bablE(r)
word class label :n
---subentry definition---
text :person who @s, esp one who tells secrets.
**********end of entry**********

headword :babe
pronunciation :beIb
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :(liter) baby.
__________definition__________
text :inexperienced and easily deceived person.
__________definition__________
text :(US sl) girl or young woman.
**********end of entry**********

headword :babel
pronunciation :%beIbl
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :the Tower of B@, tower built to reach heaven. (Gen 11).
__________definition__________
text :(sing with indef art) scene of noisy and confused talking: What a @
**********end of entry**********

headword :ba.boo
alternative spelling of headword :babu
pronunciation :%bqbu
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :(as Hindu title) Mr; Hindu gentleman; Hindu clerk; (old use, pej) H
**********end of entry**********

headword :ba.boon
pronunciation :bE%bun
US pronunciation :ba-
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :large monkey (of Africa and southern Asia) with a dog-like face.
cross reference :the illus at ape
**********end of entry**********

headword :baby
pronunciation :%beIbI
+++++++start of pieces+++++++
word class label :n
conjugation or plural label :pl
conjugation or plural spelling :-bies
__________definition__________

6.1 Further remarks

It will be interesting to see what others are doing with CED and other dictionary tapes, so please do circulate your findings. You may like to join Euralex, the European association for lexicography, and find other related work through their bulletin (I assume you are not already a member as your name did not appear on the recent membership list). For details contact RRK Hartmann, Language Centre, Exeter University, Exeter EX4 4QH (no JANET address that I know of!)

I would also like to hear how your 3rd year project student gets on. TEFL students might prefer a ``browser aid" for LDOCE or OALD, as these are specifically designed for 2nd language learners; in my previous job at Lancaster University, I wrote a browser aid for the LDOCE which ELT MA students could use. The speaking CED sounds a great idea.
A major problem with `off-the-shelf' speech synthesisers is that they have no way of producing varied ``listenable" intonation contours for sentences and longer texts; but this problem is neatly sidestepped in a talking dictionary, as most fields (keyword, part of speech, spelling) do not require smooth continuous speech, and the definition fields tend to be short sentences or sentence-fragments where a very simple intonation contour would be quite acceptable to the user. Even so, as you suggest, it is still quite ambitious for a third year project!

6.2 An interesting job

From: E S Atwell [eric@uk.ac.leeds.ai] 22-JUL-1986 16:10
Subj: vacancy for NLP/AI/OR Software Engineer

I am collaborating with Professor Sampson on a Parsing research project, and we have just had the go-ahead to advertise for a software engineer to work with us on the project. I would be most grateful if you could bring the following details to the attention of any potential candidates you know of.

********* UNIVERSITY OF LEEDS ****** ANNEALING PARSER PROJECT *********

Applications are invited for a post of SOFTWARE ENGINEER, to work on a project developing a parser for unrestricted English using the connexionist technique of simulated annealing. The project (funded by the Joint Speech Research Unit) is supervised by Prof. Geoffrey Sampson of the Linguistics & Phonetics Department (where the post will be tenable) and Eric Atwell of the Computer Studies Department. The person appointed will be working on a SUN-3/52M Workstation dedicated to his/her use. Candidates should have a good honours degree; experience with natural language analysis, and of programming in a Unix environment, will be advantages.

The post is available from 1 October 1986 for a fixed term of up to 3 years. Starting salary will be within the range 8020 to 9495 pounds (under rev Other-Related IA Grade, according to age, qualifications, and experience.

Informal enquiries may be made to Prof.
Sampson on (0532) 431751 ext.6252; or by electronic mail to Eric Atwell, eric@Leeds.AI via JANET or eric%UK.ac.Leeds.AI@RL.EARN via EARN or BITNET. For application forms and further particulars write to the Registrar, The University, Leeds LS2 9JT, quoting reference no. 14/20.

****** The closing date for applications is 14 AUGUST 1986 ******

Leeds University is one of the largest and most influential universities in the country. Leeds itself is the commercial, social and sporting centre for much of North and West Yorkshire; it has all the facilities you would expect of a major city, yet the outskirts of Leeds lead directly out onto 2,00 square miles of outstandingly beautiful countryside. Leeds also offers some of the cheapest housing in England; for example, pounds 15,000 buys a two-bedroomed se or a larger terraced house.

Simulated Annealing, a technique originating in statistical mechanics, can be used in operational research and artificial intelligence in optimisation problems requiring an efficient search of a very large search space. We plan to apply this technique to parsing unrestricted English, where the search space is a set of trees.

The appointee will find a stimulating research environment at Leeds: the University is a thriving centre for research in Computer Analysis of Language and Speech, Artificial Intelligence, Operational Research and related areas. In addition to her/his dedicated workstation, the appointee will have access to a wide range of equipment and software, including specialist Departmental libraries, a VAX 11/750 dedicated to Artificial Intelligence research, and a spacious SUN LOUNGE with a network of Suns, fileserver, laserprinter, and large south-facing windows.

7. Ron Hardie, Brighton Polytechnic

[summary of letter] has only just started thinking about CED; interested to hear what others are doing.

Ron Hardie, Department of Modern Languages, Brighton Polytechnic, Brighton BN1 9PH

8. Herbert Wenzel, Erlangen

[summary of letter] writing text retrieval system for PC, now integrating dictionary, but has problems physically reading tape (sent suggestion).

Professor H. Wenzel, Institut für Technische Chemie II, Egerlandstr. 3, Erlangen, W. Germany.

9. Roger Mitton, University of London

[summary of letter] looked at Collins dictionary but finds the Oxford Advanced Learner's more useful. Has produced a database from the OALDCE as part of research into spelling checking, which is available from the Oxford Text Archive.

Roger Mitton, Dept Computer Science, Birkbeck College, Malet Street, London WC1E 7HX

------------------------------

END OF IRList Digest
********************