Date: Mon, 11 Aug 86 18:30:26 edt
From: vtisr1!irlistrq
To: fox
Subject: IRList Digest V2 #35
Status: R

IRList Digest           Monday, 11 August 1986      Volume 2 : Issue 35

Today's Topics:
   Report - File Format for Machine Readable Webster's 7th Coll. Dict.

----------------------------------------------------------------------

Date: Fri, 1 Aug 86 11:28:01 CDT
From: James Peterson
Subject: W7 file format

Since you asked about my report, I am including below the introductory part (not the appendix). It seems too long for IRList, but as editor, that is your decision. You may want to cut it even more.

jim

[Note: the report below is unedited - it seems about right for an issue of IRList. Readers not interested in this topic need not read any more than they wish. - Ed]

Webster's Seventh New Collegiate Dictionary
A Computer-Readable File Format

James L. Peterson
Microelectronics and Computer Technology Corporation (MCC)
Austin, Texas 78759

Section 1. Introduction

A transcript of Webster's Seventh New Collegiate Dictionary [1] is available in a computer-readable form. This is not just a word list, but a copy of the entire dictionary including definitions, cross references, variants, synonyms, and so on. It consists of some 15,696,929 characters, with 68,764 main entries. It could be used for all forms of text processing, including spelling, hyphenation, syntax, semantics, and so on.

The original dictionary was keyboarded onto the Q-32 computer at System Development Corporation (SDC) for a project headed by John Olney [2]. The dictionary was then heavily edited and moved onto an IBM 360. Magnetic tapes of this form were moved to IBM T. J. Watson Research Center and further processed by C. Alberga. A copy of this was acquired by Robert Amsler, who used the Pocket Dictionary for his dissertation [3]. We have acquired a copy of the dictionary from Amsler, and have modified it in many minor ways. This document describes our version.

The dictionary is such a large collection of text that it is broken up into 220 files for easier handling. These files reside under the names d.101, d.102, d.103, ..., d.320. The file names were selected to require three digits in all cases. The first letter of each word in a file is the same; that is, when we switch from words starting with 'D' to words starting with 'E', we start a new file. Otherwise, files are broken to create roughly equal-sized files (from 70,000 to 80,000 characters).

Here is a sample of a dictionary file.

    F;chase;1;;;vb;;
    P;'ch{a-}s
    E;ME [italic chasen], fr. MF [italic chasser], fr. (assumed) VL [italic#
    captiare] -- more at [mini CATCH]
    D;1;a;;vt;to follow rapidly : [mini PURSUE]
    D;1;b;;vt;[mini HUNT]
    D;1;c;;vt;to follow regularly or persistently with the intention of#
    attracting or alluring
    L;2;;;[italic obs]
    D;2;;;vt;[mini HARASS]
    D;3;;;vt;to seek out
    D;4;a;;vt;to cause to depart or flee : [mini DRIVE]
    L;4;b;;[italic slang]
    D;4;b;;vt;to take (oneself) off
    D;1;;;vi;to chase an animal, person, or thing
    D;2;;;vi;[mini RUSH], [mini HASTEN]
    S;0;[mini PURSUE], [mini FOLLOW], [mini TRAIL]:
    S;1;[mini CHASE] implies going swiftly after and trying to overtake something#
    fleeing or running;
    S;2;[mini PURSUE] suggests a continuing effort to overtake, reach, attain;
    S;3;[mini FOLLOW] puts less emphasis upon speed or intent to overtake and may#
    not imply an awareness on the part of the leader that he is pursued;
    S;4;[mini TRAIL] may stress a following of tracks or traces rather than a#
    visible object
    F;chase;2;;;n;;
    D;1;a;;n;the act of chasing : [mini PURSUIT]
    D;1;b;;n;[mini HUNTING] -- used with [italic the]
    D;1;c;;n;an earnest or frenzied seeking after something desired
    D;2;;;n;something pursued
    D;3;a;;n;a franchise to hunt within certain limits of land
    D;3;b;;n;a tract of unenclosed land used as a game preserve

Each line of the file has a character in column 1 which identifies the type and format of the line.
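
As a small illustration of the column-1 convention, the following Python sketch tallies lines by their first character. The function name count_line_types and the embedded SAMPLE string are assumptions made for this example and are not part of the distributed files. Only records from the sample above that fit on a single physical line are used, so the sketch makes no attempt to handle records split across lines.

```python
from collections import Counter

# A few unsplit records copied from the sample above (illustrative only).
SAMPLE = """\
F;chase;1;;;vb;;
P;'ch{a-}s
D;1;a;;vt;to follow rapidly : [mini PURSUE]
D;1;b;;vt;[mini HUNT]
L;2;;;[italic obs]
D;2;;;vt;[mini HARASS]
"""


def count_line_types(text):
    """Tally non-empty lines by the character found in column 1."""
    return Counter(line[0] for line in text.splitlines() if line)


counts = count_line_types(SAMPLE)
for line_type, n in sorted(counts.items()):
    print(line_type, n)
```

Run over a whole file, a tally of this kind would be expected to mix in the first characters of any split records, so it is only a rough check unless split lines are rejoined first.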
The following table shows the number and meaning of each line type.

    Frequency   Line type   Meaning
    ________________________________________________________
       68,764       F       First line, start for a new word
       30,673       E       Etymology
       66,987       P       Pronunciation
        9,959       V       Variant
      140,501       D       Definition, one per line
       19,123       R       Related word
        4,596       X       Cross-Reference
       11,992       L       Label
          835       S       Synonym block

Each line is composed of a number of fields. Fields are separated by a semicolon and are defined by their position. The first field of each line is the line type character (F, E, P, V, D, L, R, X, or S, as given above). The remaining fields depend upon the type of the line. For example, the second field on an F line is a main entry word, the fifth field has hyphenation information, and the sixth has part of speech information. The following table shows the contents of the fields for each line type.

    F lines - Main entry
        F1 - F
        F2 - Main entry
        F3 - Homograph Number
        F4 - Prefix/Suffix/Infix
        F5 - Hyphenation
        F6 - Part of Speech
        F7 - Part of Speech Joiner
        F8 - Secondary Part of Speech

    E lines - Etymology
        E1 - E
        E2 - text

    P lines - Pronunciation
        P1 - P
        P2 - text

    V lines - Variants
        V1 - V
        V2 - Variant Word
        V3 - Hyphenation
        V4 - Variant Level

    D lines - Definitions
        D1 - D
        D2 - Sense number
        D3 - Sense letter
        D4 - Sense subnumber
        D5 - Part of Speech
        D6 - Text definition

    R lines - Related Words
        R1 - R
        R2 - Related Word or Phrase
        R3 - Hyphenation
        R4 - Part of Speech
        R5 - Part of Speech Joiner
        R6 - Secondary Part of Speech

    X lines - Cross Reference
        X1 - X
        X2 - Word
        X3 - superscript, if any
        X4 - subscript, if any
        X5 - Type of cross reference
        X6 - Secondary word

    L lines - Labels
        L1 - L
        L2 - Sense Number
        L3 - Sense Letter
        L4 - Sense Subnumber
        L5 - Label Text

    S lines - Synonym Block
        S1 - S
        S2 - Synonym number
        S3 - text

Appendix I discusses each line and field in more detail.

Some lines, particularly definitions and synonym blocks, are quite long and hence it is difficult to fit them on one 80-character line.
Therefore, lines are split whenever necessary so that no physical line exceeds 80 characters. Lines are always split at a blank, and the incomplete line is terminated with a sharp or hash mark character ('#'). When processing the dictionary, if a line is terminated with a # character, then the # character should be replaced by a blank and the next physical line should be read in and appended to the previous line.

1.1 Character codes

A major problem with the dictionary is its character set. First, the dictionary publisher did not feel constrained in his use of characters, but chose whatever symbols best fit his purpose. Second, the dictionary was originally encoded in an extended BCD (for the Q-32 computer), then translated into EBCDIC (for the IBM 360/370) and now has been translated into ASCII. None of these character sets is completely compatible with the others, nor is any of them sufficient to represent all of the variation found in the original printed dictionary.

Hence an encoding scheme must be used to expand the set of representable characters. This expansion occurs in two independent directions: font information and special characters.

Font information is represented by use of the square brackets in ASCII to surround any special font material. Five font types are recognized: (1) italic, (2) MINI-CAPS, (3) bold, (4) subscripts, and (5) superscripts. Each is denoted by an identifying keyword immediately after the opening left square bracket, followed by a space, followed by the material to be in the defined font, followed by the closing right square bracket. For example, an italic "was" is represented as [italic was], while a mini-caps "AMBIENT" is [mini AMBIENT] and a bold "syn" is [bold syn]. Superscripts and subscripts may be italic, mini-caps, or bold, and a few superscripted superscripts also occur, as in 6.24 {times} 10 [sup 10 [sup 10]].

The dictionary includes a large number of special symbols which are not representable in ASCII.
These include all of the Greek alphabet, the Hebrew alphabet, and many other miscellaneous special symbols. All special symbols which are not available in ASCII (and some that are) have been given names which are represented by the name encased in braces, as {degrees} (for a degree symbol), {times} (for multiplication represented by a small x), {tau} (for the lower-case Greek letter tau), and so on. A complete list of these special symbols is in Appendix II. Each symbol name has been selected to exclude embedded blanks. Thus all characters between an opening left brace and its closing right brace are non-blank.

Certain characters which are in ASCII (braces, brackets, question mark, exclamation mark, and so on) have also been represented in this way, because this allowed them to be used for other purposes (such as font and special character representation) and because they occurred only infrequently (fewer than 100 times).

Section Errors in W7.

While processing W7 both to understand its contents and to put those contents into a usable form, we encountered a large number of errors. These errors were of several types:

* Merged illustrations. For example, under "false", the illustration was "< ~ documents ~ teeth >" and should have been "< ~ documents > < ~ teeth >". To correct this, we searched for any line of the form "<...~...~...>".

* Words containing letters with accents (236 entries). The accent field was wrong about half the time. The normal problem was that the accent was on the wrong letter. In these cases, the hyphenation information generally showed syllables that were two letters too long.

* Incorrect values in fields. The lists in Appendix I were examined for rare or inappropriate values; for example, a "g" in a numeric field, or a zero in an alphabetic field.

* Mismatched parentheses or brackets. We wrote a program to simply count parentheses, braces, and brackets. Many were found to be mismatched.

* Duplicate words.
We scanned the text for instances where the same word occurs twice in a row. The assumption was that these would be places where the last word on a line of the original input was typed twice by mistake. Instead we found a large number of places where "of or" was typed as "or or", "as a" was typed as "a a", and the first word on an input line was typed twice.

* Incorrect article. We found all occurrences of "a" followed by a word starting with a vowel, or "an" followed by a word starting with a consonant.

All these errors, once found and verified by visual inspection of the printed dictionary, were corrected by hand, using a text editor.

A last form of error analysis was an attempt to find typographical errors. The approach was simple: we extracted a list of all unique words used in the dictionary definitions. This produced a list of 54,298 words. We compared this list with the list of all words defined in the dictionary (main entries, variants, and related words). This reduced our list to 20,292 words that were used in definitions, but not defined. Many of these were derived forms of defined words: past tense, plurals, and so on. Doing some simple suffix analysis, we were left with about 8,000 words. Most of these were apparently Greek or Latin botanical or zoological names. Deleting those ending in -ia or -ae and all words in italics in the dictionary left a list of 2,821 words. These were checked by hand to produce a list of 903 incorrectly spelled words.

We also found 54 words which were used, but not defined:

Australasian bloomery broadheaded clothesline crossbeam darkskinned dinnerware doorbell entranceway equivalve etc fieldworks flattish foreseen foretold fourpence gunstock hairdress hindlimbs homeward hyperactivity leftward lightcolored longlegged lowcut messroom Mr.
nailhead neckband noctuid nubby parimutuel partaken pregenital pyrotechny rangeland rosebush sawteeth seneschal sheepdogs shorthaired sightsee snowstorm songbook spinymargined spondumene sulphates supersensitized TV twelvefold understock upcurved valency workpiece.

We also found a smaller list of words with typographical errors in the main entry in the computer files.

Of the 903 typographical errors, 543 were the result of a missing blank between two words. Of the remaining 360, 34% were a missing letter, 27% were a wrong letter, 20% were an extra letter, and 13% were the result of transposed letters. The remaining errors were caused by two extra or two missing letters, or by transposing two letters around a third. The middle letter in this case was always a vowel. (For example, 'min' would be typed 'nim'.)

We also found 10 cases of typographical errors in the original printed dictionary. These are

    [barranca]       gulley => gully
    [bitch]          doublecross => double-cross
    [capsicum]       genu => genus
    [drift]          quantitive => quantitative
    [fornication]    NCE => New Catholic Edition
    [kid]            goodhumored => good-humored
    [lapse]          apostasize => apostatize
    [lycopodium]     clubmosses => club mosses
    [type species]   permanenty => permanently
    [vanity]         knicknack => knickknack
    [select]         mismatched parenthesis
    [terpineol]      mismatched parenthesis
    [tapa]           an Hawaiian (but see 'a Hawaiian' for [luau] and [poi])
    [pay]            REQUITE in syn not defined
    [rude]           ROUGH in syn not defined
    [Lag b'Omer]     33d => 33rd
    [one]            3d-person => 3rd-person

It was interesting to follow these errors through the various printings and editions of the Merriam-Webster dictionaries. Four errors were corrected in the 1970 printing of W7, one in the 1971 printing of W7, and one in the 1975 printing of the New Collegiate Dictionary. Three errors remain in the 1983 printing of the Ninth New Collegiate ([bitch], [drift], [vanity]).

------------------------------

END OF IRList Digest
********************