Date: Mon, 11 Aug 86 18:30:26 edt
From: vtisr1!irlistrq
To: fox
Subject: IRList Digest V2 #35
Status: R

IRList Digest           Monday, 11 August 1986      Volume 2 : Issue 35

Today's Topics:
   Report - File Format for Machine Readable Webster's 7th Coll.  Dict.

----------------------------------------------------------------------


Date: Fri, 1 Aug 86 11:28:01 CDT
From: James Peterson <seismo!mcc.com!peterson>
Subject: W7 file format

Since you asked about my report, I am including below the introductory
part (not the appendix).  It seems too long for IRList, but as editor,
that is your decision.  You may want to cut it even more.  jim

[Note: the report below is unedited - it seems about right for an
 issue of IRList.  Readers not interested in this topic need not read
 any more than they wish. - Ed]


        Webster's Seventh New Collegiate Dictionary
              A Computer-Readable File Format


                     James L. Peterson
 Microelectronics and Computer Technology Corporation (MCC)
                    Austin, Texas 78759


Section 1. Introduction

A transcript of Webster's Seventh New Collegiate Dictionary [1] is
available in a computer-readable form. This is not just a word list,
but a copy of the entire dictionary including definitions, cross
references, variants, synonyms, and so on. It consists of some
15,696,929 characters, with 68,764 main entries. It could be used for
all forms of text processing, including spelling, hyphenation, syntax,
semantics, and so on.

  The original dictionary was keyboarded onto the Q-32 computer at
System Development Corporation (SDC) for a project headed by John
Olney [2]. The dictionary was then heavily edited and moved onto an
IBM 360. Magnetic tapes of this form were moved to IBM T. J. Watson
Research Center and further processed by C. Alberga. A copy of this
was acquired by Robert Amsler, who used the Pocket Dictionary for his
dissertation [3]. We have acquired a copy of the dictionary from
Amsler, and have modified it in many minor ways. This document
describes our version.

  The dictionary is such a large collection of text that it is broken
up into 220 files for easier handling. These files reside under the
names d.101, d.102, d.103, ..., d.320. The file names were selected to
require three digits in all cases.

The first letter of each word in a file is the same; that is, when
we switch from words starting with 'D' to words starting with 'E',
we start a new file. Otherwise, files are broken to create roughly
equal-sized files (from 70,000 to 80,000 characters).
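
As a small illustration of the naming scheme, the following Python
sketch enumerates the file names. The function name w7_file_names is
hypothetical and is not part of the distributed files.

```python
def w7_file_names(first=101, last=320):
    """Return the dictionary file names d.101 through d.320, in order."""
    return [f"d.{n}" for n in range(first, last + 1)]


names = w7_file_names()
# Print the number of files and the first and last names.
print(len(names), names[0], names[-1])
```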

  Here is a sample of a dictionary file.

F;chase;1;;;vb;;
P;'ch{a-}s
E;ME [italic chasen], fr. MF [italic chasser], fr. (assumed) VL [italic#
captiare] -- more at [mini CATCH]
D;1;a;;vt;to follow rapidly : [mini PURSUE]
D;1;b;;vt;[mini HUNT]
D;1;c;;vt;to follow regularly or persistently with the intention of#
attracting or alluring
L;2;;;[italic obs]
D;2;;;vt;[mini HARASS]
D;3;;;vt;to seek out
D;4;a;;vt;to cause to depart or flee : [mini DRIVE]
L;4;b;;[italic slang]
D;4;b;;vt;to take (oneself) off
D;1;;;vi;to chase an animal, person, or thing
D;2;;;vi;[mini RUSH], [mini HASTEN]
S;0;[mini PURSUE], [mini FOLLOW], [mini TRAIL]:
S;1;[mini CHASE] implies going swiftly after and trying to overtake something#
fleeing or running;
S;2;[mini PURSUE] suggests a continuing effort to overtake, reach, attain;
S;3;[mini FOLLOW] puts less emphasis upon speed or intent to overtake and may#
not imply an awareness on the part of the leader that he is pursued;
S;4;[mini TRAIL] may stress a following of tracks or traces rather than a#
visible object
F;chase;2;;;n;;
D;1;a;;n;the act of chasing : [mini PURSUIT]
D;1;b;;n;[mini HUNTING] -- used with [italic the]
D;1;c;;n;an earnest or frenzied seeking after something desired
D;2;;;n;something pursued
D;3;a;;n;a franchise to hunt within certain limits of land
D;3;b;;n;a tract of unenclosed land used as a game preserve


Each line of the file has a character in column 1 which identifies
the type and format of the line. The following table shows the
number and meaning of each line type.

       Frequency   Line type   Meaning
       ________________________________________________________
          68,764   F           First line, start for a new word
          30,673   E           Etymology
          66,987   P           Pronunciation
           9,959   V           Variant
         140,501   D           Definition, one per line
          19,123   R           Related word
           4,596   X           Cross-Reference
          11,992   L           Label
             835   S           Synonym block

Each line is composed of a number of fields. Fields are separated by a
semicolon and are defined by their position. The first field of each
line is the line type character (F, E, P, V, D, L, R, X, or S, as
given above). The remaining fields depend upon the type of the line.
For example, the second field of an F line is the main entry word, the
fifth field has hyphenation information, and the sixth has part of
speech information.

The following table shows the contents of the fields for each line type.

	F lines - Main entry
	 F1 - F
	 F2 - Main entry
	 F3 - Homograph Number
	 F4 - Prefix/Suffix/Infix
	 F5 - Hyphenation
	 F6 - Part of Speech
	 F7 - Part of Speech Joiner
	 F8 - Secondary Part of Speech

	E lines - Etymology
	 E1 - E
	 E2 - text

	P lines - Pronunciation
	 P1 - P
	 P2 - text

	V lines - Variants
	 V1 - V
	 V2 - Variant Word
	 V3 - Hyphenation
	 V4 - Variant Level

	D lines - Definitions
	 D1 - D
	 D2 - Sense number
	 D3 - Sense letter
	 D4 - Sense subnumber
	 D5 - Part of Speech
	 D6 - Text definition

	R lines - Related Words
	 R1 - R
	 R2 - Related Word or Phrase
	 R3 - Hyphenation
	 R4 - Part of Speech
	 R5 - Part of Speech Joiner
	 R6 - Secondary Part of Speech

	X lines - Cross Reference
	 X1 - X
	 X2 - Word
	 X3 - superscript, if any
	 X4 - subscript, if any
	 X5 - Type of cross reference
	 X6 - Secondary word

	L lines - Labels
	 L1 - L
	 L2 - Sense Number
	 L3 - Sense Letter
	 L4 - Sense Subnumber
	 L5 - Label Text

	S lines - Synonym Block
	 S1 - S
	 S2 - Synonym number
	 S3 - text

Appendix I discusses each line and field in more detail.
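
The positional layout in the table above might be handled along the
following lines. This is a minimal Python sketch, not the program used
to prepare the files: the dictionary FIELD_NAMES, the short field
names in it, and the function parse_line are all invented here for
illustration. It assumes that only the last field of a line may itself
contain semicolons, as the S lines in the sample do.

```python
# Field names per line type, in table order (names are illustrative).
FIELD_NAMES = {
    "F": ["type", "main_entry", "homograph", "affix", "hyphenation",
          "pos", "pos_joiner", "pos2"],
    "E": ["type", "text"],
    "P": ["type", "text"],
    "V": ["type", "variant", "hyphenation", "level"],
    "D": ["type", "sense_number", "sense_letter", "sense_subnumber",
          "pos", "text"],
    "R": ["type", "related", "hyphenation", "pos", "pos_joiner", "pos2"],
    "X": ["type", "word", "superscript", "subscript", "xref_type",
          "secondary_word"],
    "L": ["type", "sense_number", "sense_letter", "sense_subnumber",
          "text"],
    "S": ["type", "synonym_number", "text"],
}


def parse_line(line):
    """Split one logical line into a dict keyed by field name."""
    names = FIELD_NAMES[line[0]]
    # Limit the split so semicolons in the final field are preserved.
    values = line.split(";", len(names) - 1)
    # Pad short lines so every named field is present.
    values += [""] * (len(names) - len(values))
    return dict(zip(names, values))


entry = parse_line("F;chase;1;;;vb;;")
print(entry["main_entry"], entry["homograph"], entry["pos"])
```

Lines should be passed to parse_line only after any continuation
lines have been joined.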

  Some lines, particularly definitions and synonym blocks, are quite
long, and hence it is difficult to fit them on one 80-character line.
Therefore, lines are split whenever necessary so that no physical line
exceeds 80 characters. Lines are always split at a blank, and the
incomplete line is terminated with a sharp or hash mark character
('#'). When processing the dictionary, if a line is terminated with a
# character, then the # character should be replaced by a blank and
the next physical line should be read in and appended to the previous
line.
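
The joining rule just described can be sketched in Python as follows.
The function name logical_lines is hypothetical; the handling of a
'#' on the very last physical line (where there is nothing left to
append) is an assumption, since the text does not cover that case.

```python
def logical_lines(physical_lines):
    """Yield logical lines, joining physical lines that end with '#'."""
    pending = ""
    for raw in physical_lines:
        line = raw.rstrip("\n")
        if line.endswith("#"):
            # Replace the '#' by a blank and keep reading.
            pending += line[:-1] + " "
        else:
            yield pending + line
            pending = ""
    if pending:
        # Input ended in the middle of a continued line (assumed case).
        yield pending.rstrip()


sample = [
    "D;1;c;;vt;to follow regularly or persistently with the intention of#",
    "attracting or alluring",
    "L;2;;;[italic obs]",
]
for line in logical_lines(sample):
    print(line)
```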

1.1 Character codes

A major problem with the dictionary is its character set. First, the
dictionary publisher did not feel constrained in his use of
characters, but chose whatever symbols best fit his purpose. Second,
the dictionary was originally encoded in an extended BCD (for the Q-32
computer), then translated into EBCDIC (for the IBM 360/370) and now
has been translated into ASCII. None of these character sets is
completely compatible with the others, nor is any of them sufficient
to represent all of the variation found in the original printed
dictionary. Hence an encoding scheme must be used to expand the set of
representable characters.

  This expansion occurs in two independent directions: font
information and special characters.

  Font information is represented by use of the square brackets in
ASCII to surround any special font material. Five font types are
recognized: (1) italic, (2) MINI-CAPS, (3) bold, (4) subscripts, and
(5) superscripts. Each is denoted by an identifying keyword
immediately after the opening left square bracket, followed by a
space, followed by the material to be in the defined font, followed by
the closing right square bracket.  For example, an italic "was" is
represented as [italic was], while a mini-caps "AMBIENT" is [mini
AMBIENT] and a bold "syn" is [bold syn]. Superscripts and subscripts may
be italic, mini-caps, or bold, and a few superscripted superscripts
also occur, as in 6.24 {times} 10 [sup 10 [sup 10]].
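
Because font spans can nest, a small recursive parser is one way to
read them. The Python sketch below is illustrative only: the function
name parse_fonts and the nested-list result format are assumptions,
and the keyword is taken to be whatever text lies between the '[' and
the first following space, so no particular keyword list is assumed.

```python
def parse_fonts(text):
    """Parse '[keyword material]' markup into a nested list.

    Plain text becomes str items; each bracketed span becomes a
    (keyword, children) tuple, where children is again such a list.
    """
    def parse(pos, depth):
        items, buf = [], []
        while pos < len(text):
            ch = text[pos]
            if ch == "[":
                if buf:
                    items.append("".join(buf))
                    buf = []
                # The keyword runs up to the first space after '['.
                space = text.index(" ", pos)
                keyword = text[pos + 1:space]
                children, pos = parse(space + 1, depth + 1)
                items.append((keyword, children))
            elif ch == "]":
                if depth == 0:
                    raise ValueError(f"unmatched ']' at offset {pos}")
                if buf:
                    items.append("".join(buf))
                return items, pos + 1
            else:
                buf.append(ch)
                pos += 1
        if depth:
            raise ValueError("unmatched '['")
        if buf:
            items.append("".join(buf))
        return items, pos

    return parse(0, 0)[0]


print(parse_fonts("10 [sup 10 [sup 10]]"))
```

Mismatched brackets raise ValueError, so the same routine can double
as a simple well-formedness check on a logical line.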

  The dictionary includes a large number of special symbols which are
not representable in ASCII. These include all the Greek alphabet, the
Hebrew alphabet and many miscellaneous other special symbols. All
special symbols which are not available in ASCII (and some that are)
have been given names which are represented by the name encased in
braces, as {degrees} (for a degree symbol), {times} (for multiplication
represented by a small x), {tau} (for the lower-case Greek letter
tau), and so on. A complete list of these special symbols is in
Appendix II.

  Each symbol name has been selected to exclude embedded blanks.  Thus
all characters between an opening left brace and its closing right
brace are non-blank. Certain characters which are in ASCII (braces,
brackets, question mark, exclamation mark, and so on) have also been
represented in this way because it allowed them to be used for other
purposes (such as font and special character representation), and they
occurred only infrequently (less than 100 times).
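
Since symbol names contain no blanks, they are easy to locate with a
regular expression. The following Python sketch is an illustration,
not part of the distributed files: the names SYMBOL, symbol_names,
and replace_symbols are hypothetical, and the three-entry table
mapping names to Unicode characters is an assumption standing in for
the full list in Appendix II.

```python
import re

# A symbol name is a run of non-blank characters between braces.
SYMBOL = re.compile(r"\{([^{}\s]+)\}")

# Illustrative mapping only; the real list of names is in Appendix II.
EXAMPLE_TABLE = {
    "degrees": "\u00b0",  # degree sign
    "times": "\u00d7",    # multiplication sign
    "tau": "\u03c4",      # Greek small letter tau
}


def symbol_names(text):
    """Return the symbol names used in text, in order of appearance."""
    return SYMBOL.findall(text)


def replace_symbols(text, table):
    """Replace {name} by table[name]; unknown names are left as-is."""
    return SYMBOL.sub(lambda m: table.get(m.group(1), m.group(0)), text)


print(symbol_names("6.24 {times} 10 [sup 10 [sup 10]]"))
print(replace_symbols("6.24 {times} 10", EXAMPLE_TABLE))
```

Leaving unknown names untouched keeps the conversion lossless for any
symbol that has not yet been given a mapping.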

Section Errors in W7.

  While processing W7 both to understand its contents and to put those
contents into a usable form, we encountered a large number of errors.
These errors were of several types:

* Merged illustrations. For example, under "false", the illustration was
  "< ~ documents ~ teeth >" and should have been "< ~ documents > < ~
  teeth >". To correct this, we searched for any line of the form
  "<...~...~...>".

* Words containing letters with accents (236 entries).  The accent
  field was wrong about half the time.  The normal problem was that
  the accent was on the wrong letter.  In these cases, the hyphenation
  information generally showed syllables that were two letters too
  long.

* Incorrect values in fields.  The lists in Appendix I were examined
  for rare or inappropriate values; for example, a g in a numeric
  field, or a zero in an alphabetic field.

* Mismatched parentheses or brackets.  We wrote a program to simply
  count parentheses, braces, and brackets.  Many were found to be
  mismatched.

* Duplicate words. We scanned the text for instances where the same
  word occurs twice in a row.  The assumption was that these would be
  places where the last word on a line of the original input was typed
  twice by mistake.  Instead we found a large number of places where
  "of or" was typed as "or or", "as a" was typed as "a a", and the
  first word on an input line was typed twice.

* Incorrect article.  We found all occurrences of "a" followed by a
  word starting with a vowel, or "an" followed by a word starting with
  a consonant.
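
Several of the mechanical checks in the list above can be approximated
with a few lines of Python. This is a sketch under stated assumptions,
not the original programs: all function names are hypothetical, the
checks are meant to run on logical lines (after continuation lines are
joined), and the article test looks only at the first letter of the
next word, so it merely produces candidates for inspection.

```python
import re

MERGED = re.compile(r"<[^<>]*~[^<>]*~[^<>]*>")
DOUBLED = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)
ARTICLE = re.compile(r"\b(an?)\s+([A-Za-z]\w*)", re.IGNORECASE)
PAIRS = {"(": ")", "[": "]", "{": "}"}


def has_merged_illustration(line):
    """True if one <...> illustration contains two '~' marks."""
    return MERGED.search(line) is not None


def unbalanced_delimiters(line):
    """Map each unbalanced pair to (opening count - closing count)."""
    return {o + c: line.count(o) - line.count(c)
            for o, c in PAIRS.items()
            if line.count(o) != line.count(c)}


def doubled_words(line):
    """Return words that occur twice in a row."""
    return DOUBLED.findall(line)


def article_mismatches(line):
    """Return 'a' + vowel-initial or 'an' + consonant-initial pairs."""
    hits = []
    for m in ARTICLE.finditer(line):
        article = m.group(1).lower()
        starts_with_vowel = m.group(2)[0].lower() in "aeiou"
        # Mismatch: "a" before a vowel, or "an" before a consonant.
        if (article == "a") == starts_with_vowel:
            hits.append(m.group(0))
    return hits


print(has_merged_illustration("< ~ documents ~ teeth >"))
print(doubled_words("a tract of or or land"))
print(article_mismatches("an Hawaiian cloth"))
```

A letter-based article test also flags correct phrases such as "an
hour", which is one reason each candidate still needs to be verified
by eye.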

All these errors, once found and verified by visual inspection of
the printed dictionary, were corrected by hand, using a text editor.

A last form of error analysis was an attempt to find typographical
errors.  The approach was simple: we extracted a list of all
unique words used in the dictionary definitions.  This produced a
list of 54,298 words.  We compared this list with the list of all
words defined in the dictionary (main entries, variants, and
related words). This reduced our list to 20,292 words that were
used in definitions, but not defined.  Many of these were derived
forms of defined words: past tense, plurals, and so on.  Doing
some simple suffix analysis, we were left with about 8,000 words.
Most of these were apparently Greek or Latin botanical or
zoological names.  Deleting those ending in -ia or -ae and all
words in italics in the dictionary left a list of 2,821 words.
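
The filtering steps in this paragraph amount to a set difference
followed by two pruning passes. The Python sketch below shows the
idea. It is an illustration only: the function names are
hypothetical, the suffix list is an assumption (the text does not say
which suffix rules were used), and the input text is assumed to have
had its font and symbol markup removed already.

```python
import re

WORD = re.compile(r"[A-Za-z]+")
# Illustrative suffixes; the rules actually used are not given.
SUFFIXES = ("s", "es", "ed", "ing")


def undefined_words(definition_texts, defined_words):
    """Words used in definitions that are not themselves defined."""
    used = {w.lower() for t in definition_texts for w in WORD.findall(t)}
    defined = {w.lower() for w in defined_words}
    return used - defined


def is_derived(word, defined, suffixes=SUFFIXES):
    """True if stripping a suffix (optionally restoring 'e') gives a
    defined word."""
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            stem = word[:-len(s)]
            if stem in defined or stem + "e" in defined:
                return True
    return False


def drop_latin_endings(words):
    """Discard words ending in -ia or -ae."""
    return {w for w in words if not w.endswith(("ia", "ae"))}


defined = {"the", "act", "of", "chase", "something", "pursue"}
texts = ["the act of chasing", "something pursued"]
candidates = undefined_words(texts, defined)
print(sorted(w for w in candidates if not is_derived(w, defined)))
```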

These were checked by hand to produce a list of 903 incorrectly
spelled words. We also found 54 words which were used, but not
defined:

   Australasian bloomery broadheaded clothesline crossbeam
   darkskinned dinnerware doorbell entranceway equivalve etc
   fieldworks flattish foreseen foretold fourpence gunstock hairdress
   hindlimbs homeward hyperactivity leftward lightcolored longlegged
   lowcut messroom Mr. nailhead neckband noctuid nubby parimutuel
   partaken pregenital pyrotechny rangeland rosebush sawteeth
   seneschal sheepdogs shorthaired sightsee snowstorm songbook
   spinymargined spondumene sulphates supersensitized TV twelvefold
   understock upcurved valency workpiece.

We also found a smaller list of words with typographical errors in
the main entry in the computer files.

Of the 903 typographical errors, 543 were the result of a missing
blank between two words.  Of the remaining 360, 34% were a missing
letter, 27% were a wrong letter, 20% were an extra letter, and 13%
were the result of transposed letters.  The remaining errors were
caused by two extra or two missing letters, or by transposing two
letters around a third.  The middle letter in this case was always
a vowel. (For example, 'min' would be typed 'nim'.)
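
The categories in this breakdown can be stated precisely as simple
string comparisons. The following Python sketch classifies a single
(typed, intended) pair. It is an illustration under assumptions: the
function names and category labels are invented here, and the
classification in the text was done by hand, not by this code.

```python
def _one_deleted(longer, shorter):
    """True if deleting one character from longer gives shorter."""
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def classify_typo(typed, intended):
    """Assign a typed/intended pair to one of the error categories."""
    if " " in intended and typed == intended.replace(" ", ""):
        return "missing blank"
    if len(typed) + 1 == len(intended) and _one_deleted(intended, typed):
        return "missing letter"
    if len(typed) == len(intended) + 1 and _one_deleted(typed, intended):
        return "extra letter"
    if len(typed) == len(intended):
        diffs = [i for i, (a, b) in enumerate(zip(typed, intended))
                 if a != b]
        if len(diffs) == 1:
            return "wrong letter"
        if len(diffs) == 2:
            i, j = diffs
            swapped = (typed[i] == intended[j]
                       and typed[j] == intended[i])
            if swapped and j - i == 1:
                return "transposed letters"
            if swapped and j - i == 2:
                return "transposed around a third letter"
    return "other"


for typed, intended in [("nim", "min"), ("huntign", "hunting"),
                        ("gamepreserve", "game preserve")]:
    print(typed, "->", classify_typo(typed, intended))
```

Pairs with two extra or two missing letters fall into the "other"
category in this sketch.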

We also found 10 cases of typographical errors in the original
printed dictionary.  These are:

[barranca]	gulley => gully
[bitch]		doublecross => double-cross
[capsicum]	genu => genus
[drift]		quantitive => quantitative
[fornication]	NCE => New Catholic Edition
[kid]		goodhumored => good-humored
[lapse]		apostasize => apostatize
[lycopodium]	clubmosses => club mosses
[type species]	permanenty => permanently
[vanity]	knicknack => knickknack

[select]	mismatched parenthesis
[terpineol]	mismatched parenthesis

[tapa]		an Hawaiian (but see 'a Hawaiian' for [luau] and [poi])

[pay]		REQUITE in syn not defined
[rude]		ROUGH in syn not defined

[Lag b'Omer]	33d => 33rd
[one]		3d-person => 3rd-person

It was interesting to follow these errors through the various
printings and editions of the Merriam-Webster dictionaries.  Four
errors were corrected in the 1970 printing of W7, one in the 1971
printing of W7, and one in the 1975 printing of the New Collegiate
Dictionary.  Three errors remain in the 1983 printing of the Ninth
New Collegiate ([bitch], [drift], [vanity]).

------------------------------

END OF IRList Digest
********************