Guided Tour of the UC Berkeley Digital Library Project

Documents

Welcome to the UC Berkeley Digital Library Project!

This tour will help introduce you to our project's tools and collections in the area of digital documents. We'll take you through the following elements:

  • Simple Document Access
  • The "Multivalent Document" Model
  • Tilebars Access

    Before you start, please narrow this window, and click here to bring up a second window with our project home page in it. Adjust these windows so you can read the tour and comfortably operate on the second window simultaneously.

    In the other window, you should see our home page. After the heading "Our Collections:", click on the link labeled Environmental Documents now.

    Simple Document Access

    Let's begin examining our collection of scanned image documents. You can access this collection in a variety of standard ways (and, as we'll soon see, some novel ways).

    First, be sure that our Documents page is in the window on the right. Now, let's access some documents by filling out a search form. Click on the Query link under "Ways to Search for Documents". You should now be looking at a web form. In the space labeled Title, insert the words "water conditions". Now hit the button labeled Search. The result is a list of all documents with the terms "water conditions" in the title. (Our collection is about the California environment, so there are a lot of these.)

    We can now examine any of these reports, but let's look at the document with the identifier Elib-17 (the number all the way to the right; this is "Water Conditions in California, Report 3"). Probably this is near the top of the list of documents returned. Click on the title ("Water Conditions in California Report 3"). What comes up is the title page for this report. This page provides some summary information about the document, as well as a number of different ways to view it. Let's start by just examining the page images. Click on the button labeled go to page. (Scanned pages are moderate size items, so if you are on a slow connection, this may take a while to load.)

    You should now be looking at the cover page of this report, which contains, among other things, a nice picture of snowy woods. At the top and the bottom of the page are a row of buttons for moving around. For example, if you press the Next Page button (go ahead), you'll be taken to the next page of the document.

    Like most page image systems, this mode of access enables you to stare at the various pages, but you can't do much more. Page images are, after all, essentially photos of the page. We can get a bit more functionality via OCR Text. Click on the button with this label at the top of the page. You are now looking at the text extracted from the document by an OCR ("optical character recognition") process. You can do a number of useful things here, such as searching the text, or selecting it, or scrolling through it quickly, i.e., whatever your browser allows you to do with text. For example, if you now click on the Next Page button, you can see the text extracted from page 3. This contains some rather curious text off to the left. To see why, click on the GIF Image link, which will move us back to the image of that page. As you can see, the image contains a map. The puzzling looking text is the result of running a standard OCR process on such a figure.

    Full-text Access

    If you return to the Documents query form, you can see that we can access documents by other bibliographic data, such as their author, document type, and so forth. In addition, we can also access documents by their "full text", or with any combination of full-text and bibliographic information. To do a full text search, enter some terms in the Full Text slot, and press the "Search" button. For example, if you enter the terms "coho salmon" in this field, you will retrieve the documents in our collection in which these terms occur (532 as of this writing). If you return again to the query form (hit the browser's "back" button in the other window, in hopes that "coho salmon" is still in the full text search slot), add, say, "wildlife" to the author slot, and set the type to "plan (misc.)", then hitting the "search" button will list all our documents which are plans in which "coho salmon" occurs in the text, and whose author contains the term "wildlife". (There are three of these as of this writing.)

    Multivalent Documents

    Now let's look at a new way to think about documents, what we call "Multivalent documents" (MVD).

    Our implementation of MVD is in Java, which means you will need a reasonable Java-compliant browser for this to work. In particular, MVD seems to work on HotJava most anywhere, on Netscape 4.03 or later most anywhere, and on Microsoft Internet Explorer 3.02, under Windows NT. (Unfortunately, IE under Windows 95 has a Java bug that prevents this code from working properly. Microsoft is working to fix this. IE 4.0 seems worse than IE 3.02, especially for scanned images.)

    A multivalent document comprises layers of related data, and "behaviors", dynamically loadable pieces of functionality. Most functionality is provided by the individual behaviors a document specifies. The MVD implementation provides a framework within which behaviors can interoperate. You can provide whatever kind of functionality you like, or even create new kinds of documents, by writing your own behaviors and assembling them (and associated layers of information) into multivalent documents.
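    To make the model above concrete, here is a minimal sketch (in Python, purely for illustration; it is not the project's actual Java API, and all class and method names are invented) of a document as a bundle of named data layers plus attached behaviors that interoperate through the shared document object:

```python
# Illustrative sketch of the multivalent model: layers hold data,
# behaviors supply functionality, and the framework dispatches
# actions to whichever behaviors implement them.

class Document:
    def __init__(self):
        self.layers = {}      # named layers of related data
        self.behaviors = []   # dynamically added pieces of functionality

    def add_layer(self, name, data):
        self.layers[name] = data

    def add_behavior(self, behavior):
        self.behaviors.append(behavior)

    def invoke(self, action, *args):
        # Dispatch an action to every behavior that implements it.
        results = []
        for b in self.behaviors:
            handler = getattr(b, action, None)
            if handler is not None:
                results.append(handler(self, *args))
        return results

class SearchBehavior:
    def search(self, doc, term):
        # Operates over whatever text layer the document provides.
        words = doc.layers.get("ocr-text", [])
        return [w for w in words if term in w]

doc = Document()
doc.add_layer("ocr-text", ["WATER", "CONDITIONS", "CALIFORNIA"])
doc.add_behavior(SearchBehavior())
print(doc.invoke("search", "WATER"))  # [['WATER']]
```

    The point of the sketch is the division of labor: the document itself knows nothing about searching; adding a new capability means writing one more behavior and listing it alongside the layers it needs.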

    We will now demonstrate some of MVD's capabilities. First, we will show you multivalent documents that use behaviors which "enliven" scanned document images. We will use some of these behaviors to manipulate the document in interesting ways. We will then show you how you can annotate a document using MVD. Finally, we will apply MVD to a completely different document type, namely, HTML.

    We begin by looking at the MVD "version" of a document in our scanned image collection. You can access the MVD version of any document in our DLIB collection simply by locating the document you want from our project server, and then either selecting MVD on that document's home page, or going to a scanned page image and then clicking on the MVD icon at the top or bottom of each page. We'll guide you through doing so on a document we looked at before, elib-17. We're going to start on page 8 of this document, which you are free to try to locate yourself in the other window, or go to directly by clicking here.

    As before, we are now looking at a scanned document image, so there is little we can do other than look at the document. To see the MVD "version" of this document, click on the button labeled MVD at the top of the page. A new window should appear. (It may take a moment or two to do this for the first time, as you are loading the MVD Java code over the network. It is about the size of a large GIF file.) This new window should contain the same page that we were just looking at in the web browser, but with a menu bar across the top. We'll direct our attention to this window for the rest of the MVD tour, so move it so you can continue reading this tour and using the new window simultaneously. (While we won't use the other browser window for the time being, don't kill it, as it is required to be around while MVD is running.)

    (If the new window didn't come up, or it doesn't conform to the description above, please check to be sure you are in one of the browsers we require. In general, if MVD--or any other large Java applet--seems to be misbehaving, we recommend quitting the browser, and starting again. It is an even better idea to clear the cache first.)

    For our first trick, pull down the "Edit" menu, just like you would on any application. (I.e., move the cursor over "Edit", click on the left mouse button, and keep it pressed as you select an item.) Select "Search", at the bottom of the menu. A small window should come up. In this window, there is a little button followed by the text Inc. Please click the button. Doing so should result in a check mark showing in the button, indicating that we have turned on "incremental search" mode. Now, in the window, pausing after each character, type a few characters, e.g., "RIV". (You may have to click in the text entry portion of the search widget before you can insert text. Also, be sure to type capital letters for this example.) As you type, you should see each word matching the letters you typed so far become outlined in the image.

    Type a vertical bar, i.e., "|", in the search window. Now type another string, e.g., "67" or "MOU". You should see words matching these characters highlighted in another color.
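    The matching logic behind this can be sketched roughly as follows (our guess at the behavior's logic, not the applet's code; the function name is invented): the query is split on "|" into independent patterns, and each pattern highlights, in its own color, every word on the page that contains it.

```python
def incremental_match(words, query):
    """Map each '|'-separated pattern to the page words it matches so far."""
    matches = {}
    for pattern in query.split("|"):
        if pattern:  # skip the empty pattern while the user is mid-keystroke
            matches[pattern] = [w for w in words if pattern in w]
    return matches

page_words = ["RIVER", "MOUNTAIN", "67", "RIVERBED"]
print(incremental_match(page_words, "RIV|67"))
# {'RIV': ['RIVER', 'RIVERBED'], '67': ['67']}
```

    Because the match set is recomputed after every keystroke, the outlined words narrow as you type more of a pattern.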

    Let's put the search tool away for now (by clicking on its Close button), and let's try another trick. Position the cursor over some text, say, the word "TELEMETERED" near the top of the page. Click on the word, and, holding the button down, drag it a few words to the right. When you first click, the background of the underlying word should turn gold, as should that of the words you drag over, until you let up the button.

    You have just selected the text corresponding to these highlighted words. On some systems, the text is already in your cut buffer. In others, you have to issue an explicit copy command. (On some Solaris machines, this is accomplished using a copy button; under Windows NT or 95, you must hit Ctrl-c.) Now, move the mouse to another application into which you can write, e.g., your favorite text editor. Do a paste. (On most UNIX boxes, this is accomplished by clicking the middle mouse button; on others, you may have to hit a paste button; on Windows, you must do a Ctrl-v.) With any luck, the text you have selected should be pasted in your application.

    (You might notice that what you paste contains what appears to be a typo in it. That is because the text has been derived from the image by an OCR process that is not 100% accurate. MVD doesn't DO the OCR; it just relies on what it is given.)

    Before we go on, let's reflect a bit about what we just did: We are looking at a page image, a format that normally is not very amenable to user interactions. Nevertheless, we just selected a portion of text from an image, and pasted it as if we were in a text editor or markup language previewer. This was possible because the multivalent infrastructure builds a single representation of the document from multiple layers of related data, and allows behaviors, programs that interact with each other and with the user, to exploit this structure. In particular, in this example, the document has a scanned image layer (i.e., a picture of a page of a document), and a layer containing the text as determined by an OCR process, along with positional information. One behavior builds a representation out of these layers. Others, e.g., the Search behavior, operate over it to produce interesting results.
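    A rough sketch of how a selection behavior could exploit such a positional OCR layer (the data shapes and coordinates here are invented for illustration, not the actual layer format): each OCR word carries a bounding box on the page image, so a mouse drag over the image can be turned back into text.

```python
def select_text(ocr_layer, x_start, x_end, y):
    """Return the words whose bounding boxes fall inside a horizontal drag."""
    selected = []
    for word, (x0, y0, x1, y1) in ocr_layer:
        if y0 <= y <= y1 and x0 >= x_start and x1 <= x_end:
            selected.append(word)
    return " ".join(selected)

# (word, (left, top, right, bottom)) pairs, as an OCR process might emit them
ocr = [("DAILY", (10, 5, 60, 20)),
       ("TELEMETERED", (65, 5, 180, 20)),
       ("RIVER", (185, 5, 240, 20))]
print(select_text(ocr, 60, 185, 12))  # TELEMETERED
```

    The same bounding boxes also let behaviors draw highlights and outlines in exactly the right places over the image, which is what the Search behavior does.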

    Now let's look at some other ways layers and behaviors can contribute to the functionality of a document. Pull down the "Lens" menu, and select "Show OCR". You should see a small window appear over a portion of the image. If you grab this window by its title bar (i.e., its top), you can see that you can move it about the page. If you click and drag its lower right hand corner, you can see that you can resize it. The contents of the window show you what the OCR process produces for that area of the image. For example, if you move this "lens" over "WATER" or "TELEMETERED", at the top, you can see an OCR error.

    Leaving the lens on the screen (you can remove it by clicking in the small box in its upper right hand corner), go back to the Lens menu and select "Bit Magnify". This lens magnifies what is underneath it. If you move it partially over the OCR lens, you will see that it magnifies the contents of that lens where the two overlap.

    Let's put these lenses away for the moment.

    Annotating Documents

    Now let's look at a different set of behaviors. Select some text in the image, say, "MEDICINE LAKE" in the second table portion. Now, select from the Anno (i.e., "annotation") menu the item "Add Highlight". The background of the formerly selected text should now become marker-pen yellow. I.e., you have annotated the image with a highlight marker.

    We have implemented a number of annotative behaviors. For example, select some other region, say, "SLATE CREEK". Now select from the Anno menu the entry "Add Hyperlink". A new small window should appear. Enter any URL in this window, and click "OK". The selected text should become underlined. Moreover, moving the cursor over this span will cause it to change form, and the hyperlink's destination will be shown at the bottom. Clicking on the span will direct the browser to the specified location. (Don't do this just this minute. We'll say why in a moment.)

    (You can also set an anchor in a document, i.e., a place to jump to. We'll discuss this below also.)

    Another interesting class of annotations is "copyeditor marks". Select some text again, say the word "DEPARTMENT" at the very top. Then select one of the entries in the CopyEd menu, say "Italics". The selected span will be annotated to reflect this comment. The other copyeditor marks should be self-explanatory. E.g., "Insert" will bring up a little box for you to enter text that you feel should be inserted at the beginning of the selected span.

    Finally, in the Anno menu, select the "Note/New Note" entry. A Post-it style note will appear on the screen. You can now type a message in the note. You can move it around or put it away, just like a lens. (In fact, notes are implemented as "opaque" lenses.) (Sometimes you may lose the focus in a note, i.e., find nothing is happening when you type there. If this happens, try moving the cursor over some of the text you have already typed, and see if you can get the note's attention again.)

    At this point, you may want to play with the application a bit. In doing so, you can get help from the Help menu. One particularly useful entry is "Help on Menu Item". If you select this menu item, then the cursor should change to a crosshair. Now select a menu item again. A web page should be displayed describing the menu item and its underlying behavior.

    Saving Annotations

    Once you have made some annotations, you can save them using the "Save As" menu item in the File menu. Saving will produce a new "hub document". A hub document is essentially a list of the distributed layers and behaviors that comprise a multivalent document. For example, when you access a scanned page image document using MVD, as we did above, what you are really doing is launching the MVD applet on a hub document. E.g., the hub document for a particular scanned image document in our repository says to use the behaviors that know what to do with the images and the OCR created from them, and also specifies the locations of the particular scanned image and OCR files for this document. A hub lists all the behaviors and layers that we want to comprise a multivalent document. E.g., "Search" is included in hubs for just about every document; a behavior like the "Show OCR" lens, which is specific to scanned image documents, is included in all our scanned image document hubs, but not in hubs for other base layer types (about which more below). A hub document might also include behaviors specific to a particular document.

    In effect, the hub document is what a multivalent document looks like when it is sitting in a repository. Of course, due to the nature of multivalent documents, the hub document itself doesn't usually contain much of the actual document content. The content is generally distributed around the network; the hub is just a list of the document components and instructions on putting them together.
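    To fix the idea, here is a purely illustrative sketch of the kind of information a hub carries, written as a Python data structure. The field names and URLs are invented for this example; the tour does not show the actual hub file format.

```python
# Hypothetical contents of a hub for a scanned-image document:
# layers point at distributed data; behaviors are the functionality
# to load alongside them.
hub = {
    "layers": {
        "scanned-image": "http://example.org/elib-17/page8.gif",
        "ocr-text":      "http://example.org/elib-17/page8.ocr",
    },
    "behaviors": [
        "Search",       # included in hubs for just about every document
        "ShowOCRLens",  # specific to scanned-image base layers
        "Hyperlink",    # annotations saved by "Save As" add entries here
    ],
}
```

    Note that the hub holds only references and instructions: the page image and OCR data stay on the network, and annotating the document just means adding entries to a hub, leaving the underlying layers untouched.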

    Now, when we created the hub documents for our scanned images, we created hubs that describe what are in effect plain vanilla scanned image MVD documents. However, we could have included in these hubs any number of the pre-made copyeditor marks or hyperlinks or notes, or what have you. (In fact, you will find that there is at least one note in each document; this describes the creation of the document.) Since elements like copyeditor marks or hyperlinks or notes are uses of behaviors that are separate from the "base" of the document (e.g., the scanned image), one could have any number of different hub documents, each with different annotative behaviors, but all annotating the same underlying image layer.

    When you use "Save As", what you are doing is creating a hub document that reflects what is going on in the applet at the time. For example, if you authored a hyperlink, saving will produce a hub document that includes an entry specifying this link. Of course, the new hub document will also include the behaviors and layers that were used to open the document initially, i.e., the contents of the original hub document. So, if you ask MVD to open the new hub document you just created, it will recreate the scanned image document with the hyperlink, etc., just as you put it.

    Because of Java security features, the MVD applet can't write the hub document to an arbitrary place. Instead, we let you write the hub document in a scratch directory on our server. For example, if you type into the "Save As" widget the file name "myhub.mvd" (by convention, hub documents end in ".mvd") and then hit "OK" or return, you will create the URL http://elib.cs.berkeley.edu/annotations/mvd/myhub.mvd.

    Remember, even though this is a perfectly valid URL, ordinary web browsers don't (yet) know about MVD. So if you try to look at this in a web browser, you would just see the raw hub document. Go ahead and do this if you like; you might find it interesting. However, looking at a hub document this way won't launch the multivalent document.

    So, what can you do with this? Well, a number of things:

    Note that you can open a hub document with annotations, and add more annotations. You can even annotate some annotations; for example, you can put copyeditor marks on a note. You can then save again, producing another hub document. (Right now, there isn't any easy way to distinguish one set of annotations from another, or to annotate an annotation like a copyeditor span.)

    Use the "Help on Menu Item" facility described above to learn more about saving and opening hub documents.

    Moving around, etc.

    Some other things to try: You can move to different pages of a document using the "=>" and "<=" buttons surrounding the page number at the upper left, by inserting a page number and hitting return, or by using the menu items under Go. One thing to note is that if you move off a page, all the annotations that you made will be lost. This is a limitation we will overcome, someday.

    You can close the MVD window by choosing Quit in the File menu.

    MVD on other document types

    MVD is not restricted to scanned page images. In principle, it can be made to work with any media type, and we have created "media adaptors" that know about HTML and ASCII. You can open any ASCII or HTML web page simply by specifying its URL to the "Open" widget available through the File menu. (However, we haven't yet implemented every HTML tag. In particular, tables won't always get laid out right, and frames, Java, and JavaScript are ignored entirely.) Or, if you follow http hyperlinks within MVD, you will open the corresponding resources inside the MVD applet.

    As an example, you can open the DARPA home page by putting its URL, "http://www.arpa.mil", into the File/Open widget. Now you can mark up the page using the annotation behaviors we demonstrated above. As an example, if you open "http://elib.cs.berkeley.edu/annotations/mvd/arpa-anno.mvd", you will see some annotations we made on this page.

    One interesting feature here is that, on structured text like HTML, copyediting marks are executable. Follow the instructions in the Post-it note to see this and several other features.

    This example also illustrates that the contents of notes are also annotatable. One example is a hyperlink in the note. This link points to a local anchor in the main document. That anchor was created by selecting a region of the document, and then choosing the "Set Anchor" menu item in the Anno menu. The hyperlink in the note was created just as we did above, except that the link was specified by preceding the anchor name with a "#". The help entries for the relevant menu items give more detail.

    This completes the MVD portion of this tour. You can use "Quit" in the File menu to remove the MVD window, and continue on with the tour using the second browser window.

    TileBars

    Now let's look at a different way to access documents. Go back to the Documents page, and click on TileBars search. The form that comes up allows you to enter two term sets (and the result requires Java). The idea is that you might want to contrast the distribution of term sets in documents. For our tour, let's enter "Berkeley" in the box labeled Term Set 1:, and "Santa Barbara" in the one labeled Term Set 2:. Now hit the Search button. Both term sets are shipped off to the server, which will use them to do full-text queries on the collection.

    The result will be visualized as "TileBars". (For some reason--we are looking into it--the Java applet takes an unconscionably long time (minutes) to get started. You may want to find something else to do for a minute or two, or check back with us after we've had a chance to debug this.) The TileBars are the (mostly white) rectangles you see in the result, one corresponding to each document in the result set. At the top of the results page, before the TileBars, is a group of buttons. These sort the TileBars according to how relevant they are to each of the two term sets. Both in the buttons and in the TileBars, the degree of relevance is indicated by the darkness of the square. For example, the first button (on the left) will sort the results so that those highly relevant to both term sets will be on top; the second button will sort the results giving priority to those highly relevant to the first query and only somewhat relevant to the second.

    The number in red in the middle of each button indicates how many documents of that type were found. TileBars will come up showing the first sort to have more than zero items to display. (Often, the first button will contain a 0, indicating that no documents were found that were deemed highly relevant to both term sets.) The sort that is selected is indicated by the darkened diamond inside the associated button. Therefore, you should see exactly as many TileBars above the black bar as specified by the red number on the selected sort.
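    The grouping behind those buttons can be sketched as follows (our guess at the logic, not the applet's code; document identifiers and relevance grades are made up for illustration): each document is graded for relevance to each term set, and a button selects the documents matching one (grade for term set 1, grade for term set 2) pair.

```python
# Relevance grades, darkest to lightest, as on the buttons themselves.
HIGH, SOME, NONE = 2, 1, 0

# (document, relevance to term set 1, relevance to term set 2)
docs = [
    ("Elib-12", HIGH, HIGH),
    ("Elib-34", HIGH, SOME),
    ("Elib-56", SOME, SOME),
    ("Elib-78", SOME, HIGH),
]

def select_sort(docs, grade1, grade2):
    """Documents matching the grade pair for term sets 1 and 2."""
    return [name for name, g1, g2 in docs if g1 == grade1 and g2 == grade2]

# The red number on the (high, high) button is the size of this group.
print(len(select_sort(docs, HIGH, HIGH)))  # 1
```

    Clicking a different button simply selects a different grade pair, which is why the count on each button tells you exactly how many TileBars will appear above the black bar.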

    Before examining the TileBars themselves, let's use the sort buttons. E.g., clicking on the first button will show a sort of those documents that are highly relevant to both queries; clicking on the fourth button shows TileBars representing documents deemed somewhat relevant to each term set. Before we go on, click one of the buttons that has a non-zero number of items at least somewhat relevant to both queries.

    Now, drag the cursor over any of the TileBars. You should see a red rectangle appear on the TileBar, along with a page reference on the bottom line. The arrows at the sides of some TileBars scroll the bar; these are used for long documents. Try clicking on these for the first TileBar. (The "X" indicates a sizeable run of pages with no relevant terms.)

    Position the cursor on a TileBar square that has a darkened top and a darkened bottom. Clicking on the tile will bring up this page as an MVD document. As you can see, if you are patient enough, the page comes up with the term set items highlighted. (Actually, it currently brings up an older version of the MVD. It works a bit differently from the description above, but you can press its "help" button and see what is available.)

    Document Recognition, Images, and Other Data

    This completes the tour of the Berkeley Digital Library's document collection and document-oriented research. You can eliminate the second browser window now, resize this one, and continue. Our other document related work includes document recognition, and is described along with our Advanced Structured Document Examples. Other information about our document work, including how we process documents, is available from our About Documents page.

    Return to our Digital Library Tours page to see what other tours are available.

    Be sure to sign our Guestbook now that you have finished the tour.