Images of Digital Libraries
by Edward A. Fox

Paper Version of Keynote Address for NORDINFO Conference: Digital transfer of images
Helsinki, Finland
Nov. 10-11, 1994

INTRODUCTION

A crucial component of the emerging Global Information Infrastructure will be an international, interoperable digital library (Fox, 1993a, 1993b, 1994; Fox & Lunin, 1993) that will provide transparent hypermedia and search access to images, texts and other multimedia documents in a distributed virtual library for the world. For this to develop there must be willing international collaboration, an agreed-upon framework or reference model (Gladney et al., 1994a, 1994b) supported by a corresponding suite of standards, and numerous prototyping and pilot projects to prepare for large-scale development efforts.

The vision of a personal tool to access and link information was clearly articulated in (Bush, 1945), which paved the way for the fields of information retrieval and hypertext. Today's information highway or infobahn can carry multimedia information to millions of users of the World-Wide Web, but many important advances are still needed to realize the potential benefits of digital libraries.

In the first wave of modern hypermedia technology we mainly see old methods appearing in new forms. Individuals have their own electronic art galleries and vanity presses. Electronic travelers revel in "surfing the Internet", feel comfortable in visiting "home pages", and become excited through "resource discovery" that adds to their "hotlist". Yet, the new modes of human-human communication facilitated by the Internet are changing our social, political and cultural habits, extending the trend of disintermediation that is now transforming the world of publishing (Wiederhold, 1995). From this fragmentation we must forge a new intellectual unity, with digital libraries providing a firm foundation.
Modern technology supports moving away from centralization, commuting, conformity, barriers to truth or trade, limitations of time and space, and the cycle of production, warehousing, and distribution. This same technology encourages individuality, integration, interfaces that are universal but personalizable, indexing that is open rather than controlled, and information that is multimedia and organized using hypermedia as well as hierarchical structures. The Information Infrastructure we are building should increase and redefine what we mean by "productivity", and should lead to similar changes in social services, education and entertainment (according to the U.S. Council on Competitiveness). Digital libraries are Grand Challenge Applications, which U.S. Vice President Gore hopes will connect our classrooms, libraries and hospitals. Clearly there is a need to scale up our electronic information services in terms of content, connectivity and convenience of use by large, heterogeneous user communities. People hope for a service that will cost no more than cable TV, that will support communication as conveniently as our global telephone system, and that will be as responsive as our computer game units or office workstations.

How can we build these digital libraries (DLs)? Some important principles are highlighted in (Fox et al., 1993b). Key lessons can be learned from pilot projects like the CORE effort that deals with chemical literature (Entlich et al., 1995). However, we would be wise, before proceeding too far, to identify DL requirements and to agree upon an open DL architecture.

DIGITAL LIBRARY REQUIREMENTS AND ARCHITECTURE

(Gladney et al., 1994b) gives an overview of the longer explanation of architectures for digital libraries found in (Gladney et al., 1994a).
These works appeal for a scholarly and collaborative initiative to determine requirements for digital libraries, to reach consensus regarding an open architecture for them, and to push for a suite of standards so that DLs can interoperate.

DL requirements can be broken down into those that are characteristic of traditional libraries and those that are peculiar to DLs. In the first category are functional requirements of libraries: collection, organization, representation, retrieval, access, analysis, synthesis, and dissemination. On the data and information level there is a need to manage metadata and a catalog that facilitates access. Translations and cross references are examples of value-added information that broaden the user access base. Such access may be controlled, with different rights afforded to different patrons; a DL must support a variety of access policies that may be imposed by DL staff, information providers, or outside regulators. Other roles of traditional libraries include: storage, archiving, preservation, and interchange.

Requirements of DLs that are not common in traditional libraries include: capture, digitization, document markup, linking, electronic interchange, information retrieval, direct support of tasks on user desktops, and human-computer interface usability.

In light of current distributed computer systems, a client-server architecture seems appropriate. On the client side there must be a presentation manager for various applications, each of which rests upon suitable document managers. Caching is needed on the workstation or in the network. A document storage subsystem is required, but may reside on the client, the server, or both. The server must have database, information retrieval and/or file management capabilities, for both the information collection and the catalog that facilitates access to it.
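The client-server split described above can be illustrated with a small sketch. It is only one possible reading of the text, not a design taken from the Gladney et al. reports: the class names LibraryServer and LibraryClient, the word-based catalog, and the fetch counter are all assumptions introduced for illustration. The server holds both the collection and the catalog; the client keeps a workstation cache and a trivial presentation step.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class LibraryServer:
    """Hypothetical server: stores the collection and a catalog over it."""
    collection: Dict[str, str] = field(default_factory=dict)  # doc id -> content
    catalog: Dict[str, Set[str]] = field(default_factory=dict)  # term -> doc ids
    fetch_count: int = 0  # number of documents actually shipped to clients

    def add(self, doc_id: str, content: str) -> None:
        """Store a document and index each whitespace-separated term."""
        self.collection[doc_id] = content
        for term in content.lower().split():
            self.catalog.setdefault(term, set()).add(doc_id)

    def search(self, term: str) -> List[str]:
        """Look the term up in the catalog; return matching doc ids, sorted."""
        return sorted(self.catalog.get(term.lower(), set()))

    def fetch(self, doc_id: str) -> str:
        """Return the document content and count the transfer."""
        self.fetch_count += 1
        return self.collection[doc_id]


@dataclass
class LibraryClient:
    """Hypothetical client: a workstation cache plus a presentation step."""
    server: LibraryServer
    cache: Dict[str, str] = field(default_factory=dict)

    def open_document(self, doc_id: str) -> str:
        """Serve from the local cache; contact the server only on a miss."""
        if doc_id not in self.cache:
            self.cache[doc_id] = self.server.fetch(doc_id)
        return self.cache[doc_id]

    def present(self, doc_id: str) -> str:
        """Stand-in for a presentation manager: format the document for display."""
        return f"[{doc_id}] {self.open_document(doc_id)}"
```

In this sketch a second call to present() for the same document is answered from the cache, so the server's fetch_count stays at one. Access policies, which the text also requires, are deliberately left out.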
Another view of DL architecture is that it must have a variety of layers, such as those supporting distributed data services. Suitable applications can be built more easily upon a layer of application enablers, which make use of resource managers, which in turn depend on operating system and communication software. Application enablers include: document analyzers or indexing routines, folder managers, link engines, query handlers, resource selectors and fusers, and visualization software (e.g., to create summaries or thumbnail sketches). Resource managers include authentication and authorization servers, DBMSs, library servers (e.g., blob, cache, catalog), output servers (e.g., print, FAX), search engines, and video servers. These managers may control protected resources, and may keep logs or other records.

A special type of manager found in DLs is the document manager, which embodies a particular document model. Some models relate to content, such as for CAD, GIS or image collections. Folder managers are another example, but the most typical document manager is for electronic versions of library holdings: pages, articles, journals, books, etc. Finally, there are hypermedia document managers like Mosaic and Hyper-G, which have recently become very powerful and very widely used.

If this program of developing requirements and architectures is to be successful, it must be embodied through the development of interoperable systems that follow suitable standards, some now available (e.g., Z39.50, SGML, JPEG, and MPEG) and others yet to emerge (e.g., for a reference model or scripting). Some of these needs are made clearer as a result of examining the various DL projects underway at Virginia Tech.

DIGITAL LIBRARY PROJECTS AT VIRGINIA TECH

At Virginia Tech, a rich infrastructure has been developing that is particularly well suited to digital library pilot efforts. For a decade, freshmen entering the Engineering College have purchased personal computers.
The year after that initiative started, freshmen in Computer Science began purchasing UNIX workstations. At this campus of roughly twenty-three thousand students, there are more computers than telephones, and in the community at large (i.e., the Town of Blacksburg), roughly 40% of the population has a computer. The Town, University and Bell Atlantic "opened" the Blacksburg Electronic Village in Fall 1993, and now thousands have Internet access from their apartments, dorm rooms or offices. (NSF has just funded an effort to develop an electronic design history of "BEV".) All faculty and staff are involved in a four-year repeating cycle in which they receive a week of training and a workstation, equipping them to work on the Internet, handle electronic mail, participate in campus decentralized computing initiatives, and create electronic courseware to improve students' educational experience. Ethernet services run atop an FDDI backbone, which will be replaced by ATM service during 1995.

Because of this infrastructure and the large community of faculty, staff and students interested in the areas of material science and engineering, Virginia Tech was included in Elsevier's TULIP project. Roughly 40 journals in that area are being received on CD-ROMs that contain 300 dpi bitmap page images along with bibliographic and table of contents data. In 1995, this collection will be accessed on 486-type systems running OCLC's Guidon interface, connected over the network to the OCLC Newton search system that has been adapted to process and provide access to Elsevier's roughly 40 gigabytes of image data along with related text and indexes. A comprehensive usage study will involve extensive logs of accesses, which are restricted to the campus or community.

In a second project, Virginia Tech has taken the lead in plans for capturing, archiving and disseminating electronic theses and dissertations.
Working with University Microfilms International (UMI), the Council of Graduate Schools, the Coalition for Networked Information, and other groups, Virginia Tech's Graduate School, Library and Computing Center have been exploring this problem domain for several years. In particular, the Monticello Electronic Library initiative of SURA (Southeastern Universities Research Association) and SOLINET (SOutheastern LIbrary NETwork) launched a working group on electronic theses, dissertations, and technical reports in 1993 and, in an August 1994 workshop held at Virginia Tech, developed a phased plan to test the key concepts. Using Adobe Acrobat tools initially, and later adding in mechanisms for conversion to SGML, a pilot group of Southeastern universities will explore the viability of electronically capturing theses and dissertations. Ultimately, this effort should lead to a digital library of graduate research publications, which should be of great value for graduate education as well as for dissemination of research results.

A third effort, the Wide Area Technical Report Service, WATERS (Maly et al., 1994a, 1994b), has been underway since 1992, with support in 1993 from the National Science Foundation, and is being coupled with the Monticello Electronic Library. Principal investigators are located at Old Dominion University, State Univ. of NY at Buffalo, the University of Virginia and Virginia Tech. Here the focus is on technical reports in the topical area of computing. A particular challenge is to unify the efforts of this team and at least three other teams that have each adopted very different approaches, so that all computer science departments can contribute reports and can search, browse and retrieve full-text publications to the workstation.

The final project to consider involves building a DL in computer science (CS) and applying it to improve CS education.
This effort traces some of its concepts to ideas shared at a workshop on distributed expert-based information systems (Belkin et al., 1987). The CODER (COmposite Document Expert/extended/effective Retrieval) project involved building such a system at Virginia Tech, to serve as a testbed for artificial intelligence (AI) techniques in information retrieval (Fox, 1987). Many of the lessons learned, some of the knowledge bases developed, and several of the communication schemes devised for distributed AI were used in developing a next-generation online public access catalog system, MARIAN, at Virginia Tech (Fox et al., 1993a). While scheduled for direct use by the campus community in 1995 to access almost a million records of library catalog data, MARIAN is also being used in the Envision system.

With funding from NSF during 1991-1995, and support from ACM, the Envision team has prototyped a DL system for the computing literature (Fox et al., 1993b). Part of the work has involved developing SGML Document Type Definitions, converting typesetter data into an SGML archive based on those DTDs, and building a large collection of bibliographic records, review articles, full-text technical articles and video materials. Thousands of page images have been scanned in, and coupled with bibliographic records. A small collection of MPEG data has been prepared using special compression software, for use in educational activities (Fox & Abdulla, 1994).

Project activities also have included developing the Envision system. One component of that is a specialized object-oriented database system being developed by G. Averboch to replace the earlier system programmed by QiFan Chen. The largest component is the Envision backend system, which makes use of a version of MARIAN for searching. It manages data in an SGML archive, and converts documents that are selected for display to HTML, so they can be presented using a Mosaic browser.
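The SGML-to-HTML conversion step can be pictured with a minimal sketch. This is not the Envision converter: the element names in TAG_MAP (article, title, para, emph) are hypothetical, since the text does not give the Envision DTDs, and the sketch assumes fully tagged fragments with no attributes and no omitted end tags, which real SGML permits. It only shows the general idea of renaming archive elements to HTML elements at display time.

```python
import re

# Hypothetical mapping from archive element names to HTML element names.
# The real Envision DTDs are not described in the text.
TAG_MAP = {"article": "body", "title": "h1", "para": "p", "emph": "em"}

# Matches simple start and end tags without attributes, e.g. <PARA> or </PARA>.
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def sgml_to_html(fragment: str) -> str:
    """Rename mapped tags; drop unmapped tags but keep the text they enclose."""

    def rename(match):
        slash, name = match.group(1), match.group(2).lower()
        mapped = TAG_MAP.get(name)
        if mapped is None:
            return ""  # unknown element: remove the tag, leave its content
        return f"<{slash}{mapped}>"

    return _TAG.sub(rename, fragment)
```

For example, sgml_to_html("<TITLE>Envision</TITLE>") yields "<h1>Envision</h1>". A production converter would need a real SGML parser driven by the DTD rather than a regular expression.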
The backend talks with a specially tailored interface for query formulation, listing results, and visualizing the result set (Nowell et al., 1994). Overall, a user-centered design approach was undertaken; usability tests have shown keen appreciation of the interface.

The Envision system is part of the infrastructure supporting another NSF project, "Interactive Learning with a Digital Library in Computer Science", being carried out by investigators at Virginia Tech and Norfolk State University during 1993-1996 (Fox & Barnette, 1993). Support is given not only by ACM but also by other publishers such as the IEEE Computer Society. Extensive use is made of the World-Wide Web and browsers like Mosaic as well as the Hyper-G system. By the end of 1994 there were three "paperless" courses: "Networked Information" for freshmen, "Computer Professionalism" for juniors, and "Information Storage and Retrieval" for graduate students. (See http://ei.cs.vt.edu/EIproj.html for more information.) The first of these is a new one-credit course that can serve to promote "information literacy" and that covers not only tools available on the Internet, but also fundamental concepts of (digital) library and information science. Another new course aimed at seniors will be offered in similar fashion starting in Spring 1995: "Multimedia, Hypertext and Information Access". All told, perhaps 10 courses will be enhanced as a result of this project.

Besides the Envision system and Mosaic, the KMS system (Akscyn et al., 1988) has been extensively applied in at least 3 courses. It is used to access the "ACM Hypertext Compendium" and also to help students learn about hypertext and writing. It serves as an excellent computer supported cooperative work system, with near instantaneous sharing of "frames" even among remote sites connected by the Internet.
With all these tools, almost half of the Department of Computer Science at Virginia Tech will be engaged in trying to improve curriculum and learning in computer science through use of our prototype digital library. While other groups build specialized multimedia-based courseware, at enormous expense, the key concept in our project is to construct and build upon a DL so that courseware development and use of electronic reference materials are greatly simplified.

CONCLUSIONS

In 1945 Bush spoke of a "memex" to provide access to the world's information; half a century later we are building digital libraries that partially realize his vision. If we follow the normal software engineering and international standards approaches of developing requirements, an architectural framework or reference model, and a suite of supporting standards, we should be able to construct interoperable systems that will allow a global "virtual" digital library. Today we see glimpses of this future DL in various efforts, like those being carried out at Virginia Tech. People will work with page images of old publications, and make use of visualization methods to manage and gain an understanding of search results. Learning should improve, as paperless courses are developed atop DLs, where students interact in a variety of ways, and can keep track of their own trails, add links or nodes, and make personal annotations. The global knowledge network will evolve through a multitude of collaborative efforts, and individuals' mental association networks will be enhanced and extended through these electronic representations and tools to help manage them. All in all, digital libraries have great promise for facilitating learning, discovery, and dissemination of knowledge.

ACKNOWLEDGMENTS

The work at Virginia Tech described herein is the result of effective collaborations between several teams of faculty, staff and students. In the Envision Project, L.
Heath has played the important role of manager during the last crucial year of our efforts. D. Hix supervised the very careful interface design and development work of L. Nowell, who supervised the coding done by E. Labow. G. Averboch, D. Brueni and W. Wake made numerous contributions as graduate research assistants. On the related Interactive Learning project, co-principal investigators D. Barnette, H.R. Hartson, J.A.N. Lee, and C. Shaffer have all made significant contributions, as have Norfolk State Univ. co-investigators S. DeLoatch and J. Urquhart. Y. Su has provided general graduate research assistant support. In other related projects, co-principal investigators T. Nutter and M. Abrams played crucial roles. Staff members B. Cline and R. France have carried out the MARIAN development effort, with a variety of student support, especially by S. Teske; R. France has also helped with numerous other investigations over the last decade. Graduate assistants G. Abdulla and K. Dalal have assisted with related studies. Special thanks go to the National Science Foundation and PRC Inc. for funding many of our efforts, and to ACM and KSI for contributing data, software and support. Finally, thanks go to the Virginia Tech Department of Computer Science and the Computing Center for extensive cost sharing contributions.

REFERENCES

(Akscyn et al., 1988) R. Akscyn, D. McCracken, and E. Yoder. KMS: A Distributed Hypermedia System for Managing Knowledge in Organizations, Communications of the ACM, July 1988, 31(7): 820-835.

(Belkin et al., 1987) N. Belkin, C. Borgman, H. Brooks, T. Bylander, W. Croft, P. Daniels, S. Deerwester, E. Fox, P. Ingwersen, R. Rada, K. Sparck Jones, R. Thompson, and D. Walker. Distributed Expert-Based Information Systems: An Interdisciplinary Approach. Information Processing & Management 1987, 23(5): 395-409.

(Bush, 1945) V. Bush. As We May Think. The Atlantic Monthly, July 1945, 176: 101-108.

(Entlich et al., 1995) R. Entlich, L. Garson, M. Lesk, L.
Normore, J. Olsen, S. Weibel. Making a Digital Library: The Contents of the CORE Project. Communications of the ACM, 38(4), April 1995, to appear (as short article; long version will appear in ACM TOIS).

(Fox, 1987) E. Fox. Development of the CODER System: A Testbed for Artificial Intelligence Methods in Information Retrieval. Information Processing & Management, 1987, 23(4): 341-366.

(Fox, 1993a) E. Fox. Digital Libraries ("hot topics" section), IEEE Computer, Nov. 1993, 26(11): 79-81.

(Fox, 1993b) E. Fox, ed., Sourcebook on Digital Libraries: Report for the National Science Foundation, TR-93-35, VPI&SU Computer Science Dept., Dec. 1993, Blacksburg, VA. Available by anonymous FTP from directory pub/DigitalLibrary on info.cs.vt.edu, over 400 pages.

(Fox, 1994) E. Fox. How to make intelligent digital libraries. In Methodologies for Intelligent Systems, Proceedings of the 8th International Symposium, ISMIS '94, Charlotte, NC, Oct. 1994. Lecture Notes in Artificial Intelligence 869, Springer-Verlag, Berlin, 27-38.

(Fox & Abdulla, 1994) E. Fox and G. Abdulla. Digital Video Delivery for a Digital Library in Computer Science. High-Speed Networking and Multimedia Computing Workshop, IS&T/SPIE Symposium on Electronic Imaging Science and Technology, Feb. 6-10, 1994, San Jose, CA, 7 pages.

(Fox & Barnette, 1993) E. Fox and D. Barnette. Improving Education through a Computer Science Digital Library with Three Types of WWW Servers. In Proc. Second International WWW '94: Mosaic and the Web, WWW'94, Chicago, IL, Oct. 17-20, 1994.

(Fox & Lunin, 1993) E. Fox and L. Lunin. Introduction and Overview to Perspectives on Digital Libraries. Journal of the American Society for Information Science (JASIS), Sept. 1993, 44(8): 441-443. (Guest editor's introduction to special issue)

(Fox et al., 1993a) E. Fox, R. France, E. Sahle, A. Daoud, and B. Cline. Development of a Modern OPAC: From REVTOLC to MARIAN. Proc. 16th Annual Intern'l ACM SIGIR Conf.
on R & D in Information Retrieval, SIGIR '93, Pittsburgh, PA, June 27 - July 1, 1993, 248-259.

(Fox et al., 1993b) E. Fox, D. Hix, L. Nowell, D. Brueni, W. Wake, L. Heath, and D. Rao. Users, User Interfaces, and Objects: Envision, a Digital Library. Journal of the American Society for Information Science (JASIS), Sept. 1993, 44(8): 480-491.

(Gladney et al., 1994a) H. Gladney, Z. Ahmed, R. Ashany, N. Belkin, E. Fox and M. Zemankova. Digital Library: Gross Structure and Requirements (Report from a Workshop). IBM Research Report RJ9840, IBM Almaden Research Center, May, 1994. Virginia Tech Dept. of Computer Science Technical Report 94-25, June, 1994. Available by anonymous FTP from directory pub/DigitalLibrary on info.cs.vt.edu as RJ9840.ps

(Gladney et al., 1994b) H. Gladney, E. Fox, Z. Ahmed, R. Ashany, N. Belkin, and M. Zemankova. Digital Library: Gross Structure and Requirements: Report from a March 1994 Workshop. Digital Libraries '94, June 19-21, 1994, College Station, TX, ed. J. Schnase, J. Leggett, R. Furuta, T. Metcalfe, 101-107.

(Maly et al., 1994a) K. Maly, J. French, A. Selman and E. Fox. Wide Area Technical Report Service, TR_94_13, Old Dominion Univ. Dept. of Computer Science, June 1994.

(Maly et al., 1994b) K. Maly, J. French, A. Selman and E. Fox. The Wide Area Technical Report Server. In Proc. Second International WWW '94: Mosaic and the Web, WWW'94, Chicago, IL, Oct. 17-20, 1994, 523-533.

(Nowell et al., 1994) L. Nowell, E. Fox, L. Heath, D. Hix, W. Wake and E. Labow. Seeing Things Your Way: Information Visualization for a User-Centered Database of Computer Science Literature, TR-94-06, VPI&SU Computer Science Dept., Jan. 1994, Blacksburg, VA.

(Wiederhold, 1995) G. Wiederhold. Digital Libraries, Value, and Productivity. Communications of the ACM, 38(4), April 1995, to appear.