IRList Digest            Saturday, 27 August 1988      Volume 4 : Issue 45

Today's Topics:
   Query - NY Times Collection
   Abstracts - Dissertations selected by S. Humphrey [Part 2 of 5]

News addresses are
   Internet: fox@fox.cs.vt.edu or fox%fox.cs.vt.edu@dcssvx.cc.vt.edu
   BITNET: foxea@vtcc1.bitnet (replaces foxea@vtvax3)

----------------------------------------------------------------------

Date: Tue, 2 Aug 88 16:52 EDT
From: krovetz@UMass
Subject: NY Times test collection

Ed,
     Bruce Croft told me you might know about a test collection of articles from the NY Times. He said it was pretty small. Could it be sent in a mail message?
                                        Thanks,
                                        Bob

[Note: There is an old collection used in some of the SMART studies that might be available from Cornell - this is the only one other than the big collections (probably without queries) of NY Times documents made at several locations. Let me know if you turn up something and we can put it on Virginia Disc One (which is now almost ready!) - Ed]

------------------------------

Date: Wed, 3 Aug 88 13:36:58 EDT
From: "Susanne M. HUMPHREY"
Subject: dissertation abstracts

[Note: Part 2 of 5 - Ed.]

.[
AN University Microfilms Order Number ADG87-29523.
AU YU, JUNG-PIN.
IN The University of Iowa Ph.D 1987, 203 pages.
TI LIBRARY SEARCHES OF INFRARED SPECTRA AND MULTICOMPONENT ANALYSIS.
DE Chemistry, Analytical.
AB Odd moments of the cross-correlation function and average wavenumber methods are used for library searches. The absorbances of a spectrum are represented by 24 values if the spectrum is divided into twelve regions. The indirect spectral data base is split into twelve subsets according to the spectral regions with the most intense integrated intensities. Only a few subsets are used in a search. Several versions of integrated intensity filters are applied to pre-select the library entries.
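The pre-filtering scheme described above lends itself to a compact sketch. The following is a hypothetical modern illustration, not the author's code (the function names, the toy 24-point spectra, and the use of a plain Pearson correlation in place of the odd-moment and average-wavenumber measures are all invented for illustration): each library spectrum is filed under its most intense region, a query consults only the matching subset, and the surviving candidates are ranked by correlation coefficient.

```python
import numpy as np

N_REGIONS = 12  # the abstract divides each spectrum into twelve regions


def most_intense_region(spectrum):
    """Index of the region with the largest integrated intensity."""
    regions = np.array_split(np.asarray(spectrum), N_REGIONS)
    return int(np.argmax([r.sum() for r in regions]))


def build_index(library):
    """Partition library entries into subsets keyed by dominant region."""
    index = {}
    for name, spec in library.items():
        index.setdefault(most_intense_region(spec), []).append(name)
    return index


def search(query, library, index, top=5):
    """Pre-filter by the query's dominant region, then rank the surviving
    candidates by Pearson correlation coefficient with the query."""
    candidates = index.get(most_intense_region(query), [])
    scored = [(np.corrcoef(query, library[n])[0, 1], n) for n in candidates]
    return sorted(scored, reverse=True)[:top]
```

The 0.9000 correlation threshold mentioned in the abstract could be applied as a final cut on the returned scores.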
The best method of this work presently has about an 84% chance of locating an experimental spectrum among the best five matches if this spectrum is in the library and the correlation coefficient between them is not less than 0.9000. The average search time per spectrum is approximately 30 seconds with a Leading Edge Model D computer equipped with an 8087 co-processor. The search results by the correlation coefficient method are compared, and the effects of baseline shift, noise, and frequency shift are also investigated.

A factor analysis of n-component spectra generates n orthogonal spectra. Proper combinations of any two of the abstract spectra can eliminate one component and yield new spectra with n - 1 components. The orthogonal spectra obtained by a factor analysis of the newly generated spectra have n - 1 components each. Iterative subtraction processes finally resolve all pure component spectra. A minimum entropy method and a minimum pathlength method are used to evaluate the component elimination procedures. The relative component fractions of a sample with indeterminate pathlength can be determined by using normalized pure component spectra as references. The relative component fractions of a mixture are constant for each component across all samples. Q matrix and T matrix methods are investigated. The singular value decomposition method is found to be useful in spectral reconstruction.
.]

.[
AN This item is not available from University Microfilms International ADG05-62008.
AU CASAS-RAPOSO, IGNACIO ANTONIO.
IN University of Toronto (Canada) Ph.D 1986.
TI PROPHET: A LAYERED ANALYTICAL MODEL FOR PERFORMANCE PREDICTION OF DATABASE SYSTEMS.
DE Computer Science.
AB The efficiency and performance of database systems are substantially affected by many decisions which are required throughout the development stages. Unfortunately, the impact on performance of many design alternatives is not yet well understood.
In this thesis we propose an analytical performance model of database systems that provides a means of quantitative comparison of design alternatives. Four levels of abstraction are distinguished for design choices, models of representation, and related performance measures. Queuing network models are used at the bottom, most concrete level. Several subproblems within the layered framework are studied in detail, including schema and internal access plan models and a performance model for database buffer management. We also extend and refine several existing modeling techniques, such as schema-to-internal database mappings and models of internal database structures. All the proposed submodels are integrated within a generalized framework for database performance analysis, named PROPHET, with detailed models of computation at the four layers of abstraction. The applicability and feasibility of the framework are illustrated with applications to several present-day commercial DBMSs, including IMS, MISTRESS and SYSTEM 2000. We also describe the design and implementation of a specialization of our generalized framework to SYSTEM 2000 environments. We report preliminary results of the validation of the SYSTEM 2000 performance predictor against data measured in an actual operational environment.
.]

.[
AN University Microfilms Order Number ADG88-00136.
AU CHIN, YU-TUNG.
IN University of Rhode Island Ph.D 1987, 194 pages.
TI SYSTEM DESIGN AND IMPLEMENTATION OF A COMPUTERIZED RHODE ISLAND WATER RESOURCES INFORMATION SYSTEM.
DE Computer Science.
AB Water resources data from various federal and state agencies of Rhode Island were collected, analyzed and reorganized. The nature of the data was studied. A conceptual model representing the entities and relationships among the data was derived by means of a semantic data model. The conceptual model was then mapped into a logical file structure represented by a pseudo Data Definition Language.
A distributed configuration was chosen to implement the database in order to achieve the desired functional performance and cost effectiveness. Finally, the logical structure was realized by appropriate Database Management Systems on mainframe and personal computers. Programs were developed to support data entry as well as to maintain the integrity and security of the database. Applications of the system in water resources decision-making and management were demonstrated by examples.
.]

.[
AN This item is not available from University Microfilms International ADG05-62262.
AU FALOUTSOS, CHRISTOS NICK.
IN University of Toronto (Canada) Ph.D 1986.
TI SIGNATURE FILES: AN ACCESS METHOD FOR TEXTUAL MESSAGES.
DE Computer Science.
AB Signature files seem to be a promising method for text retrieval and message retrieval. According to this method, the messages are stored sequentially in one file (the "text file") while abstractions of the messages ("signatures") are stored sequentially in another file (the "signature file"). In order to resolve a query, the signature file is scanned first and many non-qualifying messages are immediately rejected. Those non-qualifying messages that pass the signature test are called false drops and are undesirable. Careful design of the signature extraction method and an appropriate signature size can restrict the average number of false drops. The screening capacity of some signature extraction methods is examined analytically. New methods are proposed and their performance is compared with that of one of the older methods. Finally, the relationship between the false drop probability and the information loss during signature extraction is investigated.
.]

.[
AN This item is not available from University Microfilms International ADG05-61631.
AU ICAZA, JOSE IGNACIO.
IN University of Waterloo (Canada) Ph.D 1987.
TI ADAPTIVE SELECTION OF QUERY PROCESSING STRATEGIES.
DE Computer Science.
AB Approximate cost models that are traditionally needed to evaluate query execution strategies are sometimes unavailable or too unreliable for particular environments. Our new approach for strategy selection is based on feeding back the actual costs of query execution under different strategies, rather than on assumption-laden estimates of these costs. We propose three different methods for using these feedback costs, under a common general framework for adaptive systems. In optimal selection we define the problem of designing an optimal decision policy for a single query class, where the policy specifies which strategy to execute for each query. This problem is mapped to some versions of the bandit problem in statistics, which can be solved using Bayesian statistics and dynamic programming. We present and analyze a program that derives the optimal policy. In approximate selection we use a learning automaton to select strategies. The learning algorithm by Thathachar and Sastry is modified to handle the dynamic environments of databases. This adaptive algorithm is tested using existing databases and query loads, showing how this approach determines the best execution strategies over time and changes which strategies are selected as the environment changes. Finally, the method of query class partitioning considers the problem of evolving optimal query partitionings, in which each partition includes all queries that have a common optimal strategy. We present a novel adaptive classification scheme that uses several learning automata.
.]

.[
AN University Microfilms Order Number ADG88-04134.
AU KEARNS, TIMOTHY G.
IN Air Force Institute of Technology Ph.D 1987, 351 pages.
TI A METHODOLOGY, BASED ON ANALYTICAL MODELING, FOR THE DESIGN OF PARALLEL AND DISTRIBUTED ARCHITECTURES FOR RELATIONAL DATABASE QUERY PROCESSORS.
DE Computer Science.
AB The design of faster relational database query processors to improve the data retrieval capability of a database was the goal of this research. The emphasis was on evaluating the potential of parallel implementations to allow the use of multiprocessing. First, the theoretical aspects of applying relational operations to distributed data were considered, to provide an underlying model of data distribution and the parallel processing environment. Next, analytical models were constructed to evaluate various implementations of the select, project, and join relational operations and the update operations of addition, deletion, and modification, for a range of data structures and architectural configurations. To predict the performance of the query processor in all cases, the individual operator models needed to be extended to models for complex queries consisting of several relational operations. The solution to modeling multi-step queries was the use of a general form to express a query. This normal form query tree uses combined operations to express relational-algebra-equivalent queries in a standard form. This standard tree form was then used to construct analytical models for multi-step queries. These models provide the capability to model the potential of different forms of parallelism in solving complex queries. The results of the analytical models lead to a logical design for a multiprocessor query processor. When the analytical model results for complex queries are compared to benchmarks of some current database systems, this logical query processor, using multiple processors and employing parallelism, illustrates the potential for an improved query processor based on parallel processing.
.]

.[
AN University Microfilms Order Number ADG88-01732.
AU LUCCI, STEPHEN.
IN City University of New York Ph.D 1987, 241 pages.
TI STRING MATCHING: A COMPARATIVE STUDY OF ALGORITHMS, AND ITS RELATION TO PROBLEMS OF PARALLEL AND DISTRIBUTED COMPUTING.
DE Computer Science.
AB Our purpose in this work is to provide a comparative study of string matching algorithms and their relationship to problems in parallel and distributed computing. We begin our discussion in chapters one and two with the Knuth-Morris-Pratt and Boyer-Moore algorithms. These optimal serial approaches to string matching provide a reference point against which their parallel counterparts in chapters three through five may be judged. Galil's parallel algorithm in chapter three searches for prefixes of the pattern of increasing length which occur within the text string. Vishkin's algorithm, described in chapter four, employs the concept of duels between various text positions to obtain a sparse set of indices within the text at which the pattern might begin. In chapter five the Rabin-Karp algorithm is outlined. This approach uses hashing functions called fingerprints, rather than character comparisons, as its basic operation. It is a serial algorithm which, unlike other serial approaches, is readily parallelizable. In chapter six we outline an architecture for a parallel processor for Galil's algorithm. The "parallel string processor" (PSP) is an SIMD computer with 2^16 PEs and a perfect shuffle interconnection network. In the process of specifying our design we consider sorting algorithms, connection networks, and various paradigms for data broadcasting. Hopefully, this work foreshadows the rapidly unfolding technological breakthroughs that will serve as a bridge to fifth generation systems.
.]

.[
AN This item is not available from University Microfilms International ADG03-75884.
AU SHEU, CHEN-YU.
IN University of California, Berkeley Ph.D 1986.
TI QUERY PROCESSING AND COMMUNICATION CONTROL IN OBJECT BASES AND DISTRIBUTED OBJECT BASES.
DE Computer Science.
.]

.[
AN This item is not available from University Microfilms International ADG05-62253.
AU TUORI, MARTIN IAN.
IN University of Toronto (Canada) Ph.D 1987.
TI A FRAMEWORK FOR BROWSING IN THE RELATIONAL DATA MODEL.
DE Computer Science.
AB This thesis addresses problems in HCI (Human-Computer Interaction) with information systems, especially database management systems (DBMSs). Inexperienced or occasional users are often unable or unwilling to cope with the linguistic, structural and intentional demands of traditional interfaces, especially high-level query languages. The primal form of interaction is looking, or searching. A four-part taxonomy of searching is given (intention, structure, language and modality) and used as a descriptive basis for a review of relevant HCI literature and interface approaches. The most important form of searching for casual use is browsing, defined here as searching without an explicit goal or intention. A three-part framework is presented for the design and evaluation of browsing interfaces: conceptual models of the interface, design guidelines, and evaluation criteria. A closed-loop model of the user interface is presented, along with a novel categorization of relational database interfaces by the way a user may approach the information (data-model-oriented, object-oriented, and domain-oriented) and by structural level (system, database, table and tuple). Two selection methods for domain-oriented, table-level browsing are presented: multidimensional range query and point of view using nearest neighbours. Their strengths and weaknesses are examined, and the nearest-neighbours approach is developed in detail. Problems arising from this method are examined, solutions are offered, and an implementation supporting both methods is presented and discussed. The descriptive taxonomy and development framework presented in this thesis offer a basis for understanding computer-assisted searching and for assessing the contributions of various interface approaches.
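The point-of-view selection method just described can be made concrete with a small sketch. This is a hypothetical illustration of the idea only; the table, attribute names, city-block distance, and range normalization below are invented for illustration and are not taken from the thesis:

```python
def nearest_neighbours(rows, point_of_view, attrs, k=3):
    """Rank table rows by city-block distance from a point-of-view tuple,
    normalizing each attribute by its range so no one column dominates."""
    spans = {a: (max(r[a] for r in rows) - min(r[a] for r in rows)) or 1.0
             for a in attrs}

    def dist(row):
        return sum(abs(row[a] - point_of_view[a]) / spans[a] for a in attrs)

    return sorted(rows, key=dist)[:k]


# Hypothetical table of apartments browsed by price and area.
apartments = [
    {"id": 1, "price": 400, "area": 50},
    {"id": 2, "price": 900, "area": 120},
    {"id": 3, "price": 450, "area": 55},
    {"id": 4, "price": 700, "area": 90},
]
view = {"price": 430, "area": 52}
closest = nearest_neighbours(apartments, view, ["price", "area"], k=2)
# closest holds the rows with id 3 and id 1, in that order
```

Range normalization is one simple way to keep an attribute with large units (price) from swamping one with small units (area); as the abstract notes, the thesis examines the problems arising from this method and offers solutions.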
Previous literature on HCI is based on the premise of a well-defined intention; this premise must be rejected in favour of a model of HCI in which the user's intention is variable. Computerized information systems capable of supporting a wide range of users will result from the integration of many interface approaches; browsing offers a useful approach for initial, exploratory search, especially by inexperienced or occasional users.
.]

.[
AN University Microfilms Order Number ADG88-04620.
AU TURBYFILL, CAROLYN.
IN Cornell University Ph.D 1988, 286 pages.
TI COMPARATIVE BENCHMARKING OF RELATIONAL DATABASE SYSTEMS.
DE Computer Science.
AB New and diverse architectures have been developed to meet rising expectations of both functionality and performance from relational database systems. A comprehensive, portable, comparative benchmark is needed to evaluate the performance tradeoffs inherent in these different architectures. In this thesis, we develop an experimental framework for systematically evaluating and comparing relational database systems across diverse architectures. We describe and motivate the design of a scalable, portable benchmark for relational database systems, the AS3AP benchmark (ANSI SQL Standard Scalable And Portable). The AS3AP benchmark is designed to provide meaningful measures of database processing power and to be a useful tool for system designers. We introduce a new performance metric, the equivalent database ratio, to be used in comparing systems. The equivalent database ratio is the ratio of database sizes on two different systems for which equivalent performance on a set of test queries is obtained. The benchmark consists of two parts: tests of the access methods and building blocks, and tests of the optimizer. In the tests of the access methods and building blocks, access methods and functions common to implementations of relational database systems are tested.
The relations and queries used in these tests are designed to increase the likelihood that the access method or program branch targeted by the test will actually be chosen by the query optimizer. The tests of the access methods and building blocks are used to compute the equivalent database ratio. In the tests of the optimizer, assumptions typically made by query optimizers, such as the uniform distribution of data values or the random placement of tuples, are systematically violated. Comparisons between queries are used to indicate whether the optimizer has made a correct choice. Queries are included to illustrate current limitations in the state of the art in query optimization. For a representative cross section of access methods, estimates of the best-case response time for the benchmark queries on disk-bound systems are computed. The response time estimates and the systematic benchmarking methodology provide for clear interpretation of benchmark results.
.]

[Note: continued in next issue - Ed]

------------------------------

END OF IRList Digest
********************