<?xml version="1.0" encoding="UTF-8"?>

<etd>
	<front>
		<title bbox="[237.66, 390.49, 1461.25, 481.52]" conf="0.85" pg_no="0">Analyzing and Navigating Electronic Theses and Dissertations</title>
		<author bbox="[739.05, 553.54, 959.07, 610.75]" conf="0.74" pg_no="0">Aman Ahuja</author>
		<university bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
		<degree bbox="[411.68, 690.48, 1288.93, 1169.45]" conf="0.42" pg_no="0">Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Applications</degree>
		<committee bbox="[660.07, 1242.06, 1038.53, 1573.78]" conf="0.7" pg_no="0">Edward A. Fox, Chair Chris North Lifu Huang Eugenia Rho Wei Wei</committee>
		<date bbox="[728.08, 1658.18, 968.88, 1714.38]" conf="0.49" pg_no="0">June 12, 2023</date>
		<abstracts>
			<abstract>
				<abs_heading bbox="[166.08, 763.95, 1530.22, 2030.08]" conf="0.4" pg_no="1">Abstract</abs_heading>
				<abs_text>Electronic Theses and Dissertations (ETDs) contain valuable scholarly information that can be of immense value to the scholarly community. Millions of ETDs are now publicly available online, often through one of many digital libraries. However, since a majority of these digital libraries are institutional repositories with the objective being content archiving, they often lack end-user services needed to make this valuable data useful for the scholarly community. To effectively utilize such data to address the information needs of users, digital libraries should support various end-user services such as document search and browsing, document recommendation, as well as services to make navigation of long PDF documents easier. In recent years, with advances in the field of machine learning for text data, several techniques have been proposed to support such end-user services. However, limited research has been conducted towards integrating such techniques with digital libraries. This research is aimed at building tools and techniques for discovering and accessing the knowledge buried in ETDs, as well as to support end-user services for digital libraries, such as document browsing and long document navigation. First, we review several machine learning models that can be used to support such services. Next, to support a comprehensive evaluation of different models, as well as to train models that are tailored to the ETD data, we introduce several new datasets from the ETD domain. To minimize the resources required to develop high quality training datasets required for supervised training, a novel AI-aided annotation method is also discussed. Finally, we propose techniques and frameworks to support the various digital library services such as search, browsing, and recommendation.
The key contributions of this research are as follows:
• A system to help with parsing long scholarly documents such as ETDs by means of object-detection methods trained to extract digital objects from long documents. The parsed documents can be used for further downstream tasks such as long document navigation, figure and/or table search, etc.
• Datasets to support supervised training of object detection models on scholarly documents of multiple types, such as born-digital and scanned. In addition to manually annotated datasets, a framework (along with the resulting dataset) for AI-aided annotation also is proposed.
• A web-based system for information extraction from long PDF theses and dissertations, into a structured format such as XML, aimed at making scholarly literature more accessible to users with disabilities.
• A topic-modeling based framework to support exploration tasks such as searching and/or browsing documents (and document portions, e.g., chapters) by topic, document recommendation, topic recommendation, and describing temporal topic trends.</abs_text>
			</abstract>
		</abstracts>
		<tocs>
			<tocs>
				<toc_heading bbox="[173.95, 276.97, 521.31, 379.57]" conf="0.85" pg_no="6">Contents</toc_heading>
				<toc_text>List of Figures xiii
List of Tables xv
1 Introduction 2
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Overview of Chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Author’s Prior Work and Publications . . . . . . . . . . . . . . . . . . . . . 7
2 Review of Literature 10
2.1 Document Layout Analysis: Datasets . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Document Layout Analysis: Annotation Methods . . . . . . . . . . . . . . . 10
2.3 Document Layout Analysis: Techniques . . . . . . . . . . . . . . . . . . . . . 11
2.4 Analysis of ETDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Topic Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Parsing Long PDF Documents Using Object Detection 14
3.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 ETD Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.2 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 List of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.4 Main Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 Dataset Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.2 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.3 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.1 Data and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.2 Element Extraction using Object Detection . . . . . . . . . . . . . . 20
3.4.3 Post-Processing Extracted Objects . . . . . . . . . . . . . . . . . . . 22
3.5 Object Detection Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.2 Analysis of Various Object Detection Models Trained on ETD-OD . 25
3.6.3 Analysis of Detection Performance on Different Object Categories . . 26
3.6.4 Comparison against Other Layout Detection Datasets . . . . . . . . . 26
4 Augmentation-Based Training for Layout Analysis Models 28
4.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Types of Image Transformations . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3.1 Brightness and Contrast . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.2 Erosion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.3 Dilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.4 Borders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.5 Downscale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.6 Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.7 Salt and Pepper Noise . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.8 Random Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4.2 Layout Detection of Digital ETDs . . . . . . . . . . . . . . . . . . . 32
4.4.3 Layout Detection of Scanned ETDs . . . . . . . . . . . . . . . . . . . 33
4.4.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 AI-Aided Annotation for Developing Layout Analysis Datasets 36
5.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Proposed AI-aided Annotation Scheme . . . . . . . . . . . . . . . . . . . . . 39
5.2.1 Dataset Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.2 Weak Labels Using Pre-Trained Model . . . . . . . . . . . . . . . . . 40
5.2.3 Optional Filtering for Specific Object Classes . . . . . . . . . . . . . 41
5.2.4 Manual Verification and Correction . . . . . . . . . . . . . . . . . . . 41
5.3 ETD-ODv2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3.1 Scanned Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3.2 Page Images with Minority Elements . . . . . . . . . . . . . . . . . . 43
5.3.3 Dataset Source and Object Classes . . . . . . . . . . . . . . . . . . . 44
5.3.4 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4.1 Annotation Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4.2 Object Detection Performance . . . . . . . . . . . . . . . . . . . . . . 48
6 Structured Representations of Long Scholarly Documents 53
6.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 XML Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.3 XML Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3.1 Identifying Delimiters . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.3.2 Linking Figures and Tables with their Captions . . . . . . . . . . . . 58
6.3.3 Linking Equations and Equation Numbers . . . . . . . . . . . . . . . 59
6.4 PDF to HTML Browser for Improved Accessibility . . . . . . . . . . . . . . 59
6.4.1 User-friendly View of Long Documents . . . . . . . . . . . . . . . . . 59
6.4.2 Improved Accessibility for Those with Disabilities . . . . . . . . . . . 60
6.5 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.5.1 Side Bar for Navigation . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.5.2 PDF View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.5.3 Document View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7 Topic Modeling based System for Analyzing and Browsing ETDs 63
7.1 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2.1 Data Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2.2 Topic Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.2.3 User Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.3 System Setup and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3.1 Dataset and System Details . . . . . . . . . . . . . . . . . . . . . . . 69
7.3.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3.3 Comparison of Different Topic Models . . . . . . . . . . . . . . . . . 70
7.4 Integrating ETD-Topics with Other End-User Services . . . . . . . . . . . . 71
7.4.1 Overview of Information Retrieval Systems . . . . . . . . . . . . . . . 71
7.4.2 Integrating Document Recommendation with Document Retrieval . . 72
7.4.3 Extending Topic Modeling from Documents to Chapters . . . . . . . 73
7.5 Further Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8 Conclusion 75
8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.2 Summary of Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Bibliography 79</toc_text>
			</tocs>
			<tocs>
				<toc_heading bbox="[169.57, 273.39, 713.16, 386.86]" conf="0.86" pg_no="12">List of Figures</toc_heading>
				<toc_text>1.1 An overview of the different chapters, along with their key components, as
proposed in this thesis. (* indicates the components are largely beyond the
scope of this thesis.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 Architecture of the proposed object detection based parsing framework. . . . 21
3.2 Examples of outputs generated by the Faster-RCNN* and YOLOv7 models. 27
4.1 An example of a page with its augmented versions. . . . . . . . . . . . . . . 35
5.1 Examples of pages from scanned documents. . . . . . . . . . . . . . . . . . . 37
5.2 An illustration showing a page from a scanned document, the annotations
generated by an object detection model trained on a small dataset, and the
final annotations after correction by a human annotator. . . . . . . . . . . . 38
5.3 Architecture of the proposed AI-aided annotation framework. . . . . . . . . 39
5.4 Annotation time for each annotator under different annotation settings. . . . 47
6.1 An overview of the PDF to XML to HTML system. . . . . . . . . . . . . . . 61
7.1 An overview of ETD-Topics. . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2 A snapshot of different user services. (a) Documents per Topic Distribution
and Topic List, (b) Similar Topics and Topic Specific Documents for one topic,
(c) Document page showing Related Topics and Similar Documents for one
document, (d) Trend Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.3 Integrating search engine module with topic models from ETD-Topics framework
for document recommendation. An example of a search query, its search
results returned by a BM25 based search engine, and recommended documents
for one highlighted document are shown. . . . . . . . . . . . . . . . . . . . . 72</toc_text>
			</tocs>
			<tocs>
				<toc_heading bbox="[169.32, 272.06, 679.96, 383.77]" conf="0.86" pg_no="14">List of Tables</toc_heading>
				<toc_text>3.1 Distribution of different object categories in our dataset. Note: Some of
the documents were accompanied with front matter (metadata) pages that are
sometimes generated by the digital libraries. We include annotations for such
documents as well, and hence, the number of metadata elements does not
exactly match the number of documents. . . . . . . . . . . . . . . . . . . . . 19
3.2 mAP comparison for object detection models on ETD-OD. Faster-RCNN*
represents the model pre-trained on DocBank and fine-tuned on ETD-OD.
Underlined values indicate best performing models. . . . . . . . . . . . . . . 25
3.3 AP@0.5 values for different object categories for YOLOv7 (Abs. = Abstract,
LOC = List of Contents). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 AP@0.5 values for categories supported by DocBank using Faster-RCNN
trained on different datasets and evaluated on the validation set of ETD-OD.
For Caption, we list the Figure Caption / Table Caption values for models
trained on ETD-OD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 mAP scores of two different versions of YOLOv7 on test set consisting of
digital ETDs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 mAP scores of two different versions of YOLOv7 on test set consisting of
scanned ETDs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 ETD-ODv2 dataset statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Distribution of the test dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Statistics of different versions of the data set used for training. . . . . . . . . 49
5.4 Object detection performance results. . . . . . . . . . . . . . . . . . . . . . . 50
7.1 Quantitative comparison of different models, with underlined values indicating
best performing models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2 Corresponding words for a topic from different models. . . . . . . . . . . . . 70</toc_text>
			</tocs>
		</tocs>
	</front>
	<body>
		<chapter>
			<title bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs>
					<para bbox="[166.08,763.95,1530.22,2030.08]" conf="0.93" pg_no="3">Electronic Theses and Dissertations (ETDs) contain scholarly information that can be of immense value to the research community. Millions of ETDs are now publicly available online, often through one of many online digital libraries. However, since a majority of these digital libraries are institutional repositories whose primary objective is content archiving, they often lack the end-user services needed to make this valuable data useful for the scholarly community. To effectively utilize such data to address the information needs of users, digital libraries should support various end-user services such as document search and browsing, document recommendation, as well as services to make navigation of long PDF documents easier and more accessible. Several advances in the field of machine learning for text data in recent years have led to the development of techniques that can serve as the backbone of such end-user services. However, limited research has been conducted towards integrating such techniques with digital libraries. This research is aimed at building tools and techniques for discovering and accessing the knowledge buried in ETDs, by parsing the information contained in the long PDF documents that make up ETDs into a more compute-friendly format. This would enable researchers and developers to build end-user services for digital libraries. We also propose a framework to support document browsing and long document navigation, which are some of the important end-user services required in digital libraries.</para>
					<para bbox="[164.01, 500.73, 1534.28, 1970.0]" conf="0.94" pg_no="4">I would like to express my heartfelt gratitude and appreciation to the following individuals and organizations who have played a pivotal role in the completion of this thesis:
First and foremost, I am deeply indebted to my advisor, Dr. Edward A. Fox, for his unwavering guidance, invaluable insights, and continuous support throughout the research process. His expertise and dedication have been instrumental in shaping the direction and quality of this work.
I extend my sincere thanks to the members of my thesis committee, Dr. Chris North, Dr. Lifu Huang, Dr. Eugenia Rho, and Dr. Wei Wei, for their valuable feedback, constructive criticism, and scholarly contributions. Their expertise and scholarly perspectives have greatly enriched the content of this thesis.
I am grateful to the undergraduate students, Alan Devera, Kecheng Zhu, Jiangyue Li, Zachary Gager, You Peng, Shelby Neal, Andrew Leavitt, Annie Tran, Brian Dinh, Kevin Dinh, Jiayue Lin, Kevin Liu, Mingkai Pang, Theodore Gunn, Zehua Zhang, Luke Wevley, Michael Nader, Elizabeth Keegan, and Gabrielle Nguyen, as well as graduate students, Chenyu Mao and Nirmal Amirthalingam, who have actively participated in this research. Their dedication, hard work, and insightful discussions have significantly contributed to the success of this thesis.
I would also like to express my thanks to William A. Ingram and the University Libraries for their invaluable support throughout the research process. Their resources, access to materials, and assistance in navigating academic databases have been indispensable.
A special mention goes to the members of the Digital Library Research Laboratory for their technical support and collaboration. Their expertise, assistance, and helpful discussions have</para>
					<para bbox="[170.22, 278.76, 1532.32, 842.38]" conf="0.8" pg_no="5">contributed to the development and refinement of the research methodology and implementation.
I am deeply grateful to my family and friends for their unwavering support, encouragement, and understanding throughout this journey. Their belief in my abilities and constant motivation have been crucial in overcoming challenges and staying focused on the thesis.
While the list of individuals mentioned here is not exhaustive, each and every one of them has played a significant role in the completion of this thesis. I am truly grateful for their support and contributions.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
		</chapter>
		<chapter>
			<title bbox="[163.78, 265.61, 666.03, 604.75]" conf="0.79" pg_no="17">Chapter 1
Introduction</title>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[177.63, 735.89, 992.22, 824.15]" conf="0.8" pg_no="17">1.1 Background and Motivation</name>
				<paragraphs>
					<para bbox="[170.9, 932.13, 1527.7, 2020.19]" conf="0.95" pg_no="17">Scholarly documents like ETDs contain important research findings, which are of value to a
diverse group of users from the scholarly community. Examples of such users include students
and researchers who want to review work related to their research area, as well as librarians
and university administrators who want an overview of recent research in their institutions.
With the vast amount of research being conducted across a variety of domains, millions of
ETDs are now publicly available online. However, digital library services for ETDs have not
evolved past simple search and browse at the metadata level, thus rendering the vast amount
of information from these documents underutilized.
In recent years, advances have been made in NLP-based techniques such as topic modeling,
question-answering and text summarization, which might be incorporated to make ETDs
more accessible. However, a majority of these documents exist as PDF files, and are often
long and filled with highly specialized details. While some tools can work with these files,
the results we have observed have been poor; other tools require data in a structured format
such as XML. Accordingly, there is a need to build electronic infrastructure that can leverage
the rich scholarly information contained within ETDs and make it accessible to the wider
community.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[173.9, 275.32, 796.45, 366.23]" conf="0.79" pg_no="18">1.2 Problem Statement</name>
				<paragraphs>
					<para bbox="[169.68, 482.93, 1529.45, 2027.37]" conf="0.89" pg_no="18">This thesis aims to develop methodologies that can support making the knowledge contained
in ETDs more accessible for digital library users. Although a comprehensive digital library
system should ideally support multiple end-user services, like browsing, search, retrieval, and
question-answering, the foremost requirement for any such service is to have data available
in a structured, machine-friendly format such as XML. Hence, a major contribution of this
thesis is a framework to parse ETDs in PDF into structured formats like XML. The
parsed documents can then be employed for training models that support end-user services.
They can also help make long documents more accessible by breaking them down into
multiple smaller components like chapters and sections. Moreover, structured representations
such as XML can be used to develop web-based systems, which have better compatibility with
accessibility tools such as screen readers, thus allowing those with disabilities to access this
information. Given the recent success of object detection models in document layout analysis,
we will take the object detection approach for this work. We will develop methodologies
that can address several challenges that arise in the process of parsing long PDF documents.
These include limited availability of training data, heterogeneity in document types, the
imbalanced number of elements in the various classes, and the resource-intensive nature of
dataset annotation. A further challenge is that there is currently no mechanism for parsing
extracted objects to determine relationships among them, and for converting them into a
structured format to make them accessible to users with special needs.
We also will investigate how this parsed information can be used for downstream tasks.
Regarding the scope of this work, we will focus on techniques that can be helpful in navigating
and browsing documents from a digital library. For example, consider that the most intuitive
way of making a browsing system is to group items by categories. Users can select a category</para>
					<para bbox="[176.53, 287.2, 1514.39, 828.21]" conf="0.86" pg_no="19">of their preference and browse the respective documents. However, in case of scholarly
documents, grouping documents by research areas is a non-intuitive task, since many documents
only contain subject/department level information, which is often very high level.
It is hard to classify documents based on pre-defined categories, due to the absence of a
unified list of categories and the datasets essential for training such models. Hence, we will
study unsupervised methods such as topic modeling for this task. The resultant topics or
categories, as well as their respective documents, can then be used for supporting document
browsing by research area in a digital library.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[178.92, 936.03, 833.14, 1023.41]" conf="0.8" pg_no="19">1.3 Research Hypotheses</name>
				<paragraphs>
					<para bbox="[184.06, 1072.13, 1533.95, 2042.39]" conf="0.94" pg_no="19">The central hypotheses of this research are listed below:
• H1: Object detection based document layout analysis methods for long scholarly documents, trained on high quality domain-specific labeled data, perform better than those trained on a larger dataset originating from other related domains, such as research papers.
• H2: Pre-training on other scholarly datasets, albeit from a different domain such as research papers, improves the performance of document layout analysis methods on long scholarly documents such as ETDs.
• H3: Training on derived datasets, such as augmented versions of the original training data, can significantly improve the performance of layout analysis models.
• H4: To perform well on other document types, such as scanned documents, models trained on a specific type of documents, such as born-digital ones, require additional training using techniques, like augmentation, that help bridge the distribution gap.
• H5: AI-aided annotation methods, such as using models trained on existing smaller</para>
					<para bbox="[196.17, 278.2, 1530.96, 975.1]" conf="0.94" pg_no="20">datasets to extract weak labels for unlabeled data, reduce the resources required for annotating additional data.
• H6: Models trained on datasets with skewed distributions in terms of class labels achieve better performance on minority classes when trained on additional data from those classes, such as from AI-aided annotation methods.
• H7: Combining the predictive power of AI models with rules formulated based on domain expertise possessed by humans reduces errors in predictive tasks such as document structure parsing.
• H8: Neural topic models can outperform other traditional topic models, such as LDA, while doing topic modeling on scholarly documents such as ETDs and their chapters.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[176.01, 1069.89, 796.24, 1155.6]" conf="0.79" pg_no="20">1.4 Research Questions</name>
				<paragraphs>
					<para bbox="[179.39, 1201.04, 1531.65, 2034.73]" conf="0.95" pg_no="20">Based on the hypotheses listed above, the work proposed in this thesis will focus on the following research questions:
• R1: What are the different elements that are important in an ETD that can be helpful for training machine learning models for downstream tasks like searching, browsing, question-answering, etc.? How can we develop a dataset that can support training supervised machine learning models to extract these elements from an ETD?
• R2: Are datasets from other related domains, such as research papers, sufficient to train layout analysis methods for ETDs? How can these datasets benefit layout analysis methods for ETDs, when used in conjunction with domain specific datasets?
• R3: What type of augmentation strategies can be used to derive more training data for object detection models? How can we use augmented datasets to improve the performance of object detection models?</para>
					<para bbox="[187.72, 273.41, 1532.9, 1169.97]" conf="0.94" pg_no="21">• R4: Can document analysis methods trained on documents of one type, such as digital PDF documents, facilitate document analysis on other types of documents, such as scanned documents?
• R5: How can annotation methods utilize the power of models trained on existing datasets, to reduce the resources required in the annotation process?
• R6: How can we improve the performance of machine learning models, especially on minority classes, using datasets developed using AI-aided annotation?
• R7: How can domain expertise, such as a set of rules about syntax and structure that are known to domain experts, be used to develop a set of post-processing rules, which when used in combination with machine learning methods, improve the process of document layout analysis?
• R8: Can neural topic models outperform traditional topic models such as LDA, on commonly used topic evaluation metrics, such as coherence and topic diversity?</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[173.78, 1266.74, 850.25, 1355.53]" conf="0.77" pg_no="21">1.5 Overview of Chapters</name>
				<paragraphs>
					<para bbox="[182.6, 1398.57, 1529.02, 2040.15]" conf="0.95" pg_no="21">Figure 1.1 gives a high-level overview of the chapters of this thesis, along with their respective contributions. The rest of this dissertation is organized as follows:
• Chapter 2 outlines some of the important techniques and datasets related to the work proposed in this thesis.
• Chapter 3 introduces a list of important elements commonly found in ETDs, and a new dataset for training document layout analysis models on ETDs. It also describes training for object detection, and an evaluation of related models.
• Chapter 4 proposes an augmentation-based training approach for object detection models. Experimental results showing how co-training on augmented data alongside original</para>
					<para bbox="[198.55, 276.31, 1533.05, 1034.89]" conf="0.9" pg_no="22">data can improve the performance of object detection models on layout analysis are also presented.
• Chapter 5 introduces an AI-assisted framework for annotating object detection data to improve the performance of layout analysis methods on minority classes. A new dataset to support layout analysis of scanned ETDs, as well as to improve extraction of low-frequency elements such as metadata and algorithms, is also presented.
• Chapter 6 proposes a parsing framework to generate structured representations of long scholarly documents using the set of objects derived from object detection models.
• Chapter 7 introduces a framework for utilizing the elements extracted from document layout parsing, for downstream tasks such as browsing and recommendation, by means of topic modeling.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.94, 1134.51, 1231.12, 1219.98]" conf="0.84" pg_no="22">1.6 Author’s Prior Work and Publications</name>
				<paragraphs>
					<para bbox="[178.69, 1264.59, 1533.05, 2049.97]" conf="0.87" pg_no="22">During the course of the doctoral program, the author of this dissertation has published several
peer-reviewed papers in the domains of retrieval and ranking, question-answering, and topic
modeling. These are listed below. Entries 1-4 relate closely to this dissertation, and the
contributions of the co-authors thereof are hereby acknowledged.
1. Satvik Chekuri, Prashant Chandrasekar, Bipasha Banerjee, Sung Hee Park, Nila Masrourisaadat,
Aman Ahuja, William A. Ingram and Edward A. Fox, “Integrated Digital Library System for
Long Documents and their Elements.” In Proceedings of the 23rd ACM/IEEE-CS Joint
Conference on Digital Libraries (JCDL 2023).
2. Aman Ahuja, Kevin Dinh, Brian Dinh, William A. Ingram, and Edward A. Fox. “A New
Annotation Method and Dataset for Layout Analysis of Long Documents.” In Companion
Proceedings of the ACM Web Conference 2023, pp. 834-842. 2023,</para>
<para bbox="[196.19, 1485.07, 1533.52, 2043.9]" conf="0.4" pg_no="23">https://doi.org/10.1145/3543873.3587609.
3. Aman Ahuja, Alan Devera, and Edward A. Fox, “Parsing Electronic Theses and
Dissertations Using Object Detection.” In Proceedings of the First Workshop on
Information Extraction from Scientific Publications (WIESP 2022, held in conjunction
with AACL-IJCNLP 2022), https://aclanthology.org/2022.wiesp-1.14.pdf.
4. Aman Ahuja, Chenyu Mao, William A. Ingram, and Edward A. Fox, “Analyzing
and Navigating ETDs Using Topic Models.” In The Journal of Electronic Theses and
Dissertations (to appear).</para>
		<para bbox="[196.4, 285.36, 1548.32, 1891.6]" conf="0.94" pg_no="24">5. Ming Zhu, Aman Ahuja, Da-Cheng Juan, Wei Wei, and Chandan K. Reddy. “Question
Answering with Long Multiple-Span Answers.” In Findings of the Association for
Computational Linguistics: EMNLP 2020, pp. 3840-3849. 2020.
6. Aman Ahuja, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan K.
Reddy. “Language-Agnostic Representation Learning for Product Search on E-Commerce
Platforms.” In Proceedings of the 13th International Conference on Web Search and
Data Mining, pp. 7-15. 2020.
7. Xuan Zhang, Zhilei Qiao, Aman Ahuja, Weiguo Fan, Edward A. Fox, and Chandan
K. Reddy. “Discovering Product Defects and Solutions from Online User Generated
Contents.” In The World Wide Web Conference, pp. 3441-3447. 2019.
8. Ming Zhu, Aman Ahuja, Wei Wei, and Chandan K. Reddy. “A Hierarchical Attention
Retrieval Model for Healthcare Question Answering.” In The World Wide Web
Conference, pp. 2472-2482. 2019.
9. Aman Ahuja, Ashish Baghudana, Wei Lu, Edward A. Fox, and Chandan K. Reddy.
“Spatio-Temporal Event Detection from Multiple Data Sources.” In Pacific-Asia
Conference on Knowledge Discovery and Data Mining, pp. 293-305. Springer, Cham,
2019.
10. Vineeth Rakesh, Weicong Ding, Aman Ahuja, Nikhil Rao, Yifan Sun, and Chandan
K. Reddy. “A Sparse Topic Model for Extracting Aspect-Specific Summaries from
Online Reviews.” In Proceedings of the 2018 World Wide Web Conference, pp. 1573-1582. 2018.
11. Aman Ahuja, Wei Wei, Wei Lu, Kathleen M. Carley, and Chandan K. Reddy. “A
Probabilistic Geographical Aspect-Opinion Model for Geo-tagged Microblogs.” In 2017
IEEE International Conference on Data Mining (ICDM), pp. 721-726. IEEE, 2017.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[229.11, 319.22, 1473.1, 1273.78]" conf="0.93" pg_no="23">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_23_9897.jpg</path>
						<caption bbox="[192.05, 1285.28, 1512.34, 1431.86]" conf="0.86" pg_no="23">Figure 1.1: An overview of the different chapters, along with their key components, as
proposed in this thesis. (* indicates the components are largely beyond the scope of this
thesis.)</caption>
					</figure>
				</figures>
				<tables/>
			</section>
		</chapter>
		<chapter>
			<title bbox="[168.49, 268.0, 935.7, 606.33]" conf="0.89" pg_no="25">Chapter 2
Review of Literature</title>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.41, 736.14, 1207.03, 821.97]" conf="0.86" pg_no="25">2.1 Document Layout Analysis: Datasets</name>
				<paragraphs>
					<para bbox="[175.64, 858.64, 1527.56, 1545.15]" conf="0.91" pg_no="25">With the growing interest in using object detection based methods for document layout
analysis, several datasets have been introduced. Many of these datasets focus on specific object
types. For instance, TableBank [23], ScanBank [19], and MFD [3] consist of tables, figures,
and equations, respectively. Several datasets that consist of a diverse set of objects have
also been introduced. HJDataset [34] consists of historical Japanese documents. PRImA [4]
consists of document images from magazines and research papers. PubLayNet [47] is based
on PDF articles from PubMed Central. The number of different objects, however, is limited
in these datasets. DocBank [24] is a large dataset that consists of a diverse set of objects
from research papers. But given the differences between research papers and long documents
such as ETDs, models trained on DocBank do not generalize well to ETDs.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[186.03, 1624.03, 1493.77, 1712.81]" conf="0.83" pg_no="25">2.2 Document Layout Analysis: Annotation Methods</name>
				<paragraphs>
					<para bbox="[182.12, 1753.81, 1514.96, 2028.62]" conf="0.94" pg_no="25">Due to the intensive nature of dataset annotation in terms of time and cost, researchers have
proposed several techniques to annotate training datasets for object detection models. For
PDF documents with an accompanying MS-Word, XML, or LaTeX file, automatic extraction
based on tags is possible [23, 24]. However, in the case of scanned documents, existing rule-</para>
					<para bbox="[175.14, 275.64, 1518.28, 508.93]" conf="0.9" pg_no="26">based approaches do not yield high-quality results. In such cases, techniques have been
explored that can help annotators, or guide them in annotating samples about which the
model is most uncertain [48].</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[182.29, 729.68, 1265.64, 822.39]" conf="0.84" pg_no="26">2.3 Document Layout Analysis: Techniques</name>
				<paragraphs>
					<para bbox="[169.46, 890.76, 1529.52, 2032.56]" conf="0.96" pg_no="26">Early works in the domain of document layout understanding used rule-based approaches
[14, 22]. Other approaches, e.g., GROBID [26] and CERMINE [38], designed for parsing
scientific documents, primarily focused on short documents such as research papers, and
use an ensemble of sequence labeling methods for document parsing. With the advent of
deep-learning based object detection methods such as Fast-RCNN [12], Faster-RCNN [32],
and YOLO [30, 40], document layout analysis based on object detection has been proposed.
LayoutParser [35] uses object detection models that have been pre-trained on different
object detection datasets to support layout understanding. However, since it primarily uses
research-paper based datasets, it does not perform well on ETDs. Moreover, the number
of object types it supports is very limited. More recently, layout-based language models
[17, 45, 46] have been proposed. This line of work uses a multimodal architecture, i.e., a
combination of visual and textual features, to pre-train the model on a large corpus of
unlabeled data consisting of document images and their corresponding text. Although these
models can then be fine-tuned on other downstream tasks such as object detection, they still
require domain-specific annotated data for fine-tuning. Recently, to make documents
more accessible, services such as SciA11y [41] have been developed. However, their scope is
limited to research papers, rather than long documents such as books and ETDs.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[174.66, 276.32, 748.07, 363.24]" conf="0.81" pg_no="27">2.4 Analysis of ETDs</name>
				<paragraphs>
					<para bbox="[173.98, 403.02, 1522.39, 828.79]" conf="0.95" pg_no="27">With the growing number of ETDs that are publicly available on the web, techniques aimed
at analyzing ETDs have also gained interest in the research community. [39] proposes a
framework for automatic crawling of ETDs from public repositories, as well as the resultant
corpus of ETDs. An important line of work in the analysis of ETDs aims to extract elements,
such as metadata [8, 9], URLs [33], etc. [29] proposes an XML schema for ETDs in a digital
library.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[178.5, 934.44, 702.86, 1021.83]" conf="0.82" pg_no="27">2.5 Topic Modeling</name>
				<paragraphs>
					<para bbox="[171.61, 1073.39, 1529.22, 2039.6]" conf="0.94" pg_no="27">Topic modeling has been widely studied in the domain of text mining to discover latent topics.
One of the earliest methods to discover topics in text documents was probabilistic Latent
Semantic Indexing (pLSI) [16]. However, since pLSI is based on the likelihood principle
and does not define a generative process for documents, it cannot assign probabilities to
new documents. This limitation was addressed by Latent Dirichlet Allocation (LDA) [6],
which models each document as a mixture over topics, and each topic as a distribution over words.
With advances in the field of deep learning, neural topic models have gained increasing
interest. Neural Variational Document Model (NVDM) [27] is a neural topic model that uses
an unsupervised generative model based on Variational Autoencoders (VAE) [21]. Several
other topic models that use a VAE-based architecture have been proposed [10, 28, 36].
More recently, pre-trained language models like BERT [20] and RoBERTa [25] have shown
significant performance improvements in many NLP-related tasks due to their ability to learn
contextualized representations of text. Consequently, several topic models that incorporate
the representations from pre-trained language models have been proposed. BERTopic [13]</para>
					<para bbox="[171.57, 282.98, 1521.21, 766.05]" conf="0.89" pg_no="28">uses a clustering-based approach to first cluster documents based on their language model
extracted representations, and then extracts the most representative words, i.e., topics, for
each cluster using a TF-IDF based approach. In this process, however, the topics are not
learnt, and are rather extracted using a post-processing mechanism. Contextualized Topic
Model (CTM) [5] proposed an end-to-end learnable architecture that uses language model
derived representations from Sentence-BERT [31] along with bag-of-words embeddings, in a
VAE-based architecture similar to ProdLDA [36].</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
		</chapter>
		<chapter>
			<title bbox="[168.35, 259.88, 1486.46, 744.62]" conf="0.91" pg_no="29">Chapter 3
Parsing Long PDF Documents Using
Object Detection</title>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[177.81, 876.13, 764.21, 958.39]" conf="0.81" pg_no="29">3.1 Chapter Overview</name>
				<paragraphs>
					<para bbox="[174.03, 1006.2, 1521.22, 1429.59]" conf="0.9" pg_no="29">In this chapter, we propose a set of elements in an ETD that are important for downstream
tasks like searching, browsing, question-answering, etc. We also introduce ETD-OD, a new
object detection dataset that contains over 25K page images originating from 200 ETDs,
consisting of elements that can be important sources of information in an ETD. Finally, we
investigate the performance of various state-of-the-art object detection models for document
layout understanding on ETDs using the proposed dataset.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[181.75, 1551.69, 674.78, 1633.79]" conf="0.82" pg_no="29">3.2 ETD Elements</name>
				<paragraphs>
					<para bbox="[176.96, 1680.97, 1520.91, 2028.76]" conf="0.94" pg_no="29">Historically, ETDs do not conform to a universally accepted format, since different colleges
and universities have their own specific standards and requirements for ETDs. In this section
we discuss the elements that are typically found in ETDs and would be important to extract
for further analysis and downstream tasks. This list was curated after extensive discussions
with digital librarians and researchers. We broadly categorize the different elements of ETDs</para>
					<para bbox="[180.32, 285.2, 1313.38, 368.81]" conf="0.68" pg_no="30">into the following two-level taxonomy, i.e., a set of broad and narrower classes.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.2, 510.66, 530.08, 579.47]" conf="0.78" pg_no="30">3.2.1 Metadata</name>
				<paragraphs>
					<para bbox="[174.94, 628.61, 1527.26, 1276.06]" conf="0.94" pg_no="30">The metadata consists of elements that contain unique identifiable information about an
ETD, including information found on the front page. Key metadata elements are:
• Title: The main title of the document.
• Author: Name of the document author.
• Date: Date (or month/year) when the document was published.
• University: University/institution of the author.
• Committee: Committee that approved the document, e.g., the student’s graduate committee.
• Degree: Degree (e.g., Master of Science, Doctor of Philosophy) being earned.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[182.54, 1411.95, 515.31, 1479.34]" conf="0.77" pg_no="30">3.2.2 Abstract</name>
				<paragraphs>
					<para bbox="[168.78, 1528.16, 1525.33, 2034.59]" conf="0.93" pg_no="30">The abstract is an important element of an ETD, as it contains a summary of the work,
typically about a page long. Its elements include:
• Abstract Heading: Since many ETDs contain multiple abstracts, such as a technical
abstract and general audience abstract, or an abstract in English as well as the original
language, extracting the abstract heading makes it easier to segment, and could be helpful
in categorizing the abstract by audience type.
• Abstract Text: The actual text of the abstract.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[186.08, 287.96, 658.23, 357.25]" conf="0.81" pg_no="31">3.2.3 List of Contents</name>
				<paragraphs>
					<para bbox="[175.14, 389.48, 1527.82, 898.67]" conf="0.93" pg_no="31">The list of contents (also referred to as table of contents) of an ETD determines where
different components are located based on their page numbers. This helps with accurately
mapping the chapters and sections, as well as figures and tables, since they are generally
included in the list of contents. This subcategory includes the following elements:
• List of Contents Heading: This helps identify the specific type of list (e.g., list of
chapters/sections, list of figures, list of tables).
• List of Contents Text: This is the actual list of entries for this type of content.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[179.39, 963.35, 623.26, 1033.95]" conf="0.76" pg_no="31">3.2.4 Main Content</name>
				<paragraphs>
					<para bbox="[159.79, 1074.19, 1532.1, 2043.55]" conf="0.94" pg_no="31">Chapters are one of the most important components of an ETD, as they contain detailed
information about the research described in the document. This subcategory consists of
elements that can typically be found in the chapters of an ETD.
• Chapter Title: The title of the chapter.
• Section: Quite often, chapters themselves can be long. It may be desirable to have
further delimiters such as sectional headers. Hence, we include the section names (along
with other identifiers such as numbers) which can be used for further splitting of the
document.
• Paragraph: The main textual content of the ETD.
• Figure: This includes figures, charts, and other visual illustrations included in the document.
• Figure Caption: The text caption that describes the figure.
• Table: The table element category.
• Table Caption: The text caption that describes the table.</para>
					<para bbox="[159.33, 272.32, 1527.06, 904.54]" conf="0.93" pg_no="32">• Equation: Mathematical equation.
• Equation Number: Quite often, equations are numbered, which can be helpful in
linking them to the list of equations that may be included in the document.
• Algorithm: Algorithms, such as pseudo-code.
• Footnote: We separate footnotes from regular paragraphs, as they typically provide
auxiliary information which might be undesirable in many downstream tasks, such as
summary generation.
• Page Number: Page numbers, which could be helpful in cross-referencing pages and
the objects contained therein to the list of contents.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[181.72, 1030.46, 595.72, 1102.63]" conf="0.83" pg_no="32">3.2.5 Bibliography</name>
				<paragraphs>
					<para bbox="[176.2, 1146.78, 1522.99, 1604.71]" conf="0.94" pg_no="32">We also include bibliographic elements in the list of objects. They are described below:
• Reference Heading: The header that indicates the start of the references list.
• Reference Text: The actual list of references cited in the document.
In our dataset, we regard appendices as chapters, since they contain many elements that are
found in the main chapters. They can, however, be easily differentiated from main chapters
based on the title.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[180.05, 1752.02, 508.89, 1827.24]" conf="0.8" pg_no="32">3.3 Dataset</name>
				<paragraphs>
					<para bbox="[184.93, 1885.05, 1511.07, 2029.68]" conf="0.89" pg_no="32">In this section we introduce ETD-OD, an object detection dataset for layout analysis on
scholarly long documents such as ETDs.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[181.59, 286.29, 644.28, 358.56]" conf="0.8" pg_no="33">3.3.1 Dataset Source</name>
				<paragraphs>
					<para bbox="[176.38, 393.73, 1521.95, 753.44]" conf="0.93" pg_no="33">The ETD-OD dataset consists of 25K page images from 200 theses and dissertations. These
documents were downloaded from publicly accessible institutional repositories, and were
uniformly sampled with regard to degree, domain, and institution. Since object detection
requires images as the input data, the documents were split into page images using the
pdf2image Python library. These images were then used for annotation.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[182.5, 841.25, 566.21, 911.75]" conf="0.81" pg_no="33">3.3.2 Annotation</name>
				<paragraphs>
					<para bbox="[183.73, 952.85, 1514.89, 1229.39]" conf="0.89" pg_no="33">We use Roboflow for annotating the page images in our dataset. Each annotation was done
by one of six undergraduate students, each of whom was a computer science student in
junior year or above. Each data sample was further validated for correctness by two graduate
students.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.17, 1328.09, 690.31, 1400.37]" conf="0.8" pg_no="33">3.3.3 Dataset Statistics</name>
				<paragraphs>
					<para bbox="[176.28, 1438.71, 1524.7, 1914.18]" conf="0.95" pg_no="33">Table 3.1 shows the detailed statistics for different object categories in our dataset. The
dataset consists of ∼25K page images and ∼100K bounding boxes spanning different
object categories. Owing to the variation in the frequency of occurrence of various object
categories in documents, some categories have many more samples than others.
Paragraphs, for example, can be found on most pages, and hence form the dominant
category in our dataset. 80% of the images and their corresponding objects were used for
training, while the remaining 20% were used as the validation set.</para>
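					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The 80/20 partition described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to build ETD-OD: the function name, the fixed seed, the ID format, and the choice to split at the level of individual page images (rather than whole documents) are all hypothetical.

```python
import random


def split_train_val(image_ids, train_fraction=0.8, seed=0):
    """Shuffle page-image IDs and split them into training and validation lists."""
    ids = sorted(image_ids)      # sort first so the result is reproducible
    rng = random.Random(seed)    # local RNG; the seed value is an assumption
    rng.shuffle(ids)
    cut = int(round(train_fraction * len(ids)))
    return ids[:cut], ids[cut:]


# Hypothetical page-image IDs of the form "document-page".
pages = [f"etd{d:03d}-p{p:03d}" for d in range(5) for p in range(20)]
train_ids, val_ids = split_train_val(pages)
print(len(train_ids), len(val_ids))  # 80 20
```

Splitting by page keeps the two sets at the stated proportions, but pages from the same document may then appear in both sets; splitting by document would avoid that at the cost of less exact proportions.</para>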
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[502.48, 471.88, 1178.85, 1627.72]" conf="0.84" pg_no="34">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_34_1L6U.jpg</path>
						<caption bbox="[181.79, 1642.73, 1516.3, 1824.54]" conf="0.79" pg_no="34">Table 3.1: Distribution of different object categories in our dataset.
Note: Some of the documents were accompanied by front matter (metadata) pages that are sometimes generated
by the digital libraries. We include annotations for such documents as well, and hence, the
number of metadata elements does not exactly match the number of documents.</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[183.42, 279.82, 830.8, 361.35]" conf="0.87" pg_no="35">3.4 Proposed Framework</name>
				<paragraphs>
					<para bbox="[180.56, 401.34, 1517.22, 625.71]" conf="0.94" pg_no="35">We now introduce the proposed framework for extracting important elements from an ETD
by means of object detection. The architecture of our framework is illustrated in Figure 3.1.
The different modules shown can broadly be divided into the following three categories.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.83, 696.09, 814.39, 768.82]" conf="0.86" pg_no="35">3.4.1 Data and Preprocessing</name>
				<paragraphs>
					<para bbox="[179.4, 800.54, 1519.3, 1090.75]" conf="0.93" pg_no="35">Since our framework is primarily built for parsing long scholarly documents, it takes the
PDF version of the document as input. The input file is converted to individual page images
(.jpg format) using Python-based PDF libraries such as pdf2image. Next, the page images
are individually fed to the Element Extraction module for further processing.</para>
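					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">A minimal sketch of this preprocessing step is shown below. It assumes that pdf2image (and the Poppler utilities it depends on) is installed. The helper names, the output directory layout, the file naming scheme, and the 200 dpi resolution are illustrative assumptions and are not taken from the framework itself.

```python
from pathlib import Path


def page_image_path(out_dir, doc_id, page_no):
    """Build a zero-padded file name such as out/etd001/page_0003.jpg."""
    return Path(out_dir) / doc_id / f"page_{page_no:04d}.jpg"


def pdf_to_page_images(pdf_path, out_dir, doc_id, dpi=200):
    """Render every page of a PDF to a JPEG file and return the saved paths."""
    # Imported lazily so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    saved = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images
    for page_no, image in enumerate(pages, start=1):
        target = page_image_path(out_dir, doc_id, page_no)
        target.parent.mkdir(parents=True, exist_ok=True)
        image.save(target, "JPEG")
        saved.append(target)
    return saved


print(page_image_path("out", "etd001", 3).as_posix())  # out/etd001/page_0003.jpg
```

Zero-padded page numbers keep the files in page order when sorted by name, which makes it straightforward to pass the pages to the Element Extraction module in sequence.</para>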
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[188.36, 1158.18, 1194.23, 1235.15]" conf="0.84" pg_no="35">3.4.2 Element Extraction using Object Detection</name>
				<paragraphs>
					<para bbox="[177.61, 1271.03, 1526.43, 2040.38]" conf="0.82" pg_no="35">This module forms the backbone of our system. It takes the individual page images as input,
and uses an object detection model such as Faster-RCNN or YOLO for object detection.
These models are first trained on the ETD-OD dataset. The specific details about training
object detection models are included in later sections of this chapter. While using the object
detection models as a part of this module, only inference is performed, and no updates are
made to the model parameters. The output of object detection will be a list of elements,
where each element contains information about the bounding boxes such as the coordinates,
along with the category labels. This process is repeated for all of the pages in the document,
and finally, a list of pages accompanied by their respective elements is populated.
In some instances, the object detected by the model is classified as one belonging to a
different, yet similar category. In such cases, we use certain post-processing rules to correct</para>
					<para bbox="[172.36, 278.05, 1523.21, 701.91]" conf="0.93" pg_no="37">the predictions. For example, abstract heading being mis-classified as chapter heading is one
of the common errors, since both of these elements are often found in a larger font size at the
beginning of a page. This can, however, be corrected by enforcing a constraint such as: a
chapter heading in the first 10 pages with matching keyword “abstract” will be the abstract
heading. We use a set of such rules for different object types to correct mis-classifications.
This component is discussed in detail in Chapter 6.</para>
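					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The abstract-heading rule described above could be expressed roughly as follows. This is a sketch under stated assumptions rather than the rule set used in the framework: the dictionary layout of an element, the label strings "chapter_title" and "abstract_heading", and the function name are hypothetical, and the element text is assumed to be available when the rule is applied.

```python
def correct_abstract_heading(element, page_index, max_front_pages=10):
    """Relabel a chapter heading as an abstract heading when the rule applies."""
    is_front = page_index in range(max_front_pages)   # zero-based page index
    has_keyword = "abstract" in element["text"].lower()
    if element["label"] == "chapter_title" and is_front and has_keyword:
        # Return a corrected copy; the detector output itself is left untouched.
        return {**element, "label": "abstract_heading"}
    return element


# Hypothetical detection: a heading reading "Abstract" labeled as a chapter title.
detected = {"label": "chapter_title", "bbox": [180, 760, 1530, 840], "text": "Abstract"}
print(correct_abstract_heading(detected, page_index=1)["label"])   # abstract_heading
print(correct_abstract_heading(detected, page_index=40)["label"])  # chapter_title
```

Restricting the rule to the first pages limits the risk of relabeling a genuine chapter whose title happens to contain the keyword.</para>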
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[415.61, 466.42, 1214.24, 1767.97]" conf="0.83" pg_no="36">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_36_W7N0.jpg</path>
						<caption bbox="[222.14, 1846.57, 1463.36, 1939.81]" conf="0.7" pg_no="36">Figure 3.1: Architecture of the proposed object detection based parsing framework.</caption>
					</figure>
				</figures>
				<tables/>
			</section>
			<section>
				<name bbox="[184.55, 773.9, 1033.12, 847.46]" conf="0.81" pg_no="37">3.4.3 Post-Processing Extracted Objects</name>
				<paragraphs>
					<para bbox="[173.41, 883.72, 1530.62, 1922.81]" conf="0.92" pg_no="37">After extracting all of the elements for all of the pages in the document, we regard the
objects as broadly belonging to two types. The first type includes image-based objects
such as figures, tables, algorithms, and equations, that need to be stored on the file system
as an image. We regard tables as image-based objects even though they might contain text,
since further extraction of information in structured format from tables is beyond the scope
of this work. The second type of object includes text-based elements such as paragraphs,
titles, etc., which need further processing to be converted to plain text. We regard all object
categories excluding the image-based ones as textual elements.
For converting text-based objects to plain text, we use off-the-shelf tools and libraries. Some
PDF documents are born-digital, where the text can be easily extracted using Python
libraries such as pymupdf based on page ID and bounding box coordinates. For scanned
documents we use optical character recognition (OCR) tools such as pytesseract.
For image-based elements, we record the path of the image that is cropped based on the
coordinates. Figures and tables are mapped to their respective captions based on proximity.
For any figure/table element, the caption object closest to them based on Euclidean distance</para>
					<para bbox="[174.33, 275.7, 1519.5, 570.06]" conf="0.92" pg_no="38">w.r.t. bounding box coordinates is assumed to be the caption. A similar method is followed to
map equations with their equation numbers, with an added constraint that the y-coordinate
of the center of the equation number should fall between min and max y-coordinates of the
equation object.</para>
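					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The proximity-based mapping described above can be sketched as follows. The text does not specify which points of the bounding boxes are compared, so this sketch assumes that distances are measured between box centers; the function names and the [x_min, y_min, x_max, y_max] box format are likewise assumptions made for illustration.

```python
import math


def center(box):
    """Return the (x, y) center of a box given as [x_min, y_min, x_max, y_max]."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def nearest(target_box, candidate_boxes):
    """Index of the candidate whose center is closest to the target's center."""
    if not candidate_boxes:
        return None
    tx, ty = center(target_box)
    distances = [math.hypot(tx - cx, ty - cy) for cx, cy in map(center, candidate_boxes)]
    return distances.index(min(distances))


def match_equation_number(equation_box, number_boxes):
    """Nearest equation number whose center y lies within the equation's vertical extent."""
    _, y_min, _, y_max = equation_box
    allowed = [i for i, b in enumerate(number_boxes) if y_max >= center(b)[1] >= y_min]
    if not allowed:
        return None
    best = nearest(equation_box, [number_boxes[i] for i in allowed])
    return allowed[best]


figure = [200, 300, 1400, 1200]
captions = [[190, 1220, 1500, 1300], [190, 1900, 1500, 1980]]
print(nearest(figure, captions))  # 0

equation = [400, 500, 1200, 600]
numbers = [[1250, 620, 1330, 660], [1400, 530, 1480, 570]]
print(match_equation_number(equation, numbers))  # 1
```

In the equation example, the first candidate is closer to the equation but sits below it, so the vertical constraint rejects it and the number on the same line is selected instead.</para>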
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[179.09, 653.7, 952.81, 740.28]" conf="0.83" pg_no="38">3.5 Object Detection Training</name>
				<paragraphs>
					<para bbox="[153.9, 788.73, 1531.95, 2034.82]" conf="0.93" pg_no="38">We use the ETD-OD dataset introduced in this chapter for training object detection models
for our framework. The models currently supported are:
• Faster-RCNN [32]: Faster-RCNN is an object detection model that has two stages. A
region proposal network generates regions of interest, which are fed to another network
for final detection. We use the version of Faster-RCNN that uses ResNeXt-101 [44] as
the backbone model.
• Faster-RCNN pre-trained on DocBank [24]: Faster-RCNN (with ResNeXt-101 backbone)
is pre-trained on DocBank, and then fine-tuned on ETD-OD. Although DocBank
does not include all of the elements found in ETDs, we hypothesize that the scholarly
nature of documents used in pre-training should help improve the performance over the
vanilla version of the model.
• YOLOv5 [18]: YOLO is a family of single stage object detection models that perform
the processes of localization and detection using a single end-to-end network. This
improves the speed without any significant drop in performance. These models have shown
impressive performance on various datasets [42].
• YOLOv7 [40]: This is the most recent version of YOLO, which has been shown to
outperform many object detection models.
Both of the Faster-RCNN models were trained on our dataset for 60K iterations with an</para>
					<para bbox="[171.23, 276.43, 1522.89, 638.98]" conf="0.91" pg_no="39">inference score threshold of 0.7. The models were based on the implementation included
in the open-source detectron2 [43] framework. For the DocBank-pretrained version of the
model, we used the original set of weights and configurations open-sourced by the authors.
Both of the versions of YOLO were based on the open-source implementations, and were
trained for 150 epochs.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[178.26, 849.51, 840.34, 935.28]" conf="0.84" pg_no="39">3.6 Experimental Results</name>
				<paragraphs>
					<para bbox="[183.29, 998.95, 1471.21, 1093.64]" conf="0.77" pg_no="39">In this section, we discuss the results obtained in the experimental analysis of our work.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[182.42, 1286.38, 718.15, 1361.81]" conf="0.81" pg_no="39">3.6.1 Evaluation Metrics</name>
				<paragraphs>
					<para bbox="[175.59, 1417.26, 1524.51, 2035.74]" conf="0.94" pg_no="39">For the quantitative evaluation of object detection models, the commonly used metrics are
average precision (AP) and mean average precision (mAP). AP is defined as the area under
the precision-recall curve for a specific class. mAP is the average of AP values for all object
classes. Both of these metrics have different versions based on the overlap threshold (also
referred to as Intersection over Union or IoU) used for comparing the predicted object
against ground truth. For example, in mAP@0.5, all of the objects with an IoU
of 50% or more with the ground truth will be regarded as correct predictions. Another
commonly used version of mAP is mAP@0.5-0.95, which is the average mAP over different
thresholds, from 0.5 to 0.95 with step 0.05.</para>
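					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The overlap test underlying these metrics can be illustrated with a short sketch. The function names and the [x_min, y_min, x_max, y_max] box format are assumptions made for illustration. A full AP computation additionally ranks predictions by confidence and matches each ground-truth object at most once, which is omitted here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x_min, y_min, x_max, y_max]."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, truth_box, threshold=0.5):
    """A prediction counts as correct when its IoU meets the threshold."""
    return iou(pred_box, truth_box) >= threshold


# The ten IoU thresholds averaged in mAP@0.5-0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

truth = [0, 0, 100, 100]
pred = [0, 0, 100, 80]
print(iou(pred, truth))                        # 0.8
print(is_correct(pred, truth, threshold=0.5))  # True
print(is_correct(pred, truth, threshold=0.9))  # False
```

The example shows why mAP@0.5-0.95 is the stricter metric: the same prediction is accepted at the lower thresholds but rejected at the higher ones, so loosely localized boxes lower the averaged score.</para>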
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[473.68, 278.51, 1218.35, 547.71]" conf="0.89" pg_no="40">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_40_2J6A.jpg</path>
						<caption bbox="[185.88, 554.75, 1504.5, 695.59]" conf="0.85" pg_no="40">Table 3.2: mAP comparison for object detection models on ETD-OD. Faster-RCNN* represents
the model pre-trained on DocBank and fine-tuned on ETD-OD. Underlined values
indicate best performing models.</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[180.84, 734.24, 1506.81, 890.74]" conf="0.87" pg_no="40">3.6.2 Analysis of Various Object Detection Models Trained on
ETD-OD</name>
				<paragraphs>
					<para bbox="[163.79, 932.45, 1533.24, 2028.33]" conf="0.94" pg_no="40">Table 3.2 shows performance of different object detection models on the validation set of our
dataset. The following observations can be made from the mAP values shown.
• Pre-training on scholarly documents improves model performance: The basic
version of Faster-RCNN without any pre-training on scholarly documents has the lowest
performance among all the models. The same model, after pre-training on DocBank and
then fine-tuning on the ETD dataset, gives much better performance. Since DocBank
also consists of scholarly documents, albeit of a different type, the pre-training process
exposes the model to a diverse dataset, which eventually results in better generalization
and predictive performance.
• YOLO outperforms Faster-RCNN on the ETD dataset: YOLO models belong to the
class of single stage detectors, which are designed with an emphasis on speed. YOLO
typically performs worse than Faster-RCNN in scenarios where the objects are smaller
or multiple objects are close to each other. However, in the case of documents, most
objects are typically large and have minimal overlap with each other due to white
spaces and line breaks around objects (such as between a header and paragraph). Hence,
YOLO outperforms Faster-RCNN on the ETD dataset.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.09, 282.07, 1500.88, 437.18]" conf="0.66" pg_no="41">3.6.3 Analysis of Detection Performance on Different Object Categories</name>
				<paragraphs>
					<para bbox="[171.06, 1201.69, 1526.66, 1839.25]" conf="0.93" pg_no="41">In Table 3.3, we show the performance of the best performing model (YOLOv7) on various
object categories in our dataset. The lower performance of certain categories can generally
be attributed to two reasons:
• Limited Number of Training Samples: Elements such as degree, date, and algorithm
have very few instances in our dataset. As such, the performance on these classes is lower
than on others.
• Smaller Object Sizes: Elements such as page number and equation number tend to
be smaller than other elements. Since object detection models tend to
struggle with localization of smaller objects, performance on such classes is impacted.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[435.41, 465.13, 1259.78, 1050.87]" conf="0.84" pg_no="41">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_41_DP3Q.jpg</path>
						<caption bbox="[188.54, 1059.76, 1502.65, 1160.2]" conf="0.84" pg_no="41">Table 3.3: AP@0.5 values for different object categories for YOLOv7 (Abs. = Abstract,
LOC = List of Contents).</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[184.71, 1897.37, 1406.54, 1981.09]" conf="0.78" pg_no="41">3.6.4 Comparison against Other Layout Detection Datasets</name>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[196.54, 1209.1, 1498.62, 1899.35]" conf="0.83" pg_no="42">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_42_M02N.jpg</path>
						<caption bbox="[205.53, 1900.94, 1492.9, 1979.31]" conf="0.61" pg_no="42">Figure 3.2: Examples of outputs generated by the Faster-RCNN* and YOLOv7 models.</caption>
					</figure>
				</figures>
				<tables>
					<table>
						<path bbox="[420.08, 338.14, 1297.8, 992.53]" conf="0.78" pg_no="42">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_42_MJG2.jpg</path>
						<caption bbox="[185.76, 1003.92, 1509.39, 1145.93]" conf="0.83" pg_no="42">Table 3.4: AP@0.5 values for categories supported by DocBank using Faster-RCNN trained
on different datasets and evaluated on the validation set of ETD-OD. For Caption, we list
the Figure Caption / Table Caption values for models trained on ETD-OD.</caption>
					</table>
				</tables>
			</section>
		</chapter>
		<chapter>
			<title bbox="[179.6, 269.46, 1363.37, 744.1]" conf="0.91" pg_no="43">Chapter 4
Augmentation-Based Training for
Layout Analysis Models</title>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[174.91, 873.98, 769.42, 963.71]" conf="0.78" pg_no="43">4.1 Chapter Overview</name>
				<paragraphs>
					<para bbox="[170.06, 1281.37, 1518.01, 2010.39]" conf="0.83" pg_no="43">In Chapter 3, we introduced a dataset that can be used to train object detection models
to extract scholarly elements from ETDs. While having high quality manually annotated
datasets is an ideal method to train supervised machine learning methods, the high costs of
manual annotation often restrict researchers from getting access to large datasets. Hence,
there is a need to develop methods that can exploit the limited amount of manually annotated
datasets to the highest capacity. One such method is data augmentation, which augments
the existing training data curated for object detection training, by applying one or more
augmentation steps to each training image, while utilizing the annotations of the original
image. In this chapter, we explain an augmentation-based approach for training
object detection models. We used this approach to train layout analysis models for ETDs, and
experimental results show that augmentation-based training yields better performing models.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[176.26, 276.89, 828.38, 364.28]" conf="0.84" pg_no="44">4.2 Image Augmentation</name>
				<paragraphs>
					<para bbox="[172.89, 419.85, 1527.65, 1318.09]" conf="0.92" pg_no="44">We start by introducing data augmentation for images. We are given a set of $N$ images
$I = \{i_1, \ldots, i_N\}$ and annotations $\{b_1, \ldots, b_N\}$, where $b_k$ denotes the set of bounding box
coordinates and the corresponding labels associated with image $i_k$. We also consider a set of
image transformation functions $F = \{f_1, \ldots, f_M\}$ and number of augmentation steps $m &lt; M$.
For each image $i_k$, our image augmentation process first samples $m$ transformation
functions from $F$. Each of these transformations is iteratively applied on the image to
generate an augmented version of the image, $\hat{i}_k$. While many different types of augmentations
have been proposed for images, for our setting we limit it to techniques that do not modify
the underlying size or orientation of the image, but rather modify the visual aspects of the
image. The derived image can thus use the annotation $b_k$ of the source image without any
modifications. This process can be repeated multiple times, each time with a different value
of $m$ and the corresponding sample of augmentation steps, to generate multiple augmented
versions of an image.</para>
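					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The sampling procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions, not our implementation: images are represented as nested lists of grayscale values, the three toy transformations (invert, brighten, binarize) are hypothetical stand-ins for the members of F, and sampling without replacement is an assumption. The key property shown is that the annotation of the source image is returned unchanged, since none of the transformations alters the image size or orientation.

```python
import random
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]  # grayscale rows, values 0..255 (assumed representation)
Transform = Callable[[Image], Image]


def invert(img: Image) -> Image:
    """Hypothetical transform: flip dark and light pixels."""
    return [[255 - p for p in row] for row in img]


def brighten(img: Image) -> Image:
    """Hypothetical transform: add a fixed offset, clipped at 255."""
    return [[min(255, p + 40) for p in row] for row in img]


def binarize(img: Image) -> Image:
    """Hypothetical transform: map every pixel to pure black or white."""
    return [[255 if p >= 128 else 0 for p in row] for row in img]


def augment(image: Image, boxes: list, transforms: Sequence[Transform],
            m: int, rng: random.Random) -> Tuple[Image, list]:
    """Sample m functions from F, apply them in turn, and reuse b_k as-is."""
    if m >= len(transforms) or 1 > m:
        raise ValueError("m must be at least 1 and smaller than M")
    chosen = rng.sample(list(transforms), m)  # without replacement (assumption)
    out = image
    for f in chosen:
        out = f(out)  # each transform returns a new image of the same size
    return out, boxes
```

Calling augment several times with different values of m yields multiple augmented versions of the same page, all sharing the original bounding boxes.</para>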
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.26, 1476.54, 1096.78, 1564.04]" conf="0.84" pg_no="44">4.3 Types of Image Transformations</name>
				<paragraphs>
					<para bbox="[174.35, 1613.53, 1522.77, 2036.75]" conf="0.94" pg_no="44">The different types of image transformations that we use to generate the augmented dataset are
discussed below. An example of a page along with each of its augmented versions is shown
in Fig. 4.1. While the example here shows the versions generated by applying each of the
image transformations individually, in practice, we apply a series of augmentation steps to
generate harder samples. The augmented images thus generated are more likely to match
real world distortions that can be found in scholarly documents.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.57, 285.9, 825.51, 359.14]" conf="0.85" pg_no="45">4.3.1 Brightness and Contrast</name>
				<paragraphs>
					<para bbox="[184.82, 402.29, 1509.29, 680.35]" conf="0.92" pg_no="45">This step modifies the brightness and contrast of the original image. Since scholarly
documents often contain multiple figures and tables, each with a varied range of colors,
and can often be scanned, we hypothesize that training models on images of varying brightness
and contrast can be helpful.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.78, 806.83, 489.22, 871.45]" conf="0.8" pg_no="45">4.3.2 Erosion</name>
				<paragraphs>
					<para bbox="[185.2, 916.05, 1516.2, 1134.5]" conf="0.93" pg_no="45">Many academic documents, especially the scanned ones, often contain eroded text, i.e., text
with broken boundaries. Due to erosion, the elements lose their clarity. This transformation
can allow models to better adapt to such examples.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.21, 1255.35, 500.53, 1321.97]" conf="0.81" pg_no="45">4.3.3 Dilation</name>
				<paragraphs>
					<para bbox="[183.23, 1365.55, 1520.3, 1587.74]" conf="0.92" pg_no="45">As with erosion, scanned documents may often contain dilated text resulting from the
process of scanning. Dilation happens when an element expands, resulting in some objects being
merged. To perform well on such cases, training on dilated versions can be helpful.</para>
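					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">To make the effect of erosion and dilation concrete, the sketch below applies both to a small binary image using a 3 by 3 neighborhood. It assumes that text pixels are stored as 1 and background as 0, and that pixels outside the page are ignored; the function names are hypothetical and this is not the augmentation code used in our experiments. Under this convention, erosion thins strokes (a pixel survives only if its whole neighborhood is text), whereas dilation thickens them (a pixel becomes text if any neighbor is text), which is how nearby objects end up merged.

```python
from typing import List

Binary = List[List[int]]  # 1 = text pixel, 0 = background (assumed convention)


def _window(img: Binary, y: int, x: int) -> List[int]:
    """Values in the 3x3 neighborhood of (y, x), clipped to the image."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))]


def erode(img: Binary) -> Binary:
    """Keep a text pixel only if every in-bounds neighbor is text."""
    return [[int(all(_window(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def dilate(img: Binary) -> Binary:
    """Turn a pixel into text if any in-bounds neighbor is text."""
    return [[int(any(_window(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]
```

Neither operation changes the image dimensions, so the bounding boxes of the source page remain valid for the transformed page.</para>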
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.27, 1706.2, 496.12, 1771.79]" conf="0.79" pg_no="45">4.3.4 Borders</name>
				<paragraphs>
					<para bbox="[185.33, 1816.06, 1513.91, 2030.24]" conf="0.92" pg_no="45">Many documents, when scanned, can contain borders resulting from the edges of binding. To
allow object detection models to be able to identify such noise, training on border-augmented
images can be helpful.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[182.47, 288.56, 549.54, 357.47]" conf="0.85" pg_no="46">4.3.5 Downscale</name>
				<paragraphs>
					<para bbox="[187.49, 395.0, 1510.29, 541.57]" conf="0.9" pg_no="46">Downscaling reduces the number of pixels in an image, thus reducing the sharpness of each
object in the image.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[182.64, 620.62, 431.67, 686.97]" conf="0.81" pg_no="46">4.3.6 Blur</name>
				<paragraphs>
					<para bbox="[187.39, 727.53, 1506.01, 872.78]" conf="0.9" pg_no="46">Documents have a wide range of variance in terms of resolution. Training on blurred images
can allow models to become more robust to such variance.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.67, 948.7, 787.24, 1023.98]" conf="0.86" pg_no="46">4.3.7 Salt and Pepper Noise</name>
				<paragraphs>
					<para bbox="[179.81, 1052.0, 1516.47, 1272.8]" conf="0.91" pg_no="46">Noisy patches such as those resembling small dots of white/black colors like salt/pepper
sprinkles are common in the case of scanned documents. This augmentation can be helpful
to deal with such samples.</para>
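					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">A minimal sketch of this kind of noise is given below, assuming a grayscale page stored as nested lists with values from 0 to 255. The function name, the amount parameter (the fraction of pixels to corrupt), and the equal split between white and black dots are assumptions made for illustration rather than the settings used in our experiments.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale rows, values 0..255 (assumed representation)


def salt_and_pepper(image: Image, amount: float, rng: random.Random) -> Image:
    """Set a random fraction of pixels to white (salt) or black (pepper)."""
    out = [row[:] for row in image]  # copy so the source page is untouched
    h, w = len(out), len(out[0])
    n = round(amount * h * w)        # number of pixels to corrupt
    for idx in rng.sample(range(h * w), n):
        y, x = divmod(idx, w)
        out[y][x] = 255 if rng.random() >= 0.5 else 0
    return out
```

Because only pixel values change, the page size and all bounding boxes stay the same.</para>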
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.43, 1348.38, 628.67, 1417.88]" conf="0.83" pg_no="46">4.3.8 Random Lines</name>
				<paragraphs>
					<para bbox="[185.28, 1455.91, 1517.28, 1666.06]" conf="0.93" pg_no="46">Another type of noise that is common in scanned documents is jagged lines, which are a
result of the scanning process. To allow layout analysis on such documents, we include this
augmentation.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[180.14, 1762.15, 502.98, 1837.14]" conf="0.79" pg_no="46">4.4 Results</name>
				<paragraphs>
					<para bbox="[182.7, 1885.1, 1513.85, 2036.49]" conf="0.9" pg_no="46">In this section, we discuss the experimental results obtained in our evaluation. We focus
our evaluation on two aspects, as discussed below. For each setting below, we use the</para>
					<para bbox="[175.97, 276.14, 1518.71, 569.27]" conf="0.91" pg_no="47">ETD-OD dataset introduced in Chapter 3 as the original dataset. For each of the images
in the training set, we generate 2 augmented versions, by applying up to 3 augmentation
functions per augmented image. The number and type of augmentation functions are sampled
individually for each generated image.</para>
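					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The setup described above can be illustrated with a short planning sketch: for every training image, two augmented versions are planned, each with its own independently sampled number and type of augmentation functions. The transformation names mirror the section titles of this chapter, but the function name, the seed, the use of at least one function per version, and sampling without replacement are assumptions; this is not the script used in our experiments.

```python
import random
from typing import List, Tuple

# Names mirror Sections 4.3.1 to 4.3.8; the identifiers themselves are hypothetical.
TRANSFORMS = ["brightness_contrast", "erosion", "dilation", "borders",
              "downscale", "blur", "salt_and_pepper", "random_lines"]

Plan = Tuple[str, int, List[str]]  # (image id, version index, ordered steps)


def plan_augmentations(image_ids: List[str], versions: int = 2,
                       max_steps: int = 3, seed: int = 0) -> List[Plan]:
    """Sample, per generated image, how many and which functions to apply."""
    rng = random.Random(seed)
    plan = []
    for image_id in image_ids:
        for version in range(versions):
            n_steps = rng.randint(1, max_steps)      # up to 3 functions
            steps = rng.sample(TRANSFORMS, n_steps)  # sampled per image
            plan.append((image_id, version, steps))
    return plan
```

With two planned versions per image, the originals plus the augmented images amount to three times the original training set.</para>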
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[182.02, 638.53, 486.81, 706.76]" conf="0.77" pg_no="47">4.4.1 Models</name>
				<paragraphs>
					<para bbox="[178.61, 744.39, 1529.31, 1509.79]" conf="0.95" pg_no="47">We use the following models for our experimental evaluation. Each setting uses YOLOv7
[40] as the object detection model, as this was the best performing object detection model
on ETD-OD, as discussed in Chapter 3.
• YOLOv7base: This is the version of YOLOv7 trained on the original object detection
dataset. This model serves as the baseline model that has been trained without using
any augmented data.
• YOLOv7aug: This is the version of YOLOv7 trained on the original object detection
dataset, along with the derived data consisting of 2 augmented versions per image.
Due to the inclusion of the augmented dataset in training, the training dataset size
becomes 3× the size of the dataset used for YOLOv7base. This is the model being evaluated for
augmentation-based training.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.71, 1578.14, 1012.83, 1653.21]" conf="0.86" pg_no="47">4.4.2 Layout Detection of Digital ETDs</name>
				<paragraphs>
					<para bbox="[181.93, 1687.31, 1515.23, 2029.09]" conf="0.91" pg_no="47">In this experiment, we want to determine if co-training on augmented images derived from
digital ETDs along with original images can improve the performance of layout analysis
on digital ETDs. Hence, we use the test split of ETD-OD as the evaluation dataset. We
evaluate the performance of each of the two models, i.e., YOLOv7base and YOLOv7aug. These
results are shown in Table 4.1.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[435.16, 280.54, 1254.93, 467.6]" conf="0.9" pg_no="48">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_48_FVCN.jpg</path>
						<caption bbox="[194.98, 475.15, 1503.63, 574.64]" conf="0.84" pg_no="48">Table 4.1: mAP scores of two different versions of YOLOv7 on test set consisting of digital
ETDs.</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[185.85, 607.35, 1042.34, 677.85]" conf="0.87" pg_no="48">4.4.3 Layout Detection of Scanned ETDs</name>
				<paragraphs>
					<para bbox="[178.11, 707.83, 1515.58, 979.54]" conf="0.93" pg_no="48">In this experiment, we evaluate if the augmentation-based training can be helpful in the
layout analysis of scanned ETDs. Hence, we use the test split of the scanned images from
ETD-ODv2 (introduced in Chapter 5) as the evaluation dataset. The result for each of the
two models is shown in Table 4.2.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[432.43, 986.54, 1257.05, 1159.4]" conf="0.9" pg_no="48">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_48_W7JY.jpg</path>
						<caption bbox="[191.92, 1172.21, 1499.21, 1269.95]" conf="0.85" pg_no="48">Table 4.2: mAP scores of two different versions of YOLOv7 on test set consisting of scanned
ETDs.</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[181.93, 1382.55, 509.71, 1451.35]" conf="0.77" pg_no="48">4.4.4 Analysis</name>
				<paragraphs>
					<para bbox="[184.44, 1495.53, 1518.96, 2033.51]" conf="0.85" pg_no="48">Based on the results shown in Tables 4.1 and 4.2, it can be observed that YOLOv7aug, i.e.,
the model trained on the augmented dataset alongside the original dataset, outperforms the baseline
model in both of the settings. The performance improvement on digital ETDs is marginal,
which can be attributed to the fact that the test set only consists of clean page images
with limited distortions. Thus, the improved prediction capability of the model does not
get tested in this setting. However, there is a significant performance improvement when
tested on page images from scanned documents. Since scanned documents are more likely to
contain distortions, obtaining good predictive performance requires the model to be robust</para>
					<para bbox="[169.2, 276.25, 1522.49, 507.08]" conf="0.82" pg_no="49">to such distortions. The model trained on augmented images is more likely to be robust
to such distortions, which can be seen from the better performance of YOLOv7aug over the
baseline model.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[152.23, 285.88, 1531.72, 2065.65]" conf="0.37" pg_no="50">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_50_D2NZ.jpg</path>
						<caption bbox="[389.79, 2101.02, 1320.84, 2166.14]" conf="0.69" pg_no="50">Figure 4.1: An example of a page with its augmented versions.</caption>
					</figure>
				</figures>
				<tables/>
			</section>
		</chapter>
		<chapter>
			<title bbox="[179.35, 267.4, 1468.09, 743.4]" conf="0.92" pg_no="51">Chapter 5
AI-Aided Annotation for Developing
Layout Analysis Datasets</title>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[177.36, 874.78, 767.11, 960.43]" conf="0.8" pg_no="51">5.1 Chapter Overview</name>
				<paragraphs>
					<para bbox="[171.72, 1020.77, 1526.48, 2023.69]" conf="0.94" pg_no="51">An important aspect of object detection-based methods is that they often require a huge
amount of labeled training data. For digital documents, especially those written in LaTeX,
it is often possible to obtain annotations using rule-based automatic annotation methods
[24]. However, in the case of scanned documents, as well as digital documents without
accompanying LaTeX source code, annotating data is a cumbersome process that requires
a great amount of manual effort. In the case of ETDs, many documents present in digital
libraries, especially the older ones, tend to be scanned documents that were written
using legacy text editing software or with a typewriter. These documents were then microfilmed
and/or scanned and converted to PDF. Consequently, these documents contain a
large amount of noise that was introduced during the PDF conversion process, as shown in
Figure 5.1. Furthermore, given that these documents were prepared using legacy methods,
they differ significantly from newer documents, such as digital ETDs, in terms of layout and
structure. Additionally, some of the elements, such as metadata elements like ETD title and
author name, can only be found on a few pages, while others, such as a paragraph, can be
found on many pages in a document. As such, the distribution of different object categories</para>
					<para bbox="[179.45, 283.73, 1508.09, 426.16]" conf="0.81" pg_no="52">in the training data varies. This also affects the performance of object detection models in
classes with a limited number of training instances.</para>
					<para bbox="[169.18, 1488.59, 1533.14, 2047.1]" conf="0.83" pg_no="52">In this chapter, we propose an AI-aided annotation framework to minimize the amount of
resources such as annotation time associated with developing training datasets for layout
analysis. Our proposed framework utilizes the predictive capabilities of models trained on
existing datasets to assist human annotators. As illustrated in Figure 5.2, although the
annotations generated by the model might not all be correct, many of them are correct.
Having humans only enter annotation corrections can reduce the number of instances that
need to be manually labeled. This significantly speeds up the annotation process, without
compromising the quality of the generated dataset. It also helps to address the problem of</para>
					<para bbox="[169.8, 278.65, 1523.35, 844.08]" conf="0.83" pg_no="53">class imbalance in object detection datasets, by guiding annotators to selectively label images,
e.g., those that are more likely to contain elements from a predefined set. Experimental
results show that our proposed annotation scheme significantly reduces the annotation time
and class imbalance, thus resulting in models with improved performance across the set of
object classes. We also introduce ETD-ODv2, a new dataset for object detection-based layout
analysis of long documents such as theses and dissertations. ETD-ODv2 supplements
the page images included in ETD-OD, adding 20K page images originating from scanned
theses and dissertations. It also adds annotations for page images that are likely to</para>
					<para bbox="[174.15, 939.59, 1514.53, 1473.78]" conf="0.81" pg_no="54">contain low-frequency elements, such as document title and algorithm, since they can only be
found on selected pages of a document, or in documents from specific domains (e.g., equations
in a physics work). These pages were sourced from a large corpus consisting of both
scanned and digital documents, making them helpful for mitigating the class imbalance in
existing datasets as well. ETD-ODv2 thus addresses the limitations of existing datasets for
ETD layout analysis, whose scope is limited to digital documents only and which suffer from a
class imbalance problem. Our experimental results show that models trained on our newly
annotated dataset perform much better than those trained on other datasets.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[192.45, 441.15, 1512.04, 1381.51]" conf="0.91" pg_no="52">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_52_ECIE.jpg</path>
						<caption bbox="[429.73, 1390.92, 1274.19, 1454.34]" conf="0.73" pg_no="52">Figure 5.1: Examples of pages from scanned documents.</caption>
					</figure>
					<figure>
						<path bbox="[183.31, 908.41, 1502.9, 1860.75]" conf="0.94" pg_no="53">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_53_67QK.jpg</path>
						<caption bbox="[185.11, 1870.75, 1512.09, 2014.39]" conf="0.84" pg_no="53">Figure 5.2: An illustration showing a page from a scanned document, the annotations generated
by an object detection model trained on a small dataset, and the final annotations
after correction by a human annotator.</caption>
					</figure>
					<figure>
						<path bbox="[199.05, 292.19, 1507.43, 830.25]" conf="0.87" pg_no="54">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_54_654P.jpg</path>
						<caption bbox="[311.38, 835.87, 1399.29, 907.03]" conf="0.69" pg_no="54">Figure 5.3: Architecture of the proposed AI-aided annotation framework.</caption>
					</figure>
				</figures>
				<tables/>
			</section>
			<section>
				<name bbox="[187.4, 1557.36, 1251.39, 1645.38]" conf="0.84" pg_no="54">5.2 Proposed AI-aided Annotation Scheme</name>
				<paragraphs>
					<para bbox="[179.67, 1685.33, 1516.63, 2033.82]" conf="0.91" pg_no="54">Due to the resource-intensive nature of the dataset annotation process, labeled data for
training supervised machine learning models are always scarce. However, unlabeled data are
generally available in abundance. This is also the case with document layout analysis, where
getting high-quality annotations for documents and their respective pages is not easy. However,
given the numerous documents that exist on the Internet and in digital libraries, many</para>
					<para bbox="[171.02, 276.55, 1525.78, 900.56]" conf="0.93" pg_no="55">unlabeled scholarly documents are publicly available. Although labeling document page images
is a cumbersome task, we hypothesize that models trained on existing datasets can be
used to assist human annotators in the labeling process, thus reducing the time required to
annotate training datasets. These models can be used to generate weak labels for the huge
corpus of unlabeled ETDs, which can then be filtered, validated, and corrected by human
annotators. Based on this assumption, in this section, we propose an AI-aided annotation
framework for developing datasets to train supervised object detection models. Figure 5.3
gives an overview of our proposed framework. The key components of this framework are
discussed in detail below.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.26, 1059.77, 691.14, 1131.89]" conf="0.83" pg_no="55">5.2.1 Dataset Sampling</name>
				<paragraphs>
					<para bbox="[180.45, 1184.02, 1511.61, 1467.33]" conf="0.89" pg_no="55">We use a large corpus of unlabeled ETDs, sourced from multiple open access digital libraries.
We first sample a set of documents from this unlabeled corpus that can be used for AI-aided
annotation. Each of these documents is then split into page images, since object detection
models require images as input.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[185.07, 1625.25, 1106.12, 1701.59]" conf="0.86" pg_no="55">5.2.2 Weak Labels Using Pre-Trained Model</name>
				<paragraphs>
					<para bbox="[181.15, 1748.34, 1520.24, 2038.06]" conf="0.93" pg_no="55">Once we have a set of documents as well as their respective page images, they are sent to an
object detection model such as YOLO [40] or Faster-RCNN [32] that has been pre-trained
on an existing labeled dataset, such as ETD-OD [1]. The labels thus inferred for each image
serve as weak annotations for further processing and manual verification/annotation.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[190.32, 282.39, 1228.66, 362.88]" conf="0.85" pg_no="56">5.2.3 Optional Filtering for Specific Object Classes</name>
				<paragraphs>
					<para bbox="[178.93, 398.52, 1513.44, 936.5]" conf="0.91" pg_no="56">In some cases, such as in the case of academic documents like theses and dissertations,
labeling the entire set of pages found in the sampled documents could result in a highly
unbalanced dataset. In such cases, it might be desirable to use weak labels to identify
images containing a pre-defined set of object categories. We refer to these object categories
as objects of interest. These categories include minority classes, such as those containing
very few instances in the labeled dataset, or those that have lower performance as compared
to other categories. This could enable researchers to produce datasets with balanced class
distributions.</para>
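					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The optional filtering step can be sketched as a simple selection over the weak labels. In this illustrative sketch, each page maps to a list of predicted (class, confidence) pairs, and a page is kept if at least one prediction belongs to the objects of interest. The data layout, the function name, and the confidence cut-off min_conf are assumptions introduced for the example, not details of our implementation.

```python
from typing import Dict, List, Set, Tuple

Detection = Tuple[str, float]  # (predicted class, confidence); assumed shape


def filter_pages(weak_labels: Dict[str, List[Detection]],
                 objects_of_interest: Set[str],
                 min_conf: float = 0.25) -> List[str]:
    """Return ids of pages whose weak labels include an object of interest."""
    selected = []
    for page_id, detections in weak_labels.items():
        if any(label in objects_of_interest and conf >= min_conf
               for label, conf in detections):
            selected.append(page_id)
    return selected
```

For example, with objects of interest such as algorithm and equation, a page whose weak labels contain only paragraphs would be skipped, while a page with a predicted algorithm block would be forwarded to the annotators.</para>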
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.76, 1004.8, 1046.13, 1080.71]" conf="0.84" pg_no="56">5.2.4 Manual Verification and Correction</name>
				<paragraphs>
					<para bbox="[177.62, 1111.6, 1524.05, 1681.17]" conf="0.95" pg_no="56">The filtered set of pages, along with their predicted bounding boxes and their respective
labels, is then verified by human annotators for correctness. For page images with correctly
predicted objects, no changes are made and the respective page is added to the verified
dataset. For page images with incorrect predictions, whether in terms of missing or incorrect
labels, the correct bounding boxes are drawn by human annotators before being added to
the verified dataset.
The new dataset can then be used to fine-tune existing pre-trained models or in combination
with existing datasets for model training.</para>
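					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The verification step can be summarized by the following sketch, in which a reviewer callback stands in for the human annotator: it returns None when the predicted boxes are accepted as they are, or a corrected list of boxes otherwise. The callback interface, the type names, and the returned correction count are assumptions made for illustration and do not describe the annotation tool we used.

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[str, Tuple[float, float, float, float]]  # (label, (x1, y1, x2, y2))
Reviewer = Callable[[str, List[Box]], Optional[List[Box]]]


def verify(pages: Dict[str, List[Box]],
           review: Reviewer) -> Tuple[Dict[str, List[Box]], int]:
    """Build the verified dataset and count pages that needed correction."""
    verified = {}
    corrected = 0
    for page_id, predicted in pages.items():
        fixed = review(page_id, predicted)
        if fixed is None:
            verified[page_id] = predicted  # predictions accepted unchanged
        else:
            verified[page_id] = fixed      # boxes redrawn by the annotator
            corrected += 1
    return verified, corrected
```

The ratio of corrected pages to all reviewed pages gives a rough indication of how much manual drawing the weak labels saved.</para>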
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[179.24, 1757.32, 803.3, 1843.57]" conf="0.83" pg_no="56">5.3 ETD-ODv2 Dataset</name>
				<paragraphs>
					<para bbox="[181.02, 1883.91, 1513.86, 2037.24]" conf="0.87" pg_no="56">In this section, we introduce ETD-ODv2, a new dataset for layout analysis of electronic
theses and dissertations. Although existing datasets like ETD-OD [1] can be helpful in</para>
					<para bbox="[184.52, 1183.07, 1515.68, 1321.2]" conf="0.88" pg_no="57">layout extraction from digital documents, they suffer from a class imbalance problem and
do not contain scanned documents.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[181.94, 289.55, 1510.93, 1074.9]" conf="0.92" pg_no="57">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_57_6A5T.jpg</path>
						<caption bbox="[544.57, 1082.52, 1155.28, 1136.54]" conf="0.8" pg_no="57">Table 5.1: ETD-ODv2 dataset statistics.</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[182.72, 1421.5, 742.55, 1493.41]" conf="0.81" pg_no="57">5.3.1 Scanned Documents</name>
				<paragraphs>
					<para bbox="[173.48, 1532.64, 1526.77, 2038.27]" conf="0.94" pg_no="57">There are several attributes related to scanned documents that are not found in digital
documents. These include the following.
• Noisy patches: A common observation found in scanned documents is that a large
number of pages contain noisy patches that result from the process of converting such
documents into an electronically readable PDF file.
• Low resolution: Given that these documents are essentially images of hard-copy versions
of the original document, they tend to have relatively low resolution.</para>
					<para bbox="[161.71, 272.86, 1527.76, 988.99]" conf="0.94" pg_no="58">• Dilated or eroded text: Another common observation regarding many scanned documents
is that the text is eroded (i.e., has a thinner font than the original document) or
dilated. This can also be attributed to the PDF conversion process.
• Handwritten elements: Some of the pages of scanned documents contain elements –
such as tables, figures, and equations – that were written or drawn by hand and were not
typed or created using software.
Due to the presence of such attributes, object detection models trained on the digital documents
dataset generally do not perform well on scanned documents. Hence, our new dataset
includes manually annotated page images from scanned documents, to support layout analysis
on scanned documents.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[187.31, 1102.82, 1072.22, 1178.14]" conf="0.83" pg_no="58">5.3.2 Page Images with Minority Elements</name>
				<paragraphs>
					<para bbox="[176.11, 1219.83, 1524.78, 2031.89]" conf="0.95" pg_no="58">While it is desirable to have images of pages from scanned documents, this does not prevent
the dataset from being subject to a class imbalance problem. This is because some elements
– such as document title and author name – typically only appear on a small set of pages in
the document, such as the front page. Therefore, a dataset constructed by labeling all pages
appearing in a document will always be prone to the class imbalance problem. Moreover,
some element classes such as algorithm might only appear in documents in certain domains,
such as computer science. Hence, a set of documents uniformly sampled from several different
domains will have few pages with such instances. To alleviate this problem, we use the
proposed AI-aided annotation method to identify/filter and annotate pages that are more
likely to contain such minority elements. These page images were sourced from both digital
and scanned documents. The elements that we consider to be minority elements are listed
below.</para>
					<para bbox="[173.21, 271.86, 1523.25, 570.91]" conf="0.91" pg_no="59">• Elements found on a limited number of pages: Title, Author, Date, University,
Committee, Degree, Abstract Text, List of Contents Heading.
• Elements found in documents from select disciplines: Equation, Equation Number,
Algorithm, Reference Heading.</para>
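					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The filtering step described above can be sketched as follows. This is a minimal illustration rather than the framework's actual code: the function name select_minority_pages, the layout of the predictions dictionary, the label strings, and the min_conf cut-off are all assumptions made for the example. Only the two groups of minority elements come from the lists above.

```python
# Hypothetical label strings for the minority elements listed above.
MINORITY_LABELS = {
    "title", "author", "date", "university", "committee", "degree",
    "abstract_text", "list_of_contents_heading",
    "equation", "equation_number", "algorithm", "reference_heading",
}


def select_minority_pages(predictions, min_conf=0.5):
    """Return ids of pages whose weak labels include a minority element.

    predictions maps a page id to a list of (label, confidence) pairs
    produced by an existing detector (the weak labels).
    """
    selected = []
    for page_id, boxes in predictions.items():
        # Keep only labels the detector is reasonably confident about.
        labels = {label for label, conf in boxes if conf >= min_conf}
        if labels.intersection(MINORITY_LABELS):
            selected.append(page_id)
    return sorted(selected)


predictions = {
    "etd1_p000": [("title", 0.91), ("author", 0.88)],
    "etd1_p042": [("paragraph", 0.97), ("figure", 0.80)],
    "etd2_p017": [("paragraph", 0.95), ("algorithm", 0.62)],
    "etd2_p018": [("equation", 0.30)],
}
print(select_minority_pages(predictions))  # ['etd1_p000', 'etd2_p017']
```

Pages selected this way would still go to human annotators for review and correction, since the weak labels only indicate where minority elements are likely to occur.</para>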
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[185.37, 676.88, 1031.97, 752.53]" conf="0.84" pg_no="59">5.3.3 Dataset Source and Object Classes</name>
				<paragraphs>
					<para bbox="[180.97, 788.5, 1520.47, 1082.61]" conf="0.89" pg_no="59">To ensure compatibility with existing datasets, we use the object categories defined in
ETD-OD for annotation. The documents in both subsets of our dataset (i.e., the scanned and
AI-aided) were sourced from a uniformly sampled set of theses and dissertations from open
access institutional repositories of U.S. origin [39].</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[176.16, 1185.12, 697.7, 1260.34]" conf="0.78" pg_no="59">5.3.4 Dataset Statistics</name>
				<paragraphs>
					<para bbox="[184.93, 1299.24, 1400.16, 1389.39]" conf="0.73" pg_no="59">Table 5.1 shows the detailed statistics of different object categories in our dataset.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[164.48, 1472.1, 578.48, 1559.31]" conf="0.58" pg_no="59">Scanned Documents</name>
				<paragraphs>
					<para bbox="[181.22, 1592.72, 1524.62, 1935.77]" conf="0.76" pg_no="59">The subset of scanned documents in our dataset consists of images and bounding box
annotations of ∼16K pages, derived from 100 theses and dissertations. These documents were
annotated by a group of five undergraduate students [49]. To ensure correctness, each
sample also went through another round of review by one of the authors of [2]. We use
Roboflow as the dataset annotation platform.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[166.17, 281.0, 813.18, 364.83]" conf="0.61" pg_no="60">Pages with low-frequency elements</name>
				<paragraphs>
					<para bbox="[177.56, 386.11, 1522.73, 683.43]" conf="0.88" pg_no="60">Our dataset also consists of ∼20K page images from ∼1,200 documents that were annotated
using our proposed AI-aided annotation framework. The pages were then filtered based on
the labels listed above and reviewed and corrected as needed by a group of four annotators
[11].</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[174.41, 772.6, 633.38, 856.27]" conf="0.78" pg_no="60">5.4 Experiments</name>
				<paragraphs>
					<para bbox="[177.63, 895.21, 1522.8, 1256.77]" conf="0.92" pg_no="60">In this section, we report the experimental results obtained during our evaluation. Our
experiments focus on determining the improvements in terms of human resources, such as
annotation time, obtained using the AI-aided annotation strategy. We also analyze whether
the new dataset, consisting of scanned documents and pages with instances from
lower-frequency categories, can be helpful in improving the performance of object detection models.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[178.53, 1327.85, 690.18, 1406.97]" conf="0.79" pg_no="60">5.4.1 Annotation Time</name>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[174.2, 1438.17, 563.58, 1516.59]" conf="0.63" pg_no="60">Experimental Setup</name>
				<paragraphs>
					<para bbox="[179.96, 1544.23, 1523.1, 1954.76]" conf="0.9" pg_no="60">To construct our proposed AI-aided annotation framework, we used the bounding box widget
from the open source framework pylabel, which was integrated with a pre-trained object
detection model. We trained a YOLOv7 model [40] on ETD-OD [1] and a small set of ∼2K
scanned documents. We only used a small number of samples from the scanned documents
dataset, as that was the only sample available at the time. The model obtained was then used
in our AI-aided framework to generate the proposed labels. We will refer to this model as</para>
					<para bbox="[166.61, 271.37, 1526.82, 507.97]" conf="0.83" pg_no="61">YOLOv7_base in the remainder of the discussion. As noted in [1], YOLOv7 outperforms
other models in the object detection task, so we use it as the detection model for empirical
evaluation.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[161.89, 548.44, 576.14, 639.82]" conf="0.63" pg_no="61">Evaluation Settings</name>
				<paragraphs>
					<para bbox="[153.39, 666.35, 1537.59, 1722.07]" conf="0.86" pg_no="61">To determine whether the proposed AI-aided annotation scheme reduces resource requirements,
we compare the time required to label images under different settings.
• No Model Assistance: This is the classical labeling setting under which the annotators
are shown neither bounding boxes nor the respective labels for page images.
• AI-Aided-v1: Under this setting, for each image, the annotators were shown the bounding
boxes generated by the YOLOv7_base model.
• AI-Aided-v2: For this setting, we fine-tuned the YOLOv7_base model on a set of
10K page images labeled using our AI-aided annotation scheme. This was done to evaluate
whether the assistance of a model trained on an additional new dataset affects the
annotation time. We then used this model to generate bounding boxes for each image
shown to the annotators.
In the two AI-aided settings, annotators were asked first to review the model-generated
annotations. All correct annotations were left unchanged, and annotators were asked to
modify only missing, incorrect, or extra bounding boxes. For each of the three settings, each of the
four annotators annotated ∼500 pages, and the time spent on annotation was recorded.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[162.75, 1773.29, 366.84, 1853.42]" conf="0.55" pg_no="61">Results</name>
				<paragraphs>
					<para bbox="[173.66, 1882.82, 1525.82, 2039.96]" conf="0.79" pg_no="61">In Figure 5.4, we report the average time spent per page by each of the annotators under
different annotation settings. The following observations can be made:</para>
					<para bbox="[167.75, 279.05, 1530.66, 1297.01]" conf="0.89" pg_no="62">• Model assistance significantly reduces annotation time: As we can observe from
the graph, the average time required to annotate a page without the assistance of a
model (i.e., without any proposed bounding boxes) is 2-3 times longer than for each of
the AI-aided settings. This is likely because even though the models used for assisting
annotators might have been trained on limited data and coverage (in terms of document
types and object classes), they still possess predictive power to help with many of the
elements found in pages, such as paragraphs and figures. Thus, we can conclude that the
assistance of models trained on existing data significantly helps in annotating more data
by reducing the time required for annotation.
• Model assistance increases with better trained models: Another observation that
can be made from Figure 5.4 is that as we obtain models with better predictive power, the
suggested labels of the model become more accurate, further reducing the time required
to annotate a page. The model used for the AI-Aided-v2 setting had been trained on
10K more samples than the one used in the AI-Aided-v1 setting. The samples used were
also more balanced in terms of object classes. Therefore, it has better predictive power,</para>
					<para bbox="[209.6, 284.77, 1006.0, 368.44]" conf="0.79" pg_no="63">enabling it to be more helpful to human annotators.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[475.84, 1326.9, 1219.97, 1941.02]" conf="0.87" pg_no="62">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_62_XG4R.jpg</path>
						<caption bbox="[237.5, 1947.46, 1448.78, 2021.28]" conf="0.64" pg_no="62">Figure 5.4: Annotation time for each annotator (Ann-1 to Ann-4) under different annotation settings.</caption>
					</figure>
				</figures>
				<tables/>
			</section>
			<section>
				<name bbox="[179.57, 469.71, 951.6, 547.64]" conf="0.83" pg_no="63">5.4.2 Object Detection Performance</name>
				<paragraphs>
					<para bbox="[178.39, 584.3, 1524.55, 742.01]" conf="0.73" pg_no="63">In this analysis, we present our findings on how the AI-aided annotated dataset helps improve
object detection performance. The specific details of this analysis are described below.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[163.26, 815.19, 650.93, 904.86]" conf="0.57" pg_no="63">Object Detection Model</name>
				<paragraphs>
					<para bbox="[170.68, 932.81, 1525.6, 1295.37]" conf="0.73" pg_no="63">As stated above, we use YOLOv7 as the benchmark object detection model for this analysis.
Since the purpose of this analysis is to determine how training on different datasets impacts
model performance, the specific choice of object detection model is beyond the scope of this
analysis. Moreover, previous studies have shown that YOLOv7 is the state-of-the-art model
for object detection tasks [1, 40, 42].</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[158.6, 1366.74, 461.72, 1453.91]" conf="0.59" pg_no="63">Test Dataset</name>
				<paragraphs>
					<para bbox="[165.22, 1481.78, 1524.15, 2041.26]" conf="0.64" pg_no="63">Since the AI-aided subset of our dataset was constructed with the objective of mitigating the
class imbalance problem, it consists of page images from documents of several types, such
as scanned and digital. Therefore, to analyze how training with the AI-aided dataset helps
object detection models on various types of documents, we construct a test dataset consisting
of page images sampled from ETD-OD [1], as well as the scanned and low-frequency element
pages from ETD-ODv2. This is done to ensure that the test set is representative of diversity
in terms of both document types and object types. The breakdown of images and objects in
the test dataset is shown in Table 5.2.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[494.0, 280.84, 1194.03, 564.28]" conf="0.91" pg_no="64">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_64_PFED.jpg</path>
						<caption bbox="[517.45, 574.57, 1170.94, 630.37]" conf="0.77" pg_no="64">Table 5.2: Distribution of the test dataset.</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[165.17, 687.94, 390.81, 764.04]" conf="0.6" pg_no="64">Baselines</name>
				<paragraphs>
					<para bbox="[163.57, 807.02, 1526.08, 1635.9]" conf="0.87" pg_no="64">We use the versions of the dataset listed below to evaluate object detection performance. All
versions used YOLOv7 as the object detection model. The number of images and objects in
each version is listed in Table 5.3.
• Digital: This version of the model was trained only on the digital document images from
ETD-OD. As such, the training dataset contained a small number of samples from the
minority classes due to the class imbalance in the scanned subset.
• Scanned: This version of the model was trained only on the scanned subset of the
ETD-ODv2 dataset. As in the previous setting, the training dataset used in this setting also
has the class imbalance problem.
• Digital + Scanned: Under this setting, the YOLOv7 model was trained on the combined
images of scanned and digital documents, that is, a merged set consisting of the
two dataset splits described above.</para>
					<para bbox="[173.92, 272.46, 1521.52, 560.0]" conf="0.75" pg_no="65">• Digital + Scanned + AI-Aided: This setting uses the Digital + Scanned split
described above, along with the AI-aided subset of ETD-ODv2. This setting represents a
model that has been trained on diverse types of documents (i.e., digital and scanned) and
consists of a larger number of training instances from each object category.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[309.23, 1672.91, 1385.16, 1946.25]" conf="0.92" pg_no="64">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_64_6F1H.jpg</path>
						<caption bbox="[318.67, 1951.65, 1368.77, 2012.41]" conf="0.68" pg_no="64">Table 5.3: Statistics of different versions of the data set used for training.</caption>
					</table>
					<table>
						<path bbox="[177.09, 577.16, 1508.98, 1600.74]" conf="0.9" pg_no="65">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_65_1BPW.jpg</path>
						<caption bbox="[476.82, 1613.59, 1207.45, 1676.63]" conf="0.68" pg_no="65">Table 5.4: Object detection performance results.</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[167.32, 1772.55, 559.96, 1854.26]" conf="0.55" pg_no="65">Evaluation Metrics</name>
				<paragraphs>
					<para bbox="[174.18, 1882.3, 1520.45, 2040.39]" conf="0.77" pg_no="65">We use two commonly used object detection metrics to evaluate the results of different
models discussed above. Both metrics are based on the average precision (AP), which is</para>
					<para bbox="[156.56, 272.9, 1536.25, 994.41]" conf="0.89" pg_no="66">calculated based on the number of predicted objects that overlap with the ground-truth
object over a certain threshold in terms of the area. The two metrics are described in detail
below.
• AP@0.50 / mAP@0.50: For a given object category, AP@0.50 is the percentage of
predicted bounding boxes that overlap with the true bounding boxes by more than 50%
in terms of area. mAP@0.50 is the average of AP@0.50 for all object categories.
• AP@0.50:0.95 / mAP@0.50:0.95: This is calculated by first calculating the AP at
different thresholds, from 0.50 to 0.95, with a step of 0.05. All these AP values are
averaged to compute AP@0.50:0.95 for an object category. mAP@0.50:0.95 is the average
of AP@0.50:0.95 for all object categories.</para>
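					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The averaging steps behind these metrics can be illustrated with a short sketch. It assumes that overlap is measured as intersection over union (IoU) of two boxes, which is the usual convention for these metrics; the text above only speaks of overlap in terms of area. The function names, the (x1, y1, x2, y2) box format, and the example AP values are hypothetical, and the per-threshold AP values are taken as given rather than computed from detections.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def thresholds(start=0.50, stop=0.95, step=0.05):
    """Overlap thresholds 0.50, 0.55, ..., 0.95."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def map_over_thresholds(ap_table):
    """ap_table maps a category to {threshold: AP at that threshold}.

    Returns the per-category AP@0.50:0.95 and the overall mAP@0.50:0.95.
    """
    per_category = {
        category: mean(aps[t] for t in thresholds())
        for category, aps in ap_table.items()
    }
    return per_category, mean(per_category.values())


# Made-up AP values, used only to show the two averaging steps.
ap_table = {
    "figure": {t: 0.9 for t in thresholds()},
    "algorithm": {t: 0.5 for t in thresholds()},
}
per_category, overall = map_over_thresholds(ap_table)
print(per_category, overall)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

In the last line the two boxes share half of each box's area, yet their IoU is only one third, which would not pass the 0.50 threshold under the IoU convention assumed here.</para>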
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[157.73, 1079.28, 373.91, 1162.39]" conf="0.55" pg_no="66">Results</name>
				<paragraphs>
					<para bbox="[158.77, 1197.24, 1532.05, 2038.01]" conf="0.9" pg_no="66">Table 5.4 shows the results obtained on the test dataset described above in each of the
training settings. Based on the results shown, the following observations can be made.
• Performance w.r.t. document type: The subset of images used to train the Scanned
model had the highest amount of noise and lowest quality (e.g., blurred) as compared to
the training dataset used for other models. This results in the lowest overall performance
of the model.
• Size of the training dataset: The Scanned model was trained on the smallest training
dataset. Consequently, it has the lowest performance among all four variants. The large
size of the training dataset used in Digital + Scanned + AI-Aided helps achieve the
best overall performance.
• Performance on minority classes: We also find that training on a dataset with a
better distribution in terms of object classes significantly improves performance. As can</para>
					<para bbox="[169.74, 283.71, 1535.51, 1306.81]" conf="0.91" pg_no="67">be seen from the results shown, the performance of certain categories, such as Degree
and Algorithm, increased by ∼20%. This shows that model performance on certain
low-performing categories can be improved by training on a larger number of samples from
such categories.
• Weak labels can be helpful signals for targeted annotation: Another observation
that can be made from the performance improvements achieved on low-frequency categories
is that weak labels generated from an existing model can serve as a good indicator
for more targeted annotation. Although using such labels cannot guarantee coverage,
they can still address performance issues to a great extent.
• Overall performance: Finally, we can also observe that performance improvements
are achieved in other categories that were not included in the filter set. This can be
attributed to the fact that while the AI-Aided data consisted of pages filtered based on
the occurrence of minority elements, these pages also contained other elements in addition
to those from the filter set. This helped the model to be trained on more samples from
other object categories as well, thus improving the performance across all object classes.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
		</chapter>
		<chapter>
			<title bbox="[165.82, 261.06, 1450.36, 747.18]" conf="0.91" pg_no="68">Chapter 6
Structured Representations of Long
Scholarly Documents</title>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[176.62, 874.68, 767.21, 959.19]" conf="0.8" pg_no="68">6.1 Chapter Overview</name>
				<paragraphs>
					<para bbox="[170.39, 1000.71, 1528.42, 2023.81]" conf="0.96" pg_no="68">In Chapters 3, 4, and 5, we discussed how object detection can be used to detect and extract
important scholarly elements from long documents such as ETDs. However, the scope of
these chapters was limited to extracting elements from the document pages. In reality, a
long PDF document such as an ETD consists of many pages, each of which contributes
to the overall organization of the document, which can be represented as a hierarchical
structure. Converting the “unordered set” of objects extracted from layout parsing methods
to a structured format which can represent the organization of information in an ETD
can be very helpful to support downstream tasks such as document/figure search, chapter
summarization, etc. The structured versions can also be used to support accessibility needs
of those with disabilities, by means of accessibility tools such as on-screen readers. However,
generating structured versions of ETDs is a non-trivial task, and involves several challenges,
as discussed below:
• Identifying delimiters: Delimiters, such as chapter and section elements, are one of
the most important components of the information structure of an ETD. The inherently
long nature of ETDs makes correct identification of delimiters an important component</para>
					<para bbox="[167.87, 276.75, 1529.15, 1124.34]" conf="0.94" pg_no="69">in ETD parsing. They are useful in segmenting the document into multiple smaller
components, thus making it easier for the reader. They are also useful in downstream
tasks that rely on segmented units of a long document, such as chapter summarization.
• Linking objects: Many object types have relationships with each other, and
correct identification of such relationships can be useful in several downstream tasks.
For example, linking figures and/or tables to the respective captions can be useful
in figure/table search. As such, identifying such relationships is important during
information extraction from scholarly documents.
In this chapter, we address the task of converting the extracted set of elements to a structured
format, such as XML, so that the information in a document can be made useful for other
downstream tasks. We also present a system that can allow for easy navigation of a long
PDF document, using the information from the generated XML format.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[177.72, 1205.49, 655.04, 1289.38]" conf="0.77" pg_no="69">6.2 XML Schema</name>
				<paragraphs>
					<para bbox="[171.72, 1331.67, 1528.93, 2039.09]" conf="0.94" pg_no="69">Based on the structure of an ETD, in Schema 6.1, we present an XML schema that can be
used to capture the organization of content in an ETD in a structured format. The schema
is based on the following observations.
• The overall information in an ETD can broadly be encapsulated into three high-level
categories. front consists of elements that can give key identifiable information, as well
as an overall summary about the work. These include metadata elements, abstract,
and list(s) of contents, figures, and tables.
• body consists of elements that can give in-depth information about the content of a
document. It contains a list of chapters, each of which further contains a list of sections.
The sections encapsulate detailed informational elements contained therein.</para>
					<para bbox="[194.41, 277.78, 1517.21, 435.86]" conf="0.27" pg_no="70">• back consists of information that often is not critical for the understanding of a
document. This includes a list of references and the appendices.</para>
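					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">As a rough illustration of the three-part organization described above, the following sketch builds a skeleton document with Python's standard xml.etree.ElementTree module. It is not Schema 6.1 itself: the root element etd and the child elements title, name, and references are assumptions chosen for the example, and build_skeleton is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_skeleton(title, chapters):
    """Build an etd tree with front, body, and back parts.

    chapters is a list of (chapter_title, [section_name, ...]) pairs.
    """
    etd = ET.Element("etd")

    # front: metadata and summary-level elements.
    front = ET.SubElement(etd, "front")
    ET.SubElement(front, "title").text = title

    # body: a list of chapters, each holding a list of sections.
    body = ET.SubElement(etd, "body")
    for chapter_title, section_names in chapters:
        chapter = ET.SubElement(body, "chapter")
        ET.SubElement(chapter, "title").text = chapter_title
        for section_name in section_names:
            section = ET.SubElement(chapter, "section")
            ET.SubElement(section, "name").text = section_name

    # back: references and appendices (only an empty placeholder here).
    back = ET.SubElement(etd, "back")
    ET.SubElement(back, "references")
    return etd


root = build_skeleton(
    "A Sample ETD",
    [("Introduction", ["1.1 Motivation", "1.2 Contributions"])],
)
print(ET.tostring(root, encoding="unicode"))
```

Keeping chapters and sections as nested elements means downstream tools can address a unit such as a single chapter directly, without re-parsing page-level detections.</para>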
				</paragraphs>
				<footnotes/>
				<algorithms>
					<algorithm bbox="[165.86, 289.89, 1525.63, 1322.44]" conf="0.81" pg_no="71">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_71_22IM.jpg</algorithm>
				</algorithms>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[174.7, 1612.86, 739.12, 1698.46]" conf="0.77" pg_no="71">6.3 XML Generation</name>
				<paragraphs>
					<para bbox="[177.65, 1746.61, 1522.41, 2041.95]" conf="0.91" pg_no="71">As discussed earlier, two challenges hinder the process of converting a PDF and its respective
objects from each of the pages into the XML format shown above. We will address these
challenges by observing the errors found in a uniformly sampled set of documents, and then
formulating a set of rules derived based on domain expertise regarding document structure.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[180.98, 283.68, 777.74, 360.55]" conf="0.81" pg_no="72">6.3.1 Identifying Delimiters</name>
				<paragraphs>
					<para bbox="[158.27, 428.86, 1539.19, 2040.02]" conf="0.93" pg_no="72">We discuss some of the commonly found errors in delimiters below. While the list is not
exhaustive and might not cover all possible errors, it is based on a user study of a sample
consisting of 25 ETDs by 2 undergraduate students from Virginia Tech’s course CS 4624
(Multimedia, Hypertext, and Information Access) in Spring 2023.
• Error: Last line of a paragraph on a new page being detected as a chapter heading.
Reason: Many chapter headings in ETDs appear as the first line of a document, and are
only a few words (less than a line) long. The last line of a paragraph resembles such
chapter headings.
Proposed Rules:
– Chapter headings that do not start with a capital letter are re-labeled as
paragraph.
– The last paragraph of the previous page should end with an end punctuation mark.
• Error: Chapter headings in headers and footers being regarded as the start of new chapters.
Reason: Many documents contain headers and footers on every page, which
contain the title of the current chapter. Due to similarities between such elements
and the actual chapter title, such as the presence of the “chapter” keyword, the model
might regard them as chapter titles.
Proposed Rules: When identifying a new instance of a chapter element in the
parser, ensure that the title of the new chapter differs from that of the previous chapter.
Some other rules that are applied to chapter elements include:
• Presence in Table of Contents: For all the detected chapters, we check for their</para>
					<para bbox="[171.37, 282.26, 1533.88, 1320.4]" conf="0.94" pg_no="73">existence in the table of contents. A list of all the entries from the table of contents
is extracted, and then each of the elements is checked against this set of entries. The
matching is done using fuzzy string matching, to make sure the chapter titles overlap
with at least one table of contents entry, with similarity above a certain threshold.
This threshold will be derived empirically. Additionally, since we also detect the page
number as one of the objects, we can match the page number in the table of contents
entry against the chapter title and its detected page number as a further validation
step.
• Location in the page: Chapter titles often appear at the top of the page. Based on
this observation, we can retain only the tentatively detected chapter titles that occur in
the top half of the page, based on their y-coordinate.
While such rules may be helpful in fixing incorrect predictions, their scope is limited to false
positive predictions only. This means that the rules cannot help us identify the objects that
were not detected by the object detection method. Those objects can only be identified by
a better object detection model, and we leave that as future work.</para>
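					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The rules above can be combined into a single filter over tentatively detected chapter titles. The sketch below is one possible reading of those rules, not the parser used in this work. The candidate fields (title, prev_page_last_paragraph, y_top, page_height), the helper names, and the 0.8 similarity threshold are assumptions; the text states that the threshold is to be derived empirically. Fuzzy matching uses SequenceMatcher from Python's standard difflib module as a stand-in for whichever fuzzy matcher is actually used.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace before comparing titles."""
    return " ".join(text.lower().split())


def toc_similarity(title, toc_entries):
    """Best fuzzy-match ratio between a title and any table of contents entry."""
    ratios = (
        SequenceMatcher(None, normalize(title), normalize(entry)).ratio()
        for entry in toc_entries
    )
    return max(ratios, default=0.0)


def keep_chapter(candidate, previous_title, toc_entries, threshold=0.8):
    title = candidate["title"].strip()
    # Rule: a chapter heading must start with a capital letter.
    if not title[:1].isupper():
        return False
    # Rule: the last paragraph of the previous page must end a sentence.
    prev_text = candidate["prev_page_last_paragraph"].rstrip()
    if prev_text and not prev_text.endswith((".", "?", "!")):
        return False
    # Rule: a repeated title is treated as a running header or footer.
    if previous_title is not None and normalize(title) == normalize(previous_title):
        return False
    # Rule: chapter titles are expected in the top half of the page.
    if candidate["y_top"] > 0.5 * candidate["page_height"]:
        return False
    # Rule: the title must fuzzily match a table of contents entry.
    return toc_similarity(title, toc_entries) >= threshold


def filter_chapters(candidates, toc_entries, threshold=0.8):
    kept, previous_title = [], None
    for candidate in candidates:
        if keep_chapter(candidate, previous_title, toc_entries, threshold):
            previous_title = candidate["title"].strip()
            kept.append(previous_title)
    return kept


def cand(title, prev_text, y_top=150.0, page_height=2200.0):
    return {"title": title, "prev_page_last_paragraph": prev_text,
            "y_top": y_top, "page_height": page_height}


toc = ["Chapter 1 Introduction", "Chapter 2 Related Work", "Chapter 3 Methods"]
candidates = [
    cand("Chapter 1 Introduction", ""),
    cand("Chapter 1 Introduction", "This concludes the overview."),  # running header
    cand("of the proposed method.", "We now describe the details"),  # paragraph tail
    cand("Chapter 2 Related Wrk", "We summarize the chapter here."),  # noisy text
    cand("Chapter 3 Methods", "End of the previous chapter.", y_top=1800.0),
]
print(filter_chapters(candidates, toc))
```

As noted above, a filter of this kind only removes false positives; it cannot recover chapter headings that the detector missed.</para>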
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[190.23, 1433.94, 1275.24, 1511.72]" conf="0.83" pg_no="73">6.3.2 Linking Figures and Tables with their Captions</name>
				<paragraphs>
					<para bbox="[178.79, 1549.43, 1523.13, 2037.52]" conf="0.95" pg_no="73">For each of the element types (e.g., figures and tables) that need to be linked with their
captions, we first identify the order of the element and its caption. Some documents may
contain a caption below the figures, while others might contain captions above. The same
also applies to tables. Hence, for each document, we iterate through all the detected figures
and count the number of figures that have a caption above them, and the number of figures
that have a caption below. Based on the maximum of the two numbers, we determine the
order of figures and their captions. The same process is followed for tables to determine the</para>
					<para bbox="[171.33, 276.62, 1523.4, 660.57]" conf="0.91" pg_no="74">table-caption order.
Next, for each figure and table, based on the determined order, we find the nearest
corresponding caption element. A special case is figures that have captions on different pages. A
methodology to link such figures with their captions would be a direction for future work in
this domain.</para>
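					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The two steps described above (a per-document vote on caption order, followed by nearest-caption matching) can be sketched as follows. The dictionary fields id, page, y_top, and y_bottom, the function names, and the tie-breaking choice in favour of captions below are assumptions made for illustration. Captions on a different page are simply left unlinked here, in line with the special case noted above.

```python
def vertical_gap(element, caption, order):
    """Gap between an element and a caption on the given side, else None."""
    if order == "below":
        gap = caption["y_top"] - element["y_bottom"]
    else:
        gap = element["y_top"] - caption["y_bottom"]
    return gap if gap >= 0 else None


def nearest_caption(element, captions, order):
    """Closest same-page caption lying on the requested side of the element."""
    best, best_gap = None, None
    for caption in captions:
        if caption["page"] != element["page"]:
            continue
        gap = vertical_gap(element, caption, order)
        if gap is not None and (best_gap is None or best_gap > gap):
            best, best_gap = caption, gap
    return best


def dominant_order(elements, captions):
    """Vote over all elements: are captions mostly below or above?"""
    below = sum(1 for e in elements if nearest_caption(e, captions, "below"))
    above = sum(1 for e in elements if nearest_caption(e, captions, "above"))
    return "below" if below >= above else "above"


def link_captions(elements, captions):
    order = dominant_order(elements, captions)
    links = {}
    for element in elements:
        caption = nearest_caption(element, captions, order)
        links[element["id"]] = caption["id"] if caption else None
    return order, links


figures = [
    {"id": "fig1", "page": 3, "y_top": 100, "y_bottom": 500},
    {"id": "fig2", "page": 3, "y_top": 700, "y_bottom": 1100},
    {"id": "fig3", "page": 5, "y_top": 200, "y_bottom": 600},
]
captions = [
    {"id": "cap1", "page": 3, "y_top": 510, "y_bottom": 560},
    {"id": "cap2", "page": 3, "y_top": 1110, "y_bottom": 1160},
    {"id": "cap3", "page": 5, "y_top": 610, "y_bottom": 660},
]
print(link_captions(figures, captions))
# ('below', {'fig1': 'cap1', 'fig2': 'cap2', 'fig3': 'cap3'})
```

The same functions can be called with detected tables and table captions to determine the table-caption order separately.</para>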
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[186.47, 729.67, 1171.01, 808.63]" conf="0.78" pg_no="74">6.3.3 Linking Equations and Equation Numbers</name>
				<paragraphs>
					<para bbox="[182.89, 842.48, 1514.89, 994.89]" conf="0.9" pg_no="74">Equation elements are linked to the nearest equation number elements based on the
y-coordinate.</para>
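					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">A minimal sketch of this linking step is shown below. It assumes that distance is measured between the vertical centres of the two boxes and that only equation numbers on the same page are considered; neither detail is specified above. The field names and the function link_equation_numbers are hypothetical.

```python
def y_center(box):
    """Vertical centre of a detected box."""
    return (box["y_top"] + box["y_bottom"]) / 2.0


def link_equation_numbers(equations, numbers):
    """Map each equation id to the id of the vertically nearest number."""
    links = {}
    for equation in equations:
        same_page = [n for n in numbers if n["page"] == equation["page"]]
        if not same_page:
            links[equation["id"]] = None  # no equation number detected
            continue
        nearest = min(
            same_page,
            key=lambda n: abs(y_center(n) - y_center(equation)),
        )
        links[equation["id"]] = nearest["id"]
    return links


equations = [
    {"id": "eq_a", "page": 12, "y_top": 400, "y_bottom": 480},
    {"id": "eq_b", "page": 12, "y_top": 900, "y_bottom": 1000},
    {"id": "eq_c", "page": 13, "y_top": 300, "y_bottom": 360},
]
numbers = [
    {"id": "num_1", "page": 12, "y_top": 425, "y_bottom": 455},
    {"id": "num_2", "page": 12, "y_top": 935, "y_bottom": 965},
]
print(link_equation_numbers(equations, numbers))
# {'eq_a': 'num_1', 'eq_b': 'num_2', 'eq_c': None}
```

Because equation numbers are typically set on the same line as the equation, a vertical distance is usually enough to disambiguate them, which is why no horizontal coordinate is used in this sketch.</para>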
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[189.91, 1084.22, 1504.02, 1172.83]" conf="0.83" pg_no="74">6.4 PDF to HTML Browser for Improved Accessibility</name>
				<paragraphs>
					<para bbox="[175.54, 1211.46, 1521.11, 1698.7]" conf="0.95" pg_no="74">In addition to generating structured representations of the entire PDF using objects detected
from individual page images, we develop a working system that allows users to view ETDs
in an accessible format. The system allows users to upload the document of their preference
and then view it in a web-based UI. This system is built as a Flask application, which first
generates the structured version of a document based on the XML format shown earlier, and
then displays the document in the browser. This system offers multiple use-cases, as listed
below.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[187.79, 1774.38, 1110.73, 1851.24]" conf="0.87" pg_no="74">6.4.1 User-friendly View of Long Documents</name>
				<paragraphs>
					<para bbox="[188.39, 1885.34, 1512.85, 2035.47]" conf="0.91" pg_no="74">One of the well-known problems of ETDs is that they are inherently long documents, and
navigating them is hard. Some existing studies [41] have shown that allowing users to be</para>
					<para bbox="[176.57, 276.96, 1519.71, 507.18]" conf="0.92" pg_no="75">able to read long PDF documents in a web-based application is helpful and can improve
the readability of such documents. By allowing users to view a long ETD in a web-based
application, we expect increased usage and adoption of such documents by researchers.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[191.18, 584.07, 1324.4, 664.4]" conf="0.84" pg_no="75">6.4.2 Improved Accessibility for Those with Disabilities</name>
				<paragraphs>
					<para bbox="[177.54, 694.23, 1523.78, 1250.68]" conf="0.94" pg_no="75">A common limitation of PDF documents is their limited compatibility with accessibility
tools such as on-screen readers. Such compatibility is crucial for users with special needs, such as those
with blindness, as such users often rely on accessibility tools for access to knowledge. In
recent years, tools such as PREP have been developed to help with tagging PDFs to make
them compatible with on-screen readers. However, based on our analysis, we found that the
automatic tagging feature of PREP does not work well in the case of ETDs, thus limiting
the usability of such documents by users with accessibility needs. On the other hand,
HTML-based applications can be very well integrated with on-screen readers.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[176.53, 1348.77, 684.42, 1433.34]" conf="0.81" pg_no="75">6.5 System Design</name>
				<paragraphs>
					<para bbox="[174.29, 1474.73, 1525.54, 1953.28]" conf="0.95" pg_no="75">Figure 6.1 shows an overview of our PDF to HTML parsing system for ETDs. The system
allows users to view long PDF versions of ETDs in a user-friendly and accessible format in
a web-based interface. While currently the system requires users to upload a PDF file they
want to parse and view, in the future this will be merged with the integrated system for
ETDs, expected to be developed by the CS5604 class in Fall 2023. This will allow offline
processing, so that users can view one of the many ETDs in an institutional repository in
an accessible format. The different components of the UI are explained in detail below.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[181.38, 281.97, 1504.19, 1028.54]" conf="0.79" pg_no="76">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_76_640M.jpg</path>
						<caption bbox="[381.98, 1031.78, 1310.59, 1094.46]" conf="0.75" pg_no="76">Figure 6.1: An overview of the PDF to XML to HTML system.</caption>
					</figure>
				</figures>
				<tables/>
			</section>
			<section>
				<name bbox="[182.05, 1162.42, 816.3, 1233.42]" conf="0.78" pg_no="76">6.5.1 Side Bar for Navigation</name>
				<paragraphs>
					<para bbox="[175.33, 1287.66, 1525.22, 2041.72]" conf="0.92" pg_no="76">ETDs are inherently long documents consisting of multiple components such as chapters.
Each chapter often consists of multiple sub-components such as sections, wherein key infor-
mation such as text, figures, and tables exist. To allow users to navigate through a long
document, it is often desirable to have a high-level view of the document. While digital
documents, such as those written in LaTeX, can support this via applications such as Acrobat
Viewer, many ETDs do not support this either because they were written using a legacy
tool, or are scanned documents. Hence, to help with such documents, our system allows
navigation using a collapsible side bar. The side bar shows a list of chapters that were
extracted from the document. Each chapter is a nested list that consists of the sections in
the corresponding chapter. Some sections also contain elements such as figures and tables,
which often contain important findings of a document. Hence, a third level of nesting shows</para>
					<para bbox="[181.25, 282.62, 1516.6, 436.94]" conf="0.89" pg_no="77">a list of tables and figures (based on the captions) for the corresponding section. Each of the
entries in the side bar has a hyperlink to the corresponding element in the main document.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[182.49, 502.48, 551.29, 571.52]" conf="0.83" pg_no="77">6.5.2 PDF View</name>
				<paragraphs>
					<para bbox="[180.65, 610.43, 1516.69, 957.25]" conf="0.92" pg_no="77">To allow users to keep track of the original documents, as well as to support cross-referencing,
the original PDF document is shown in the right side bar. This side bar can be extended
in width for those who might want to have a detailed look at the PDF document. It also
serves as a testing tool for the document parser, so that researchers can evaluate the quality
of extracted components by directly cross-referencing them with the original document.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.04, 1026.12, 659.45, 1096.37]" conf="0.82" pg_no="77">6.5.3 Document View</name>
				<paragraphs>
					<para bbox="[178.21, 1132.21, 1523.8, 1613.49]" conf="0.95" pg_no="77">This is the main component of our document viewer, which shows the content of the document.
The top part of this space shows the document metadata such as title, author name, and
university. It is followed by the main content of the document. Figures, tables, equations,
and algorithms are displayed as images. Each of the contents shown in this section can be
cross-referenced in the original PDF being displayed in the right side bar using a click. This
functionality allows users to cross-reference elements such as mathematical text, which are
likely to become erroneous or confusing in the PDF-to-text extraction process.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
		</chapter>
		<chapter>
			<title bbox="[177.95, 266.89, 1356.57, 743.17]" conf="0.92" pg_no="78">Chapter 7
Topic Modeling based System for
Analyzing and Browsing ETDs</title>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[177.45, 874.99, 768.04, 959.45]" conf="0.78" pg_no="78">7.1 Chapter Overview</name>
				<paragraphs>
					<para bbox="[174.18, 1003.05, 1526.85, 2023.77]" conf="0.96" pg_no="78">As discussed earlier, many downstream tasks rely on NLP algorithms, which require specific
elements of a long document, such as title, abstract, chapter text, etc. One such line of work
that is of value in the analysis of ETDs is topic modeling, which aims to extract thematic
collections of words that could represent topics, from a large corpus of text documents. The
representations learned from topic models can be used for downstream tasks that rely on
document representations, such as finding similar documents (document recommendation),
finding similar topics, analyzing the variation of topics over time, etc.
In this work, we propose ETD-Topics, a topic modeling based framework for analyzing and
discovering information contained in ETDs using several state-of-the-art topic models. Our
framework allows users to extract topics present in an ETD collection using any one of the
several topic models provided. Users can then select a topic of interest, and do further anal-
ysis of the topic using multiple end-user services supported in our framework. Supported
services include searching documents associated with a particular topic, calculating the dis-
tribution of the documents w.r.t. topics, document recommendation, topic recommendation,
and topic trend analysis based on time range and/or university. Moreover, since topic mod-</para>
					<para bbox="[181.49, 283.31, 1516.0, 437.34]" conf="0.86" pg_no="79">els are unsupervised in nature, our framework does not require any handcrafted labels such
as categories, thereby making it easily deployable and scalable for new document collections.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[178.99, 553.71, 816.79, 640.4]" conf="0.8" pg_no="79">7.2 System Architecture</name>
				<paragraphs>
					<para bbox="[181.22, 1187.91, 1199.84, 1264.62]" conf="0.78" pg_no="79">Fig. 7.1 shows the architecture of our framework, as described below.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[189.78, 690.26, 1516.34, 1048.13]" conf="0.89" pg_no="79">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_79_62HZ.jpg</path>
						<caption bbox="[542.65, 1058.92, 1147.29, 1117.99]" conf="0.82" pg_no="79">Figure 7.1: An overview of ETD-Topics.</caption>
					</figure>
				</figures>
				<tables/>
			</section>
			<section>
				<name bbox="[183.92, 1375.37, 582.34, 1442.61]" conf="0.79" pg_no="79">7.2.1 Data Source</name>
				<paragraphs>
					<para bbox="[179.52, 1496.05, 1510.67, 2030.49]" conf="0.77" pg_no="79">Since our framework aims to assist in analysis of massive amounts of ETD data, we require
a large collection of text ETDs. For each ETD, we use its title and abstract as the corre-
sponding text. This text is then tokenized and goes through a series of preprocessing steps,
such as stop word and punctuation removal, removing terms with low document frequency
(infrequent words), and lemmatization. We also drop documents whose token count is less
than a certain threshold (20 in this case), as these are likely to be documents with
limited or missing text. Finally, we obtain a list of tokens for each document that can be
sent to the topic modeling module.</para>
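					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in this work: the stop word list, the function names, and the min_doc_freq cutoff are assumptions made for the example; lemmatization is left out because it needs an external library; and the length check is assumed to be applied after the other filters.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only, which also drops punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(docs, min_doc_freq=2, min_tokens=20):
    """Return one token list per kept document.

    docs: list of strings, each the title plus abstract of one ETD.
    min_doc_freq: hypothetical cutoff for removing infrequent terms.
    min_tokens: documents with fewer remaining tokens are dropped (20 in the text).
    """
    # Tokenize and remove stop words.
    tokenized = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    kept = []
    for tokens in tokenized:
        tokens = [t for t in tokens if doc_freq[t] >= min_doc_freq]
        if len(tokens) >= min_tokens:
            kept.append(tokens)
    return kept
```

Each returned token list would then be passed to the topic modeling module.</para>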
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.0, 285.1, 652.73, 360.22]" conf="0.8" pg_no="80">7.2.2 Topic Modeling</name>
				<paragraphs>
					<para bbox="[160.12, 433.05, 1534.89, 2030.08]" conf="0.94" pg_no="80">This module forms the main backbone of our system. It takes the preprocessed data as input
and uses topic modeling algorithms to extract the topics from the document corpus. The
topic modeling algorithms currently supported are:
• LDA [6]: LDA is one of the earliest topic models, that uses Bayesian priors as the initial
document-topic and topic-word assignments, and then updates these distributions
based on the probability with which a document or a word is associated with a certain
topic.
• NeuralLDA [36]: This is the neural network based version of LDA, that utilizes a
variational inference method for learning document-topic representations.
• ProdLDA [36]: This is an improved version of NeuralLDA, that is designed to give
more coherent and interpretable topics.
• CTM [5]: In contrast to other topic models that use bag-of-words representations for
text and hence ignore the order of words, this model combines representations from
language models like BERT [20] in the topic modeling process, thus incorporating word
context.
Since topic models require several iterations over the dataset for training, we train all the
models offline, using different numbers of topics for each model. We set the number of
topics (denoted as K) to {10, 25, 50, 100} while training the models, thus resulting in 16
pre-trained models (4 models, each with a different value of K, for each of the 4 algorithms
listed above).
Topic models typically give two types of outputs. The first is a K × V topic-word distribution
matrix, where V is the vocabulary size. Each matrix row represents the importance of each
of the words in the vocabulary for the respective topic. The second is an M × K document-</para>
					<para bbox="[175.51, 280.44, 1518.41, 439.03]" conf="0.86" pg_no="81">topic distribution matrix, where M is the number of documents in the corpus; each row
represents the proportion of each of the topics in the respective document.</para>
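					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">To make the two outputs concrete, the sketch below reads a toy K × V topic-word matrix and a toy M × K document-topic matrix stored as nested Python lists. The function names and the toy values are assumptions for illustration only; the trained models return their matrices in library-specific formats.

```python
def top_words(topic_word, vocab, n=10):
    """For each of the K rows, return the n vocabulary words with the largest weight."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n]])
    return result


def dominant_topic(doc_topic_row):
    """Index of the topic with the largest proportion in one document."""
    return max(range(len(doc_topic_row)), key=lambda k: doc_topic_row[k])


# Toy example: K = 2 topics, V = 4 words, M = 3 documents.
vocab = ["gene", "protein", "network", "router"]
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0
    [0.05, 0.05, 0.50, 0.40],  # topic 1
]
doc_topic = [
    [0.9, 0.1],  # document 0 is mostly topic 0
    [0.2, 0.8],  # document 1 is mostly topic 1
    [0.6, 0.4],  # document 2 is mixed
]
```

Calling top_words(topic_word, vocab, n=2) gives the two highest-weighted words of each topic, and dominant_topic applied to a row of doc_topic gives the most prominent topic of that document.</para>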
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[177.01, 509.12, 617.74, 583.25]" conf="0.75" pg_no="81">7.2.3 User Services</name>
				<paragraphs>
					<para bbox="[174.48, 616.37, 1523.27, 772.07]" conf="0.69" pg_no="81">The front end user interface (UI) encapsulates multiple downstream tasks and services for
users of a digital library. Below are descriptions of services illustrated in Fig. 7.2.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[161.9, 822.0, 485.0, 907.42]" conf="0.48" pg_no="81">Topic Browser</name>
				<paragraphs>
					<para bbox="[171.43, 938.81, 1534.35, 2042.79]" conf="0.79" pg_no="81">Our framework allows users to select a topic modeling technique and number of topics. Users
can then use the following services:
• Documents per Topic Distribution: This module helps users find the most popular
topics in the document collection. Given a threshold value (on a scale of [0, 1], default
= 0.3) and a topic, this component calculates the number of documents in the entire
database for which the given topic constitutes more than the threshold. The overall
results are displayed as a histogram, where each bar shows the number of documents
for the respective topic.
• Topic List: For every topic, this module shows the top 10 words that are representative
of that topic; the set thus serves as a type of label. Because of the unsupervised
nature of topic models, it is not possible to get a short, semantically or disciplinarily
appropriate label for each topic. Hence, we display the top representative words for each
topic.
• Similar Topics: Some users work in interdisciplinary fields. Often, a selected topic
might not be directly related to such users’ preferences, but might still be correlated with
their requirements. In</para>
					<para bbox="[209.58, 272.65, 1530.51, 512.59]" conf="0.83" pg_no="83">such instances, it is often desirable to show a list of related topics to the user. To
facilitate this process, this module shows related topics for a given topic. This is done
based on similarities between different rows of the topic-word matrix.</para>
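					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The two computations behind the Documents per Topic Distribution and Similar Topics modules can be sketched as follows. The text only states that similarity is computed between rows of the topic-word matrix; cosine similarity is assumed here for illustration, and all function names are hypothetical.

```python
import math


def documents_per_topic(doc_topic, threshold=0.3):
    """For each topic, count the documents whose proportion of that topic exceeds the threshold."""
    counts = [0] * len(doc_topic[0])
    for row in doc_topic:
        for k, weight in enumerate(row):
            if weight > threshold:
                counts[k] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similar_topics(topic_word, topic_id, n=3):
    """Return the ids of the n topics whose topic-word rows are closest to the given topic."""
    scored = [
        (cosine(topic_word[topic_id], row), k)
        for k, row in enumerate(topic_word)
        if k != topic_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [k for _, k in scored[:n]]
```

The list returned by documents_per_topic holds one count per topic, which corresponds to the bar heights of the histogram described above.</para>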
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures>
					<figure>
						<path bbox="[185.74, 532.83, 1472.6, 1588.55]" conf="0.94" pg_no="82">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_82_8K5R.jpg</path>
						<caption bbox="[182.44, 1632.99, 1516.11, 1790.08]" conf="0.76" pg_no="82">Figure 7.2: A snapshot of different user services. (a) Documents per Topic Distribution and
Topic List, (b) Similar Topics and Topic Specific Documents for one topic, (c) Document
page showing Related Topics and Similar Documents for one document, (d) Trend Analysis.</caption>
					</figure>
				</figures>
				<tables/>
			</section>
			<section>
				<name bbox="[163.58, 569.72, 567.84, 656.47]" conf="0.66" pg_no="83">Document Browser</name>
				<paragraphs>
					<para bbox="[169.91, 686.47, 1535.96, 1462.32]" conf="0.85" pg_no="83">The document browser allows users to get specific documents based on their interests. It
mainly consists of two modules:
• Topic Specific Documents: This module allows users to get relevant documents for
one of the many topics shown in the Topic Browser. It selects the documents based
on the presence of the selected topic in the document using the corresponding values
of the document-topic vectors. It then displays the title and abstract of the selected
document. It also allows users to get more details of a specific document by clicking
on it.
• Related Documents: This module assists users in finding documents that are similar
to a selected document. This is especially useful in the case of scholarly documents,
since users are typically interested in finding multiple related works.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[164.55, 1521.57, 599.92, 1606.48]" conf="0.59" pg_no="83">Topic Trend Analysis</name>
				<paragraphs>
					<para bbox="[182.69, 1617.73, 1532.32, 2047.16]" conf="0.84" pg_no="83">• Temporal Analysis: Many users of a digital library, such as university administrators
and faculty members, are interested in analyzing how different research areas trend over
time. This module allows users to filter documents associated with a topic in a given
time range (in years).
• University-Specific Analysis: In some instances, users are interested in analyzing
research trends in their institution, or in peer institutions. This module shows users</para>
					<para bbox="[256.32, 282.24, 1512.28, 436.65]" conf="0.9" pg_no="84">such research trends, by university. Additionally, users can combine this feature with
temporal analysis to visualize institution-specific research trends over time.</para>
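					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">A rough sketch of how temporal and university-specific filtering could be combined is given below. The record layout (the keys "year", "university", and "topics") and the reuse of the 0.3 threshold to decide whether a document is associated with a topic are assumptions made for this example, not details taken from the system.

```python
from collections import Counter


def topic_trend(docs, topic_id, start_year, end_year, university=None, threshold=0.3):
    """Count, per year, the documents associated with a topic.

    docs: list of dicts with hypothetical keys "year", "university", and "topics",
          where "topics" is the document-topic vector of that document.
    university: if given, only documents from that institution are counted.
    """
    years = range(start_year, end_year + 1)
    counts = Counter()
    for doc in docs:
        if doc["year"] not in years:
            continue  # outside the requested time range
        if university is not None and doc["university"] != university:
            continue  # different institution
        if doc["topics"][topic_id] > threshold:
            counts[doc["year"]] += 1
    return dict(sorted(counts.items()))
```

The returned mapping from year to document count can be plotted directly as a trend line for the selected topic.</para>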
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[181.04, 542.8, 970.39, 627.95]" conf="0.87" pg_no="84">7.3 System Setup and Analysis</name>
				<paragraphs>
					<para bbox="[186.84, 672.02, 1518.84, 891.04]" conf="0.92" pg_no="84">The discussion in this section corresponds to what is reported in [15]. Since the study
reported therein, our collection (both size and scope) and our work with ETD-Topics have
broadened.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.48, 983.12, 890.05, 1058.09]" conf="0.87" pg_no="84">7.3.1 Dataset and System Details</name>
				<paragraphs>
					<para bbox="[186.96, 1095.96, 1516.44, 1311.89]" conf="0.9" pg_no="84">Our dataset has ∼320K ETDs from over 42 universities. They come from 1845 – 2020, with
most published after 1945. Our topic models are from open source implementations included
in OCTIS [37]. The UI was developed using Flask with a Python backend.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[183.8, 1406.07, 717.96, 1479.47]" conf="0.84" pg_no="84">7.3.2 Evaluation Metrics</name>
				<paragraphs>
					<para bbox="[179.94, 1515.23, 1526.73, 1948.67]" conf="0.96" pg_no="84">We evaluate the different topic modeling algorithms on two commonly used metrics from
the topic modeling literature. These are explained below:
• Diversity is a measure of how distinct the top words of a topic are w.r.t. top words in
other topics. A score of 0 indicates redundancy, while 1 indicates very diverse topics.
• Coherence measures the degree of semantic similarity between top words from the
same topic. Models with high coherence tend to give more interpretable topics.</para>
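					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">As an illustration of the diversity metric, the sketch below assumes the commonly used definition of topic diversity as the fraction of unique words among the top words of all topics. The exact variant computed for Table 7.1 may differ, and the function name is hypothetical.

```python
def topic_diversity(topics, top_n=10):
    """Fraction of unique words among the top_n words of all topics.

    topics: list of word lists, one per topic, ordered from most to least important.
    Returns a value in (0, 1]; values near 0 mean the topics largely repeat the
    same words, and 1 means no top word is shared between topics.
    """
    top = [words[:top_n] for words in topics]
    total = sum(len(words) for words in top)
    unique = len({w for words in top for w in words})
    return unique / total if total else 0.0
```

For example, two topics that share both of their top two words score 0.5, while two topics with no words in common score 1.0.</para>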
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables>
					<table>
						<path bbox="[227.32, 274.62, 1470.47, 583.48]" conf="0.92" pg_no="85">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_85_E500.jpg</path>
						<caption bbox="[198.55, 592.1, 1500.54, 684.57]" conf="0.81" pg_no="85">Table 7.1: Quantitative comparison of different models, with underlined values indicating
best performing models.</caption>
					</table>
					<table>
						<path bbox="[232.05, 702.5, 1473.28, 921.58]" conf="0.91" pg_no="85">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_85_YK4P.jpg</path>
						<caption bbox="[360.58, 929.55, 1336.12, 991.12]" conf="0.73" pg_no="85">Table 7.2: Corresponding words for a topic from different models.</caption>
					</table>
				</tables>
			</section>
			<section>
				<name bbox="[183.96, 1027.69, 1102.75, 1099.25]" conf="0.82" pg_no="85">7.3.3 Comparison of Different Topic Models</name>
				<paragraphs>
					<para bbox="[174.3, 1135.98, 1526.97, 2036.37]" conf="0.95" pg_no="85">Table 7.1 shows the performance of the four different topic models, for each of the four
numbers of topics, on our collection, for the two metrics discussed above.
We observe that NeuralLDA produces more diverse topics than other models, indicated by its
high diversity score, with CTM being the second best performing model in terms of diversity.
However, the coherence scores for CTM are much better than other models, indicating more
interpretable topics. A good topic model should ideally have high coherence and diversity
scores, since high diversity and low coherence could also mean that the topics are composed
of unique, yet unrelated words which do not indicate any themes. In Table 7.2 we also show
the corresponding words for one topic obtained from all the models. The topic produced by
NeuralLDA is less coherent, indicated by words like thesis and introduce, in line with its low
coherence scores. In contrast, the topics produced by LDA and ProdLDA are cleaner and
have fewer words that are semantically different than the rest of the words, though they do
have some open-ended words like user and provide. CTM produces the most coherent topic,</para>
					<para bbox="[179.34, 284.17, 1510.5, 435.12]" conf="0.9" pg_no="86">which is also reflected by its high coherence scores. It appears that CTM is the best overall
performing model on our ETD corpus.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[189.03, 573.64, 1502.65, 764.42]" conf="0.88" pg_no="86">7.4 Integrating ETD-Topics with Other End-User Ser-
vices</name>
				<paragraphs>
					<para bbox="[177.66, 813.73, 1511.87, 1160.41]" conf="0.85" pg_no="86">In addition to supporting browsing and navigation by means of end-user services illustrated
in Fig. 7.1, our framework can be integrated with many other APIs and end-user services
that require document representations for user satisfaction. An example of such a service is
a search / information retrieval system, which allows users to search for documents related
to user queries.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[184.98, 1290.96, 1190.45, 1368.41]" conf="0.85" pg_no="86">7.4.1 Overview of Information Retrieval Systems</name>
				<paragraphs>
					<para bbox="[178.21, 1411.35, 1526.22, 1894.45]" conf="0.85" pg_no="86">Many modern information retrieval systems use search engine frameworks like Apache Lucene
and Apache Solr, which can be used to search for documents that match a user query in
a large document collection. Users can then obtain detailed information that best satisfies
their query by clicking on one or more documents returned by the search engine. However,
often the document(s) returned by the search engine do not fully satisfy users’ requirements.
This is especially the case in scholarly document search, where many users are interested in
a wide range of documents, e.g., while doing literature surveys.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations>
					<equation>
						<path bbox="[200.85, 1921.73, 660.57, 2028.72]" conf="0.28" pg_no="86">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_86_NMI8.jpg</path>
						<eq_no bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
					</equation>
				</equations>
				<figures>
					<figure>
						<path bbox="[220.32, 305.73, 1458.57, 853.58]" conf="0.9" pg_no="87">static/PhD_Final_Aman/detected_images/PhD_Final_Aman_87_DH8O.jpg</path>
						<caption bbox="[187.77, 870.06, 1508.89, 1053.64]" conf="0.88" pg_no="87">Figure 7.3: Integrating search engine module with topic models from ETD-Topics framework
for document recommendation. An example of a search query, its search results returned by
a BM25 based search engine, and recommended documents for one highlighted document
are shown.</caption>
					</figure>
				</figures>
				<tables/>
			</section>
			<section>
				<name bbox="[186.65, 1121.79, 1501.94, 1281.29]" conf="0.87" pg_no="87">7.4.2 Integrating Document Recommendation with Document Re-
trieval</name>
				<paragraphs>
					<para bbox="[174.59, 1332.77, 1527.68, 2039.39]" conf="0.95" pg_no="87">To improve user experience, a search and retrieval service is often integrated with a document
recommendation service to allow users access to a wider range of possibly relevant documents.
Traditional document recommendation systems primarily rely on historical user click logs.
Such logs can be difficult to obtain for scholarly documents such as ETDs, since many
ETD-related services are offered by university libraries which have a smaller user base as
compared to commercially available services. Document recommendation services in such
scenarios hence need to be supported using auxiliary information that does not rely on user
logs.
Our framework supports document recommendation using the document representations
learned from topic models, which can be used to find semantically similar documents. Since</para>
					<para bbox="[170.17, 276.29, 1522.84, 902.14]" conf="0.94" pg_no="88">the topic modeling based representations can be learned in an unsupervised way, our framework does not
require large amounts of user logs to support such services. An overview of this integration
is illustrated in Fig. 7.3. After a user sends a search query to the search engine, the
search engine returns a set of documents presumed relevant to the query. A user can then
select a document of interest to obtain more information about that document. This is
further integrated with a document recommendation module that utilizes the topic models
to first obtain the document representation, and then uses an approximate nearest neighbors
technique to compute a list of similar documents. This list is then displayed to the user as
recommended documents.</para>
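					<para bbox="[0, 0, 0, 0]" conf="0" pg_no="0">The recommendation step can be sketched as a nearest neighbor search over the document-topic vectors. For simplicity, the example below uses exact brute-force search with cosine similarity rather than the approximate nearest neighbors technique mentioned above; the similarity measure and the function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def recommend(doc_topic, doc_id, n=5):
    """Return the indices of the n documents most similar to doc_id.

    doc_topic: M x K document-topic matrix as a list of rows.
    The selected document itself is excluded from the results.
    """
    query = doc_topic[doc_id]
    scored = [
        (cosine(query, row), i)
        for i, row in enumerate(doc_topic)
        if i != doc_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:n]]
```

Brute-force search compares the query against every document, so a large collection would call for an approximate index instead; the interface (a document id in, a ranked list of document ids out) would stay the same.</para>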
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[187.43, 985.76, 1456.36, 1068.98]" conf="0.83" pg_no="88">7.4.3 Extending Topic Modeling from Documents to Chapters</name>
				<paragraphs>
					<para bbox="[175.68, 1098.94, 1525.01, 1655.99]" conf="0.95" pg_no="88">The CS 5604 class in Fall 2022 at Virginia Tech worked on an integrated system for ETDs
that will support APIs for several services such as search, question-answering, chapter segmentation,
chapter summarization, etc. [7]. In the future, using the segmented chapters and
their corresponding text for the ETD corpus, chapter-level topics could be extracted using
the pre-trained topic models, by means of the framework proposed in this work. The
end-user services proposed in this chapter, such as document recommendation and searching
documents by topics, can then be offered at the level of chapters to support chapter
recommendation and searching chapters by topics. This remains future work.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[179.17, 1753.54, 780.46, 1837.66]" conf="0.83" pg_no="88">7.5 Further Evaluation</name>
				<paragraphs>
					<para bbox="[183.33, 1884.75, 1516.17, 2036.52]" conf="0.88" pg_no="88">Further evaluation of some of the components proposed in this chapter, such as evaluating
the quality of recommended documents, requires user studies. The system developed by the</para>
					<para bbox="[167.07, 275.89, 1524.42, 511.45]" conf="0.82" pg_no="89">CS 5604 class is aimed at supporting user studies, and in the future can be extended and
used to evaluate such components. These user studies, however, will be conducted by other
graduate students and their results will be included as part of their research.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
		</chapter>
		<chapter>
			<title bbox="[167.11, 274.59, 600.25, 599.93]" conf="0.86" pg_no="90">Chapter 8
Conclusion</title>
			<sections/>
			<section>
				<name bbox="[0, 0, 0, 0]" conf="0" pg_no="0"/>
				<paragraphs/>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[176.02, 738.95, 594.31, 819.61]" conf="0.79" pg_no="90">8.1 Conclusion</name>
				<paragraphs>
					<para bbox="[164.27, 910.24, 1533.31, 2026.33]" conf="0.95" pg_no="90">This dissertation aims to address the needs of digital library users by developing datasets,
techniques, and systems for analyzing and navigating long documents, such as ETDs. Since
end-user services in a digital library often rely on NLP models that require data in a machine-
friendly format, a significant part of this research aims to address the problem of document
parsing, by means of object detection based layout analysis methods. As a use-case for the
extracted data and to address the problem of limited training data for supporting end-user
services such as document browsing and recommendation, we also present a topic-modeling
based framework for ETDs.
In summary, the contributions of this research are as follows.
• We develop datasets for training object detection based layout analysis methods for
long scholarly documents. These datasets cover a range of document types, such as
born-digital and scanned documents. They could also be useful for layout analysis on
other types of documents, such as books and patents, due to an overlapping set of
object types such as figures, tables, and paragraphs. Hence, we expect these datasets
to be a valuable resource for the document understanding community.
• We develop methodologies for document parsing and information extraction from long</para>
					<para bbox="[205.35, 277.56, 1529.38, 901.57]" conf="0.94" pg_no="91">scholarly documents. We hope that they will be helpful in making long documents more
accessible and reader-friendly, by supporting other end-user services such as document
search and retrieval, question-answering, and long document summarization. They
will also allow easy testing of document understanding methods, and we expect them
to be a valuable resource for the wider research community.
• To support the needs of digital library users by means of end-user services, we also
propose a topic modeling framework for document browsing and recommendation. The
unsupervised nature of topic modeling addresses the lack of a unified classification
ontology as well as the lack of data labeled by research topic.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
			<section>
				<name bbox="[179.42, 1004.04, 905.83, 1089.47]" conf="0.82" pg_no="91">8.2 Summary of Hypotheses</name>
				<paragraphs>
					<para bbox="[169.15, 1135.17, 1531.66, 2040.69]" conf="0.94" pg_no="91">In this section, we give a brief summary of the hypotheses listed in Chapter 1, and the results
obtained in their evaluation. For hypotheses that remain to be evaluated, an evaluation plan
is summarized.
• H1: Object detection based document layout analysis methods for long scholarly documents,
trained on high quality domain-specific labeled data, perform better than those
trained on a larger dataset originating from other related domains, such as research
papers.
Status: True.
Explanation: As shown in Table 3.4, models trained on a smaller dataset of objects
originating from ETDs perform better than those trained on a larger dataset of
objects from research papers, like DocBank.
• H2: Pre-training on other scholarly datasets, albeit from a different domain such as
research papers, improves the performance of document layout analysis methods on</para>
					<para bbox="[191.82, 286.62, 1541.86, 2036.41]" conf="0.93" pg_no="92">long scholarly documents such as ETDs.
Status: True.
Explanation: As shown in Table 3.2, models like Faster-RCNN* that are pre-trained
on other scholarly datasets and then fine-tuned on the ETD dataset perform better than
those that were not pre-trained.
• H3: Training on derived datasets, such as augmented versions of the original training
data, can significantly improve the performance of layout analysis models.
Status: True.
Explanation: As shown in Table 4.1, training on a dataset obtained by augmenting
images in the original dataset improves the object detection performance.
• H4: To perform well on other document types, such as scanned documents, object
detection models trained on a specific type of documents, such as born-digital ones,
require additional training using techniques, like augmentation, that help bridge the
distribution gap.
Status: True.
Explanation: The results in Table 4.2 show that augmentation-based training
results in significant performance improvement for layout analysis of scanned documents.
• H5: AI-aided annotation methods, such as using models trained on existing smaller
datasets to extract weak labels for unlabeled data, reduce the resources required for
annotating additional data.
Status: True.
Explanation: The comparison of annotation time for manual annotation vs. AI-aided
annotation in Figure 5.4 shows that model assistance significantly reduces
annotation time.
• H6: Models trained on datasets with skewed distribution in terms of class labels</para>
					<para bbox="[209.74, 286.19, 1539.8, 1562.82]" conf="0.9" pg_no="93">achieve better performance on minority classes when trained on additional data from
those classes, such as from AI-aided annotation methods.
True.
Status:
The mAP values of models fine-tuned on the dataset resulting out of
Explanation:
AI-aided annotation are higher than those of the initial models (i.e., the model without
fine-tuning on the new dataset), as shown in Table 5.4.
• H7: Combining the predictive power of AI models with rules formulated based on
domain expertise possessed by humans reduces errors in predictive tasks such as
document structure parsing.
Status: True.
Explanation: A case study was done on a small sample of ETDs by a team from the
CS4624 class of Spring 2023. Some of the common errors, as well as rules to remediate
them, are discussed in Section 6.3.1. Based on the findings of that study, it
was determined that post-processing rules are essential for document parsing.
• H8: Neural topic models can outperform traditional topic models, such as LDA,
when modeling topics in scholarly documents such as ETDs and their chapters.
Status: True.
Explanation: The results in Tables 7.1 and 7.2 show that neural topic models
like CTM perform better than LDA.</para>
				</paragraphs>
				<footnotes/>
				<algorithms/>
				<equations/>
				<figures/>
				<tables/>
			</section>
		</chapter>
	</body>
	<back>
		<ref_heading bbox="[170.41, 270.65, 658.67, 387.44]" conf="0.83" pg_no="94">Bibliography</ref_heading>
		<ref_text bbox="[175.61, 506.46, 1539.5, 2013.62]" conf="0.94" pg_no="94">[1] Aman Ahuja, Alan Devera, and Edward Alan Fox. Parsing electronic theses and dissertations using object detection. In Proceedings of the first Workshop on Information Extraction from Scientific Publications, pages 121–130. Association for Computational Linguistics, November 2022. URL https://aclanthology.org/2022.wiesp-1.14.
[2] Aman Ahuja, Kevin Dinh, Brian Dinh, William A Ingram, and Edward Fox. A new annotation method and dataset for layout analysis of long documents. In Companion Proceedings of the ACM Web Conference 2023, pages 834–842, 2023.
[3] Dan Anitei, Joan Andreu Sánchez, José Manuel Fuentes, Roberto Paredes, and José Miguel Benedí. ICDAR 2021 Competition on Mathematical Formula Detection. In International Conference on Document Analysis and Recognition, pages 783–795. Springer, 2021.
[4] Apostolos Antonacopoulos, David Bridson, Christos Papadopoulos, and Stefan Pletschacher. A Realistic Dataset for Performance Evaluation of Document Layout Analysis. In 2009 10th International Conference on Document Analysis and Recognition, pages 296–300. IEEE, 2009.
[5] Federico Bianchi, Silvia Terragni, and Dirk Hovy. Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 759–766, 2021.</ref_text>
		<ref_text bbox="[170.24, 284.33, 1540.92, 2025.56]" conf="0.94" pg_no="95">[6] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of machine learning research, 3(Jan):993–1022, 2003.
[7] Satvik Chekuri, Prashant Chandrasekar, Bipasha Banerjee, Sung Hee Park, Nila Masrourisaadat, Aman Ahuja, William A. Ingram, and Edward Alan Fox. Parsing electronic theses and dissertations using object detection. In Proceedings of the 23rd ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, June 2023.
[8] Muntabir Hasan Choudhury, Jian Wu, William A Ingram, and Edward A Fox. A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pages 515–516, 2020.
[9] Muntabir Hasan Choudhury, Himarsha R Jayanetti, Jian Wu, William A Ingram, and Edward A Fox. Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations. In 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 230–233. IEEE, 2021.
[10] Ran Ding, Ramesh Nallapati, and Bing Xiang. Coherence-Aware Neural Topic Modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 830–836, 2018.
[11] Kevin Dinh, Brian Dinh, Andrew Leavitt, and Annie Tran. Object Detection, 2022. URL http://hdl.handle.net/10919/114082. Virginia Tech CS4624 team term project.
[12] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[13] Maarten Grootendorst. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794, 2022.</ref_text>
		<ref_text bbox="[153.27, 289.65, 1541.18, 2029.6]" conf="0.94" pg_no="96">[14] Jaekyu Ha, R.M. Haralick, and I.T. Phillips. Recursive X-Y cut using bounding boxes of connected components. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 2, pages 952–955, 1995. doi: 10.1109/ICDAR.1995.602059.
[15] Chongyu He, Jianchi Wei, and Chenyu Mao. Textmining. 2022. URL http://hdl.handle.net/10919/109986. Virginia Tech CS4624 team term project.
[16] Thomas Hofmann. Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, pages 50–57. ACM, 1999.
[17] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387, 2022.
[18] Glenn Jocher, Ayush Chaurasia, Alex Stoken, Jirka Borovec, NanoCode012, Yonghye Kwon, TaoXie, Kalen Michael, Jiacong Fang, Imyhxy, Lorna, Colin Wong, Zeng Yifu, Abhiram V, Diego Montes, Zhiqiang Wang, Cristi Fati, Jebastin Nadar, Laughing, UnglvKitDe, Tkianai, YxNONG, Piotr Skalski, Adam Hogan, Max Strobel, Mrinal Jain, Lorenzo Mammana, and Xylieong. ultralytics/yolov5: v6.2 - YOLOv5 Classification Models, Apple M1, Reproducibility, ClearML and Deci.ai integrations, 2022. URL https://zenodo.org/record/7002879.
[19] Sampanna Yashwant Kahu, William A. Ingram, Edward A. Fox, and Jian Wu. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. In 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 180–191. IEEE Computer Society, 2021.</ref_text>
		<ref_text bbox="[158.31, 287.57, 1545.39, 2044.84]" conf="0.95" pg_no="97">[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
[21] Diederik P Kingma and Max Welling. Auto-encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[22] Frank Lebourgeois, Zbigniew Bublinski, and Hubert Emptoz. A fast and efficient method for extracting text paragraphs and graphics from unconstrained documents. In 11th IAPR International Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition Methodology and Systems, volume 1, pages 272–273. IEEE Computer Society, 1992.
[23] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. TableBank: A Benchmark Dataset for Table Detection and Recognition. In Proceedings of The 12th language resources and evaluation conference, pages 1918–1925, 2020.
[24] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. DocBank: A Benchmark Dataset for Document Layout Analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 949–960, 2020.
[25] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.
[26] Patrice Lopez et al. GROBID. https://github.com/kermitt2/grobid, 2008–2022.
[27] Yishu Miao, Lei Yu, and Phil Blunsom. Neural Variational Inference for Text Processing. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd</ref_text>
		<ref_text bbox="[158.88, 289.76, 1543.49, 2030.35]" conf="0.94" pg_no="98">International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1727–1736, New York, New York, USA, 20–22 Jun 2016. PMLR.
[28] Yishu Miao, Edward Grefenstette, and Phil Blunsom. Discovering Discrete Latent Topics with Neural Variational Inference. In International Conference on Machine Learning, pages 2410–2419. PMLR, 2017.
[29] Aniket Prabhune and Edward A Fox. XML for ETDs. Technical report, Department of Computer Science, Virginia Polytechnic Institute &amp; State University, 2002.
[30] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[31] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410.
[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Advances in neural information processing systems, 28, 2015.
[33] Lamia Salsabil, Jian Wu, Muntabir Hasan Choudhury, William A Ingram, Edward A Fox, Sarah M Rajtmajer, and C Lee Giles. A Study of Computational Reproducibility using URLs Linking to Open Access Datasets and Software. In Companion Proceedings of the Web Conference 2022, pages 784–788, 2022.</ref_text>
		<ref_text bbox="[159.86, 285.08, 1540.37, 2026.95]" conf="0.95" pg_no="99">[34] Zejiang Shen, Kaixuan Zhang, and Melissa Dell. A large dataset of historical Japanese documents with complex layouts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 548–549, 2020.
[35] Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. LayoutParser: A unified toolkit for deep learning based document image analysis. In International Conference on Document Analysis and Recognition, pages 131–146. Springer, 2021.
[36] Akash Srivastava and Charles Sutton. Autoencoding Variational Inference for Topic Models. In 5th International Conference on Learning Representations, 2017.
[37] Silvia Terragni, Elisabetta Fersini, Bruno Giovanni Galuzzi, Pietro Tropeano, and Antonio Candelieri. OCTIS: Comparing and Optimizing Topic models is Simple! In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 263–270, 2021.
[38] Dominika Tkaczyk, Paweł Szostek, Mateusz Fedoryszak, Piotr Jan Dendek, and Łukasz Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 18(4):317–335, 2015.
[39] Sami Uddin, Bipasha Banerjee, Jian Wu, William A Ingram, and Edward A Fox. Building A Large Collection of Multi-domain Electronic Theses and Dissertations. In 2021 IEEE International Conference on Big Data (Big Data), pages 6043–6045. IEEE, 2021.
[40] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696, 2022.</ref_text>
		<ref_text bbox="[158.13, 285.96, 1539.87, 2028.47]" conf="0.95" pg_no="100">[41] Lucy Lu Wang, Isabel Cachola, Jonathan Bragg, Evie Yu-Yen Cheng, Chelsea Haupt, Matt Latzke, Bailey Kuehl, Madeleine N van Zuylen, Linda Wagner, and Daniel Weld. SciA11y: Converting Scientific Papers to Accessible HTML. In The 23rd International ACM SIGACCESS Conference on Computers and Accessibility, pages 1–4, 2021.
[42] Papers with Code. Real-Time Object Detection on COCO. https://paperswithcode.com/sota/real-time-object-detection-on-coco, 2022.
[43] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[44] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
[45] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2579–2591, 2021.
[46] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, pages 1192–1200, 2020.
[47] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE, 2019.</ref_text>
		<ref_text bbox="[164.01, 268.77, 1532.79, 741.42]" conf="0.87" pg_no="101">[48] Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, and Sandeep Tata. Radically lower data-labeling costs for visually rich document extraction models. CoRR, abs/2210.16391, 2022. doi: 10.48550/arXiv.2210.16391.
[49] Kecheng Zhu, Zachary Gager, Shelby Neal, Jiangyue Li, and You Peng. Object Detection, 2022. URL http://hdl.handle.net/10919/109979. Virginia Tech CS4624 team term project.</ref_text>
	</back>
</etd>
