CS5604 - Information Retrieval - Fall 2023
Alternate title: Search Engines and Text Mining with Big Data
Alternate title: Applied Machine Learning (Underlying Google)
Why take CS5604?
- To prepare you for working at Google, Microsoft, or any company
involved in machine learning, text analytics, searching, and/or WWW.
- To prepare you for research involving search engines,
natural language processing, LLMs, text mining, classification,
clustering, indexing, recommendation/personalization,
information extraction/seeking/exploration, etc.
- To gain proficiency with the latest software engineering practices,
including containers, Docker, Kubernetes, CI/CD.
Resources
References
- Textbook:
Introduction to Information Retrieval by Christopher D.
Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008, 496 pages,
Cambridge University Press, ISBN-10: 0521865719, ISBN-13:
978-0521865715. See also online versions, slides, etc.
- Free (through Library download), recommended:
ChengXiang Zhai and Sean Massung. 2016. Text Data Management and
Analysis: a Practical Introduction to Information Retrieval and Text
Mining. Association for Computing Machinery and Morgan & Claypool, New
York, NY, USA.
- VTechWorks reports from CS5604 projects
- Selected related works among the
VTechWorks reports from CS4624 projects
- Dhanush Dinesh's report on scalable bulk processing
- From Aman Ahuja's doctoral research, a demonstration of HTML results and
related user manual
Course Organization
- CS5604 Fall 2023 class: CRN 83500, TuTh 9:30-10:45am, Surge 103A
- Prerequisite: a course on data structures,
or permission of instructor
- Approach: problem/project-based learning, teams, online, flipped classroom
- Goal: answer the following question: How can we best build
a state-of-the-art information retrieval and analysis system in
support of the communities interested in
all the nation's electronic theses/dissertations (ETDs)?
This work is related to
an IMLS grant to VT and ODU for 8/1/2019 - 7/31/2023.
- The students in the class will confront this driving question,
working in teams, with the teams cooperating, as they co-design
a working system that can handle the collection.
- There will be teams covering topics such as
ingesting content,
indexing and searching,
recommendation,
document analysis with object detection,
NLP and large language models,
summarization,
classification,
question-answering with a knowledge base,
clustering, topic analysis,
UX/interface (including for accessibility and with chat) development,
and usability testing.
- The instructor and several GRAs working on related research
will provide guidance and assistance.
- This is one of the courses leading to an
XCaliber Award "for making extraordinary contributions to technology
enriched active learning".
About the Instructor
- Professor Edward A. Fox, fox@vt.edu, 540-231-5113, Torg. 2160G.
Office hours are Tue 11-5, or by appointment.
- Dr. Fox is an ACM Fellow as well as an IEEE Fellow, for
contributions and leadership in information retrieval
and digital libraries.
- Dr. Fox's 1983 Ph.D. at Cornell University was supervised by
Prof. Gerard Salton, often called "the father of information
retrieval".
- GTA: Xiao Liang, xliangvt@vt.edu
Author: Edward A. Fox
Curator: Virginia Tech; Dept. of Computer Science
Last Updated: August 30, 2023
Email: fox@vt.edu
© Edward A. Fox 2023