CS5604 - Information Retrieval - Fall 2023

Alternate title: Search Engines and Text Mining with Big Data

To prepare you for working at Google, Microsoft, or any company involved in machine learning, text analytics, searching, and/or WWW.
To prepare you for research involving search engines, natural language processing, LLMs, text mining, classification, clustering, indexing, recommendation/personalization, information extraction/seeking/exploration, etc.
To gain proficiency with the latest software engineering practices, including containers, Docker, Kubernetes, CI/CD.

Very large document collection: Electronic theses and dissertations from around the nation and beyond, in collaboration with an IMLS-funded project and University Libraries (see local story)
CS container cluster with GPUs: https://launch.cs.vt.edu, explained in the Guide: https://wiki.cs.vt.edu/index.php/Cloud_Quickstart
CS VM cluster: http://csrvm.cs.vt.edu/
Docker Hub containers: http://hub.docker.com, Elasticsearch, Kibana, Python toolkits, NLTK, and other tools and libraries as needed

Textbook: Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008, 496 pages, Cambridge University Press, ISBN-10: 0521865719, ISBN-13: 978-0521865715. See also online versions, slides, etc.
Free (through Library download), recommended: ChengXiang Zhai and Sean Massung. 2016. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. Association for Computing Machinery and Morgan & Claypool, New York, NY, USA.
VTechWorks reports from CS5604 projects
Selected related works among the VTechWorks reports from CS4624 projects
Dhanush Dinesh's report on scalable bulk processing
From Aman Ahuja's doctoral research, a demonstration of HTML results and related user manual

CS5604 Fall 2023 class: CRN 83500, TuTh 9:30-10:45am, Surge 103A
Prerequisite: a course on data structures, or permission of instructor
Approach: problem/project based learning, teams, online, flipped classroom
Goal: answer the following question: How can we best build a state-of-the-art information retrieval and analysis system in support of the communities interested in all the nation's electronic theses/dissertations (ETDs)? This work is related to an IMLS grant to VT and ODU for 8/1/2019 - 7/31/2023.
The students in the class will confront this driving question, working in teams, with the teams cooperating, as they co-design a working system that can handle the collection.
There will be teams covering topics such as ingesting content, indexing and searching, recommendation, document analysis with object detection, NLP and large language models, summarization, classification, question-answering with a knowledge base, clustering, topic analysis, UX/interface development (including accessibility and chat), and usability testing.
The instructor and several GRAs working on related research will provide guidance and assistance.
This is one of the courses leading to an XCaliber Award "for making extraordinary contributions to technology enriched active learning".

Professor Edward A. Fox, fox@vt.edu, 540-231-5113, Torg. 2160G. Office hours are Tue 11-5, or by appointment.
Dr. Fox is an ACM Fellow as well as an IEEE Fellow, for contributions and leadership in information retrieval and digital libraries.
Dr. Fox's 1983 Ph.D. at Cornell University was supervised by Prof. Gerard Salton, often called "the father of information retrieval".
GTA: Xiao Liang, xliangvt@vt.edu

Author: Edward A. Fox (CV, directions, hours, photo)
Curator: Virginia Tech; Dept. of Computer Science
Last Updated: August 30, 2023
Email: fox@vt.edu