CS5604 - Information Retrieval - Fall 2019
Alternate title: Search Engines and Text Mining with Big Data
Alternate title: Applied Machine Learning (Underlying Google)
Why take CS5604?
- To prepare you for working at Google, Microsoft, or any company
involved in machine learning, text analytics, searching, and/or WWW.
- To prepare you for research involving search engines,
natural language processing, text mining, classification,
clustering, indexing, recommendation/personalization,
information extraction/seeking/exploration, social media,
and/or web archiving.
- To gain proficiency with parallel processing on clusters with big
data nodes.
Resources
- 20+ node Hadoop Cluster with 10Gbit network connection
- Cloudera software including HBase, HDFS, Hive, Mahout,
MapReduce, Nutch, Pig, Solr, Spark, Sqoop
- MeTA, NLTK, and Python toolkits
References
- Textbook:
Introduction to Information Retrieval by Christopher D.
Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008, 496 pages,
Cambridge University Press, ISBN-10: 0521865719, ISBN-13:
978-0521865715. See also online versions, slides, etc.
- Free (through Library download), recommended:
ChengXiang Zhai and Sean Massung. 2016. Text Data Management and
Analysis: a Practical Introduction to Information Retrieval and Text
Mining. Association for Computing Machinery and Morgan & Claypool, New
York, NY, USA.
- VTechWorks reports from CS5604 projects
Course Organization
- CS5604 Fall 2019 class: CRN 82915, TR, 3:30-4:45pm, McB 231, 15T
- Approach: problem/project based learning, flipped classroom
- Goal: answer the following question: How can we best build
a state-of-the-art information retrieval and analysis system in
support of the communities interested in each of the following
collections?
- All the nation's electronic theses/dissertations (ETDs),
related to an IMLS grant to VT and ODU for 8/1/2019 - 7/31/2022
- Big business and the addiction crisis (starting with
14M documents related to tobacco companies and legal suits)
- The students in the class will confront this driving question,
working in teams, with the teams cooperating, as they co-design
a working system that can handle the two collections.
- There will be teams for ingesting content, indexing and
searching (with Elasticsearch), clustering, topic analysis,
and UX/interface development
- The instructor and several GRAs working on related research
will provide guidance and assistance.
- This is one of the courses leading to an
XCaliber Award "for making extraordinary contributions to technology
enriched active learning".
About the Instructor
- Professor Edward A. Fox, fox@vt.edu, 540-231-5113, Torg. 2160G.
Office hours are Tue/Thu 12:30-3, or by appointment.
- Dr. Fox is an ACM Fellow as well as an IEEE Fellow: for
contributions and leadership in information retrieval and
digital libraries.
- Dr. Fox's 1983 Ph.D. was supervised at Cornell University by
Prof. Gerard Salton, often called "the father of information
retrieval".
- GRA, working in 2030 Torg.: Ziqian Song, ziqian@vt.edu
Author: Edward A. Fox
Curator: Virginia Tech; Dept. of Computer Science
Last Updated: August 1, 2019
Email: fox@vt.edu
© Edward A. Fox 2019