CS5604 - Information Retrieval
Alternate title: Search Engines and Text Mining
Alternate title: Applied Machine Learning (Underlying Google)
Why take CS5604?
- To prepare you for working at Google, Microsoft, or any company
involved in machine learning, text analytics, search, and/or the WWW.
- To prepare you for research involving search engines, social
media, natural language processing, text mining, classification,
clustering, indexing, recommendation/personalization, web archiving, and/or information extraction/seeking/exploration.
- To gain proficiency with parallel processing on clusters with big nodes.
Resources
- 20+ node Hadoop Cluster with 10Gbit network connection
- Cloudera software including HBase, HDFS, Hive, Mahout,
MapReduce, Nutch, Pig, Solr, Spark, and Sqoop
- MeTA, NLTK, and Python toolkits
References
- Textbook:
Introduction to Information Retrieval by Christopher D.
Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008, 496 pages,
Cambridge University Press, ISBN-10: 0521865719, ISBN-13:
978-0521865715. See also online versions, slides, etc.
- Free (through Library download), recommended:
ChengXiang Zhai and Sean Massung. 2016. Text Data Management and
Analysis: a Practical Introduction to Information Retrieval and Text
Mining. Association for Computing Machinery and Morgan & Claypool, New
York, NY, USA.
- VTechWorks reports from CS5604 projects
Course Organization
- CS5604 Fall 2017 class: CRN 82613, TR, 3:30-4:45pm, NCB 210, 15T
- Approach: problem/project based learning
- Goal: answer the following question: How can we best build
a state-of-the-art information retrieval and analysis system in
support of the GETAR (Global Event and Trend Archive Research)
project (NSF IIS grant 1619028)?
- The students in the class will confront this driving question,
working in teams, with the teams cooperating, but with each team
focused on a particular collection of data (made up of both webpages
and tweets, relating to a particular event or trend).
- The instructor and several GRAs working on related research
will provide guidance and assistance.
- This is one of the courses leading to an
XCaliber Award "for making extraordinary contributions to technology
enriched active learning".
About the Instructor
- Professor Edward A. Fox, fox@vt.edu, 540-231-5113, Torg. 2160G.
Office hours are Wed 2-5, Thu 1-3, or by appointment.
- Dr. Fox is an IEEE Fellow, recognized for leadership in digital libraries and information retrieval.
- Dr. Fox's 1983 Ph.D. at Cornell University was supervised by
Prof. Gerard Salton, often called "the father of information
retrieval".
- GRAs, working in 2030 Torg.:
- Liuqing Li, liuqing@vt.edu
- Xuan Zhang, xuancs@vt.edu
Author: Edward A. Fox
Curator: Virginia Tech; Dept. of Computer Science
Last Updated: August 31, 2017
Email: fox@vt.edu
© Edward A. Fox 2017