Virginia Tech CS4984: Computational Linguistics

Instructor: Edward A. Fox

Description:

This course is supported by the National Science Foundation grant Computing in Context (NSF DUE-1141209), through a resulting subaward from Villanova to Virginia Tech. It gives students the opportunity to engage in active learning about how to work with large collections of text, one aspect of 'big data'.

An 11-node Hadoop cluster, along with other tailored computing resources, will aid handling of over 500 million tweets and over 11 terabytes of webpages. Using methods employed in search engines, including linguistic analysis and natural language processing, as well as statistical techniques, students will engage in problem-based learning. The semester-long challenge is to analyze content collections automatically, extract key information, and generate easily readable English summaries of important events. Just-in-time learning will build an understanding of concepts, techniques, and toolkits, so that students master the key methods of computational linguistics (CL).

Instructor:

Professor Edward A. Fox, fox @ vt.edu, http://fox.cs.vt.edu, 540-231-5113

Prerequisites:

Senior standing in CS, or instructor permission

Topics:

Lexical, syntactic, semantic, discourse, and statistical analysis of texts
Automatic text generation
Natural Language Toolkit
Tweet and webpage analysis
Indexing (stopwords, stemming/lemmatization, morphology, phrases)
Named entity recognition and extraction
Ontology building and utilization
Cluster-based processing with Hadoop, Solr, and other tools
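
As an illustration of the indexing topic above (stopwords, stemming), here is a minimal Python sketch of an inverted index. It is not course material: the stopword list, the crude_stem function, and the sample documents are hypothetical stand-ins invented for this example. In practice a toolkit such as NLTK would supply a fuller stopword list and a proper stemmer.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "are", "to", "was"}


def crude_stem(token):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Lowercase, tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents, standing in for tweets.
docs = {
    "t1": "The hurricane flooded the coastal roads",
    "t2": "Flooding closed roads in the city",
}
index = build_inverted_index(docs)
print(sorted(index["flood"]))  # both documents share the stem "flood"
```

Because "flooded" and "flooding" reduce to the same term, a query for either form would retrieve both documents, which is the point of stemming at index time.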

Evaluation:

70% team term project, the sum of 35% modules (focused on iterative refinement of term project solutions), 10% final presentation, and 25% project report (released in VTechWorks), with adjustment based on team peer assessment
10% midterm exam
20% final exam

Different Aspects of the Common Project:

All students will work with some portion of the 11TB of webpages and the 500M tweets collected in connection with the NSF-funded IDEAL project.
Students will work in groups of 4-5, preferably each group having people covering a mix of skills, e.g., Python experience, exposure to linguistics.
Each group will pick a particular class of events, e.g., hurricane, earthquake, political election.
Each group will automatically (i.e., with appropriate tools or programs) identify relevant parts of the available content, and implement ways to generate summaries for instances of their chosen class of events.
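
To make the summarization task above concrete, here is a minimal Python sketch of one naive approach: score each sentence by the average collection frequency of its content words, and keep the top-scoring sentences in their original order. This is only one possible baseline, not the method the course prescribes; the stopword list, function names, and sample text are assumptions made up for this example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "and",
             "is", "was", "my", "all", "said"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


# Hypothetical input text about an event.
text = ("The earthquake struck the city at dawn. "
        "Officials said the earthquake damaged many buildings in the city. "
        "My cat slept all day.")
print(summarize(text, max_sentences=2))
```

A baseline like this gives a group something to measure later, more linguistically informed versions against.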

Tools:

Students will learn how to use each of the key, commonly employed CL tools.
Each tool will be introduced when it is needed.
Learning about a tool will be aided by a module, like those used in the Digital Library Curriculum project. Each module will refer to YouTube videos/lectures, tutorials, papers, primers, etc.
Tools also will include those used for webpage and tweet processing.
Tools also will include those used in our Hadoop cluster.
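
For readers new to Hadoop-style processing, the following Python sketch imitates the map, shuffle, and reduce phases of a word count on a single machine. It does not use Hadoop itself and is not taken from the course modules; the function names and sample tweets are hypothetical. On a real cluster, for example with Hadoop Streaming, the mapper and reducer would typically be separate scripts reading from standard input, and the framework would perform the shuffle.

```python
import re
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every token in one input line."""
    for word in re.findall(r"[a-z0-9#@']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input, standing in for a file of tweets.
tweets = ["Storm hits coast #storm", "Storm warning for the coast"]
counts = word_count(tweets)
print(counts["storm"], counts["coast"])
```

Keeping the mapper and reducer free of shared state is what lets a cluster run many copies of each in parallel over different parts of the collection.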

Prototypes, Iterative Refinement:

Students will devise a rapid prototype with naive assumptions in the first two weeks of the course.
Students will implement a series of ever better versions during the course.
Each version will be more complex and yield higher quality results.
Thus, they will rapidly achieve an initial working solution, and will see how to improve it in stages, achieving useful intermediate goals along the way.

Programming:

Students will use NLTK and program in Python.
Students will learn high-level languages used with the various tools.

References:

Textbook: Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly, 2009. ISBN: 0596516495. Free version at http://www.nltk.org/book3/. See also http://shop.oreilly.com/product/9780596516499.do and http://www.nltk.org/
Free book: Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool, 2010. 177 pages. ISBN: 9781608453429. DOI: 10.2200/S00274ED1V01Y201006HLT007. http://dx.doi.org/10.2200/S00274ED1V01Y201006HLT007. Note that all of the M&C books can be freely downloaded if accessed on campus, or through the off-campus library sign-in at http://www.lib.vt.edu.
Other references will be used as appropriate, each discussed in the related curricular modules.

Connection with Ensemble:

Through this course, students will learn more about using online educational resources.
Further, this course will produce a collection in Ensemble (computingportal.org).
This collection will be usable by others who want to learn more about computational linguistics, as well as those who will teach CL.
Instructors should be able to easily tailor a new course from the collection of educational resources.
This collection also will be part of the Digital Library Curriculum, previously funded by NSF, and will be accessible in Wikiversity as well.

Logistics for Fall 2014:

CRN: 88630; CS-4984; Title: SS:Computational Linguistics
M W 4-5:15pm, Randolph 120; enrollment expected: 35
Final 16M: Dec. 16, 3:25-5:25pm

Last updated 7/4/2014