Virginia Tech CS4984: Computational Linguistics
Instructor: Edward A. Fox
Description:
With support from a National Science Foundation grant,
Computing in Context (NSF DUE-1141209),
and the resulting subaward from Villanova to Virginia Tech,
this course will give students the opportunity to engage in
active learning about how to work with large collections of text,
one aspect of 'big data'.
An 11-node Hadoop cluster, along with other tailored computing resources,
will aid handling of over 500 million tweets and over 11 terabytes
of webpages.
Using methods employed in search engines,
including linguistic analysis and natural language processing,
as well as statistical techniques, students will engage in problem-based
learning with the semester-long challenge of analyzing
content collections automatically, extracting key information,
and generating easily readable summaries of important events in English.
Just-in-time learning will allow development of an understanding of
concepts, techniques, and toolkits so students will master the key
methods related to computational linguistics (CL).
Instructor:
Professor Edward A. Fox, fox @ vt.edu,
http://fox.cs.vt.edu, 540-231-5113
Prerequisites:
senior standing in CS, or instructor permission
Topics:
- Lexical, syntactic, semantic, discourse, and statistical analysis of texts
- Automatic text generation
- Natural Language Toolkit
- Tweet and webpage analysis
- Indexing (stopwords, stemming/lemmatization, morphology, phrases)
- Named entity recognition and extraction
- Ontology building and utilization
- Cluster-based processing with Hadoop, Solr, and other tools
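As a rough illustration of the indexing topic above (stopwords, stemming), the following minimal sketch builds a small inverted index using only the Python standard library. The stopword list, the suffix-stripping rule, and the function names are hypothetical stand-ins chosen for this example; the course itself uses NLTK and other tools, which provide far more complete resources.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list; real toolkits ship larger ones.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "is", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stemmed, non-stopword term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token not in STOPWORDS:
                index[naive_stem(token)].add(doc_id)
    return index


# Made-up sample documents, for illustration only.
docs = {
    1: "The hurricane flooded the coastal towns.",
    2: "Flooding continued in the towns after the storm.",
}
index = build_index(docs)
```

Here "flooded" and "Flooding" both reduce to the same index term, so a query for either form would retrieve both documents, while stopwords such as "the" never enter the index.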
Evaluation:
- 70% team term project, with adjustment based on team peer assessment (sum of:
35% modules, focused on iterative refinement of term project solutions;
10% final presentation;
25% project report, released in VTechWorks)
- 10% midterm exam
- 20% final exam
Different Aspects of the Common Project:
- All students will work with some portion of the 11TB of webpages
and the 500M tweets collected in connection with the NSF-funded
IDEAL project.
- Students will work in groups of 4-5, preferably each group having
people covering a mix of skills, e.g., Python experience, exposure
to linguistics.
- Each group will pick a particular class of events, e.g., hurricane,
earthquake, political election.
- Each group will automatically (i.e., with appropriate tools or programs)
identify relevant parts of the available content, and implement ways to
generate summaries for instances of their chosen class of events.
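The two steps in the last item above (identifying relevant content, then summarizing it) might be sketched as follows. This is only one naive possibility, assuming keyword matching for relevance and word-frequency scoring for sentence selection; the function names and sample texts are invented for illustration and are not the course's prescribed method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_relevant(docs, keywords):
    """Keep documents that mention at least one keyword for the chosen event class."""
    keywords = set(keywords)
    return [d for d in docs if keywords & set(tokenize(d))]


def summarize(docs, max_sentences=2):
    """Select the sentences whose words are, on average, most frequent in docs."""
    sentences = [
        s.strip()
        for d in docs
        for s in re.split(r"(?<=[.!?])\s+", d)
        if s.strip()
    ]
    freq = Counter(t for s in sentences for t in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in top]


# Made-up sample collection, for illustration only.
collection = [
    "An earthquake struck the city. Buildings collapsed.",
    "The election results were announced today.",
    "Rescue teams reached the city after the earthquake.",
]
relevant = filter_relevant(collection, {"earthquake"})
summary = summarize(relevant, max_sentences=2)
```

Frequency-based scoring favors sentences that repeat the collection's dominant vocabulary, which is a common baseline but ignores redundancy between the chosen sentences.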
Tools:
- Students will learn how to use each of the key commonly employed CL tools.
- They will learn them when they are needed.
- Learning about a tool will be aided by a module, like those used in the
Digital Library Curriculum project. Each module will refer to YouTube
videos/lectures, tutorials, papers, primers, etc.
- Tools also will include those used for webpage and tweet processing.
- Tools also will include those used in our Hadoop cluster.
Prototypes, Iterative Refinement:
- Students will devise a rapid prototype with naive assumptions in the
first two weeks of the course.
- Students will implement a series of ever better versions during the course.
- Each version will be more complex and yield higher quality results.
- Thus, they will rapidly achieve a working result, and then see how to improve
it in stages, achieving useful intermediate goals along the way.
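To make the idea of a rapid prototype with naive assumptions concrete, here is a minimal sketch of what a first version could look like: it fills a fixed sentence template with the most frequently matched place and year. The regular expressions, the template, and the function names are all hypothetical choices for this example; later versions would replace each naive assumption with a better technique, such as named entity recognition.

```python
import re
from collections import Counter


def most_common_match(pattern, texts):
    """Return the most frequent regex match across texts, or None if nothing matches."""
    counts = Counter(m for t in texts for m in re.findall(pattern, t))
    return counts.most_common(1)[0][0] if counts else None


def naive_summary(event_type, texts):
    """Fill a one-sentence template from the most frequent place and year."""
    # Naive assumption: the place is the capitalized word that most often follows "in".
    place = most_common_match(r"\bin ([A-Z][a-z]+)", texts)
    # Naive assumption: the year is the most frequent 4-digit number starting with 19 or 20.
    year = most_common_match(r"\b((?:19|20)\d{2})\b", texts)
    return (
        f"A {event_type} occurred in {place or 'an unknown location'} "
        f"in {year or 'an unknown year'}."
    )


# Made-up sample texts, for illustration only.
texts = [
    "Flooding in Houston after the hurricane in 2017.",
    "Hurricane damage in Houston was severe.",
    "Storm reported in Texas in 2017.",
]
sentence = naive_summary("hurricane", texts)
```

Each naive assumption is isolated in its own pattern, so a later iteration can swap in a more reliable extractor without changing the template step.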
Programming:
- Students will use NLTK and program in Python.
- Students will learn high-level languages used with the various tools.
Connection with Ensemble:
- Through this course, students will learn more about using online
educational resources.
- Further, from this course will come a collection in Ensemble
(computingportal.org).
- This collection will be usable by others who want to learn more
about computational linguistics, as well as those who will teach CL.
- Instructors should be able to easily tailor a new course from
the collection of educational resources.
- This collection also will be a part of the Digital Library Curriculum,
previously funded by NSF, and also accessible in Wikiversity.
Logistics for Fall 2014:
- CRN: 88630; CS-4984; Title: SS:Computational Linguistics
- M W 4-5:15pm, Randolph 120; enrollment expected: 35
- Final 16M: Dec. 16, 3:25-5:25pm
Last updated 7/4/2014