LING 83800: Methods in
Computational Linguistics II

Spring 2011
Thursday 11:45am - 1:45pm
GC 7394
Instructor: Andrew Rosenberg (andrew_at_cs.qc.cuny.edu)
Office Hours: By Appointment GC 4420

Course Description

This is the second of a two-part course sequence that trains students with a linguistics background in the core methodologies of computational linguistics. Successful completion of this two-course sequence will enable students to take graduate-level elective courses in computational linguistics, both those offered by the Graduate Center's Linguistics Program and those offered by the Computer Science Program. This course will provide training in the use of computational libraries built specifically for computational linguistics; the techniques used in performing computational analyses of electronic natural language corpora; and the foundational mathematics, probabilistic methods, and statistics that are the backbone of modern computational linguistics. The course will go significantly beyond a survey of these topics. By completing the Methods in Computational Linguistics sequence, Computational Linguistics Master's students will, at the end of the first year, have the skills they need to engage in further study of state-of-the-art topics in natural language processing.
Successful completion of Methods in Computational Linguistics I is a prerequisite for this course.

Learning Outcomes

Upon successful completion of this course, a student can expect to:

  1. Be able to extract and analyze statistics from text corpora
  2. Understand foundational tasks in Computational Linguistics -- tagging, parsing, segmentation
  3. Have the ability to comprehend a Computational Linguistics conference paper (e.g., ACL).

Textbook

Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper. O'Reilly.
ISBN: 978-0596516499
Also available online at: http://www.nltk.org/book
This should be available through the bookstore, but may be found through other outlets at a discount.

Class Policies

Come to Class. It will be difficult to do well in the class without regular attendance. There is no penalty for missing up to 3 classes. Each missed class beyond 3 will reduce the maximum Attendance and Participation grade by 1%, down to a minimum of 5%. (To get 5 points of Participation while missing more than 8 classes, you'd better be doing something outrageous when you're there.)
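
As a rough illustration of the attendance rule, the sketch below computes the cap on the Attendance and Participation grade from the number of missed classes. It assumes the first three absences carry no penalty, that the full credit is the 10% listed under Grading Policy, and that the cap stops at 5%. The function name and parameters are hypothetical, not part of any course software.

```python
def max_participation_percent(missed_classes, full_credit=10, free_absences=3, floor=5):
    """Return the maximum Attendance and Participation credit, in percent.

    Assumption: the first `free_absences` missed classes cost nothing; each
    additional missed class lowers the cap by 1%, never below `floor`.
    """
    penalized = max(0, missed_classes - free_absences)
    return max(floor, full_credit - penalized)


if __name__ == "__main__":
    for missed in (0, 3, 5, 8, 12):
        print(missed, "missed ->", max_participation_percent(missed), "% maximum")
```

Under these assumptions, the cap reaches its 5% floor at 8 missed classes and stays there.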

Cell phones must be on silent, and are not to be checked or used during class - if you are expecting an urgent call, tell the instructor at the start of class.

Laptops, tablets or lab computers are welcome in class.

Cell phone policy: One warning, after that 5 points off the next homework for each issue. Same policy for the instructor. One warning, after that, everyone gets 5 points on the next homework.

Grading Policy

Assignments: 60% (4 x 15%)

Attendance and Participation: 10%

Midterm Exam: 10%

Final Exam: 20%
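
The weights above can be combined into a Final Numeric Grade along the following lines. This is only a sketch: it assumes every component is expressed on a 0-100 scale, and the function name is hypothetical. It does not model the scaled adjustment used to assign letter grades.

```python
# Weights in percent, taken from the Grading Policy list.
ASSIGNMENT_WEIGHT = 15      # each of the 4 assignments
PARTICIPATION_WEIGHT = 10
MIDTERM_WEIGHT = 10
FINAL_EXAM_WEIGHT = 20


def final_numeric_grade(assignment_scores, participation, midterm, final_exam):
    """Weighted average of course components, each assumed to be on a 0-100 scale."""
    if len(assignment_scores) != 4:
        raise ValueError("expected exactly 4 assignment scores")
    weighted_sum = (
        ASSIGNMENT_WEIGHT * sum(assignment_scores)
        + PARTICIPATION_WEIGHT * participation
        + MIDTERM_WEIGHT * midterm
        + FINAL_EXAM_WEIGHT * final_exam
    )
    return weighted_sum / 100


if __name__ == "__main__":
    print(final_numeric_grade([90, 80, 100, 70], participation=100, midterm=85, final_exam=90))
```

For the made-up scores in the usage example, the weighted sum is 5100 + 1000 + 850 + 1800 = 8750, giving a numeric grade of 87.5.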

The Final Letter Grade will be based on a scaled adjustment of the Final Numeric Grade. When the scale has been determined, the class will be informed either in class or over email, and it will be posted to the course webpage.

Assignment Policy

Do not cheat. You may discuss assignments with your classmates, but write or program your assignment alone. Do not ask for or offer to share code or written assignments. If you discuss an assignment with a classmate, or on an online forum, include the name of the classmate or the URL of the forum on your assignment or in the documentation of your code. The first instance of cheating results in an automatic zero for the assignment (or final project). A second instance of cheating results in a zero (F) for the course. The Computer Science Department will be notified in writing of all instances of cheating. On a second instance, a report will be submitted to the Office of Academic Integrity.

Assignments will be posted to the course website after class on the date they are assigned.

All assignments will be scored out of 100 points.

There are 4 assignments. Each assignment will have a theoretical (pen-and-paper) component and/or an implementation (coding) component.

Assignments will be due by 11:59pm on their due date. Assignments should be delivered electronically, via email.

Deliver assignments with a timestamp before 11:59pm on the due date to avoid a late penalty. If an extension is needed, let me know as soon as possible. I will do my best to be reasonable to you and fair to the rest of the class.

Incomplete Policy

In extenuating circumstances, students may be given an Incomplete if material has not been completed by the end of the semester. When an Incomplete is granted, the student and instructor will specify, in writing, a timeframe for all outstanding material to be submitted. If no other timeframe has been specified in writing, the deadline for all outstanding material to be submitted to resolve an Incomplete will be one month following the last meeting of the class. This semester, that would make the deadline June 18. An Incomplete that is not resolved by the deadline will become an F.

Schedule

Date Material (Tentative) Assignments
Week 1: Thursday, January 30 Welcome. Introduction to NLTK.
Sorting and Searching.
[pptx]
Read Chapter 1 of NLTK book.
Week 2: Thursday, February 6 Counting Things: Probability, Bayes Rule.
NLTK: FreqDist
[pptx]
Information Retrieval Snippet [pdf]
Assignment 1 Out
Week 3: Thursday, February 13 Counting More Things.
Conditionals
NLTK: ConditionalFreqDist
[pptx]
Sentiment Analysis Snippet [pdf]
Thursday, February 20 No Classes. GC Classes follow Monday Schedule.
Week 4: Thursday, February 27 Matching Things
Regular Expressions.
[pptx]
Speech Recognition Snippet [pptx]
Assignment 1 Due, Assignment 2 Out. Input Files
Week 5: Thursday, March 6 Annotating Things
Corpus Construction.
Part-of-speech tagging
Parsing
More Regular Expressions
[pptx]

Week 6: Thursday, March 13 Relating Things
Assessing word similarity
WordNet
Co-occurrences
Word classes
List Comprehensions.
[pptx]
[pptx]
[pdf]
[wordnet demo]
Week 7: Thursday, March 20 Coding Things
Unix utilities
[pdf]
Assignment 2 Due
Week 8: Thursday, March 27 Midterm Exam
Week 9: Thursday, April 3 Dynamic Programming
Minimum Edit Distance
Assignment 3 Out
Week 10: Thursday, April 10 Midterm Review
Classifying Things
Machine Learning in Computational Linguistics
Using NLTK classification routines.
Evaluation
Thursday, April 17 No Class: Spring Recess
Week 11: Thursday, April 24 Classifying Things
Machine Learning in Computational Linguistics
Using NLTK classification routines.
Evaluation
Assignment 3 Due, Assignment 4 Out
Week 12: Thursday, May 1 Part-of-speech Tagging
Dictionaries
Training a Tagger in NLTK
Week 13: Thursday, May 8 Dynamic Programming
Minimum Edit Distance
Week 14: Thursday, May 15 Plotting and Graphics with nltk and matplotlib
Segmentation.
Textual Entailment.
Assignment 4 Due
Week 15: Thursday, May 22 Final Exam