Fall 2013
Mondays 11:45-1:45
Location TBD
Instructor: Andrew Rosenberg (andrew_at_cs.qc.cuny.edu)
Office Hours: by appointment
Course Description
Due to the vast amount of available language data, the Web both enables and benefits from machine learning and natural language processing techniques. This course will cover 1) seminal and state-of-the-art approaches to language understanding that are robust and/or scalable, 2) machine learning and data analysis technologies that are well-suited to web data including online training, ranking, active learning and outlier detection, 3) core web technologies and APIs, and 4) ensemble methods for merging evidence from disparate sources.
This course satisfies the "Corpus Analysis" or "Advanced Natural Language Processing" requirement of the CUNY Graduate Center Computational Linguistics MA/PhD Certificate Program. Linguistics students must have successfully completed Methods in Computational Linguistics I and II. Completion of Language Technology is also strongly recommended.
An editorial note: I will do my best to balance the needs of
Linguistics and Computer Science students. Regarding the programming
expectations: assignments will include programming from scratch as
well as using external packages and resources. All students should
feel comfortable converting an algorithm or technique that is
presented in pseudocode, diagrams, or natural language into code.
Lecture material will not cover specific implementation issues or
data structures. It is the student's responsibility to solve
implementation issues independently. Similarly, it is the student's
responsibility to be able to install, run, and interact with external
packages and resources.
Schedule
| Date | Material | Assignments |
| --- | --- | --- |
| Monday, September 2 | Labor Day. No Classes | |
| Monday, September 9 | Week 1 - Because that's where the money is. | |
| Monday, September 16 | Week 2 - Document Classification and Word Representations | |
| Monday, September 23 | Week 3 - Web APIs and Multimodal Processing | |
| Monday, September 30 | Week 4 - Recommendation Systems | |
| Monday, October 7 | Week 5 - Question Answering | |
| Tuesday, October 15 | Week 6 - Language Modeling | |
| Monday, October 21 | Week 7 - Sentiment Analysis | |
| Monday, October 28 | Week 8 - Information Retrieval and Ranking | |
| Monday, November 4 | Week 9 - Ensemble Methods | |
| Monday, November 11 | Week 10 - Outlier Detection | |
| Monday, November 18 | Week 11 - Crowdsourcing | |
| Monday, November 25 | Week 12 - Clustering | |
| Monday, December 2 | Week 13 - Modeling User Behavior, Parallelization Small and Large | |
| Monday, December 9 | Week 14 - Student Presentations | |
Learning Outcomes
Upon successful completion of this course, a student can expect to be able:
- To Come
Textbook
TBD
Class Policies
Come to Class. A major component of this class is participation and presentation.
Cell phones must be on silent, and are not to be checked or used during class - if you are expecting an urgent call, tell the instructor at the start of class.
No laptops, tablets or lab computers.
Cell phone and laptop policy: one warning; after that, 5 points off the next homework for each violation.
Grading Policy
Assignments: 60% (4 x 15%)
Final Project: 40%
The Final Letter Grade will be based on a scaled adjustment of the Final Numeric Grade. When the scale has been determined, the class will be informed either in class or over email, and it will be posted to the course webpage (here).
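As a rough illustration of the weights above, the Final Numeric Grade could be computed along the following lines. This is only a sketch: the function name and the sample scores are made up for the example, it assumes the final project is also scored out of 100, and it does not attempt the scaled adjustment to a letter grade, since that scale has not been determined.

```python
# Weights taken from the Grading Policy: four assignments at 15% each,
# and the final project at 40%.
ASSIGNMENT_WEIGHT = 0.15
NUM_ASSIGNMENTS = 4
PROJECT_WEIGHT = 0.40


def final_numeric_grade(assignment_scores, project_score):
    """Return the weighted final numeric grade on a 0-100 scale.

    assignment_scores: four scores, each out of 100.
    project_score: final project score (assumed here to be out of 100).
    """
    if len(assignment_scores) != NUM_ASSIGNMENTS:
        raise ValueError("expected %d assignment scores" % NUM_ASSIGNMENTS)
    assignments_part = ASSIGNMENT_WEIGHT * sum(assignment_scores)
    return assignments_part + PROJECT_WEIGHT * project_score


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(final_numeric_grade([90, 80, 100, 70], 85))
```

Any point deductions described under Class Policies would be applied to the individual homework score before it enters this calculation.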
Assignment Policy
Do not cheat. You may discuss assignments with your classmates, but write or program your assignment alone. Do not ask for or offer to share code, or written assignments. If you discuss an assignment with a classmate, or on an online forum, include the name of the classmate or URL of the forum on your assignment or in the documentation of your code. The first instance of cheating results in an automatic zero for the assignment (or final project). A second instance of cheating results in a zero (F) for the course. The Computer Science Department will be notified in writing of all instances of cheating. On a second instance a report will be submitted to the Office of Academic Integrity.
Assignments will be posted to the website (here) after class on the date they are assigned.
All assignments will be scored out of 100 points.
There are 4 assignments, each worth 15% of the final grade (see Grading Policy). Each assignment will have a theoretical (pen-and-paper) component. Assignments may also include an implementation (coding) component.
Assignments will be due by 11:59pm on their due date. Assignments should be delivered electronically, via email.
No late assignments will be accepted.
If an extension is needed, let me know as early as possible. I will do my best to be reasonable to you and fair to the rest of the class. No extensions will be granted within 24 hours of the time the assignment is due.
Coding Assignments
If an assignment has a programming component, it can be written in C++, Java, or Python.
In general, grading will be 65% Implementation (compilation, passing tests, implementational details) and 35% Documentation and Style. This may be adjusted for some assignments. Always read the assignment for the grading breakdown.
Detailed requirements will accompany each assignment. The instructions and requirements on a particular assignment always take precedence over the general guidelines on the course website.
Coding assignments should be submitted electronically. Submitting multiple times is fine. The latest assignment submitted on time will be graded. If you submit an assignment late, after submitting an assignment on time, you must let me know, via email, that you would like the late submission graded for the assignment.
README guidelines
Each coding assignment will require a README file as a component of its documentation. A README file should provide a high-level description of your assignment, or project.
A successful README file will include the following:
- A description of the problem addressed -- in plain English.
- A description of your solution to the problem -- again, in plain English.
- If you feel that either of these descriptions can benefit from the inclusion of code, include pseudocode rather than a verbatim code listing.
- A description of each file that is part of the submission.
- Information about how to use your code -- instructions for compiling and running your code from the command line and, if it is structured as an API, how to use the implemented methods (arguments, preconditions, postconditions).
- Indication of any areas where your code differs from the assignment requirements -- any area of incompleteness, different method signatures, different command line parameters, etc.
Written Assignments
Written Assignments should also be delivered electronically, via email or Google Docs.
Electronic copies must be in one of the following formats: .pdf, Microsoft Word .doc, Google Docs.
Points for each question will be described in each assignment.
Incomplete Policy
In extenuating circumstances, students may be given an Incomplete if material has not been completed by the end of the semester. When an incomplete is granted, the student and instructor will specify, in writing, a timeframe for all outstanding material to be submitted. If no other timeframe has been specified in writing, the deadline for all outstanding material to be submitted to resolve an incomplete will be one month following the last meeting of the class. This semester, that would make the deadline: TBD. An incomplete that is not resolved by the deadline will become an F.
Final Project
The Final Project will be an original research project. Possible project ideas will be presented in class. Part of the project will be a short (5-10 minute) presentation of your work.
The goal of the project is to carry out research that incorporates Natural Language Processing, Machine Learning, and Web Technologies. Acceptable project ideas will involve either a modification to an existing approach to a problem, or a novel problem entirely. Note: a successful project does not need to generate state-of-the-art results. Novelty, however, is expected. A short, 4-page report on the algorithm, dataset/problem, and evaluation is expected as part of the project.