Natural Language Processing,
Machine Learning and The Web

Fall 2013
Mondays 11:45-1:45
Location TBD
Instructor: Andrew Rosenberg (andrew_at_cs.qc.cuny.edu)
Office Hours: by appointment

Course Description

Due to the vast amount of available language data, the Web both enables and benefits from machine learning and natural language processing techniques. This course will cover 1) seminal and state-of-the-art approaches to language understanding that are robust and/or scalable, 2) machine learning and data analysis technologies that are well-suited to web data including online training, ranking, active learning and outlier detection, 3) core web technologies and APIs, and 4) ensemble methods for merging evidence from disparate sources.

This course satisfies the "Corpus Analysis" or "Advanced Natural Language Processing" requirement of the CUNY Graduate Center Computational Linguistics MA/PhD Certificate Program. Linguistics students must have successfully completed Methods in Computational Linguistics I and II. Completion of Language Technology is also strongly recommended.

An editorial note: I will do my best to balance the needs of Linguistics and Computer Science students. Regarding the programming expectations: assignments will include programming from scratch as well as using external packages and resources. All students should feel comfortable converting an algorithm or technique that is presented in pseudocode, diagrams, or natural language into code. Lecture material will not cover specific implementation issues or data structures. It is the student's responsibility to solve implementation issues independently. Similarly, it is the student's responsibility to be able to install, run and interact with external packages and resources.

Schedule

Date Material Assignments
Monday, September 2 Labor Day. No Classes
Monday, September 9 Week 1 - Because that's where the money is.
Monday, September 16 Week 2 - Document Classification and Word Representations
Monday, September 23 Week 3 - Web APIs and Multimodal Processing
Monday, September 30 Week 4 - Recommendation Systems
Monday, October 7 Week 5 - Question Answering
Tuesday, October 15 Week 6 - Language Modeling
Monday, October 21 Week 7 - Sentiment Analysis
Monday, October 28 Week 8 - Information Retrieval and Ranking
Monday, November 4 Week 9 - Ensemble Methods
Monday, November 11 Week 10 - Outlier Detection
Monday, November 18 Week 11 - Crowdsourcing
Monday, November 25 Week 12 - Clustering
Monday, December 2 Week 13 - Modeling User Behavior, Parallelization Small and Large
Monday, December 9 Week 14 - Student Presentations

Learning Outcomes

Upon successful completion of this course, a student can expect to be able:

  1. To Come

Textbook

TBD

Class Policies

Come to Class. A major component of this class is participation and presentation.

Cell phones must be on silent, and are not to be checked or used during class - if you are expecting an urgent call, tell the instructor at the start of class.

No laptops, tablets or lab computers.

Cell phone and laptop policy: one warning; after that, 5 points off the next homework for each violation.

Grading Policy

Assignments: 60% (4 x 15%)

Final Project: 40%

The Final Letter Grade will be based on a scaled adjustment of the Final Numeric Grade. When the scale has been determined, the class will be informed either in class or over email, and it will be posted to the course webpage (here).
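As a rough illustration of how the weights above combine into a Final Numeric Grade, here is a minimal Python sketch. The function name and the sample scores are hypothetical, it assumes each component is scored out of 100, and it does not model the scaled adjustment to a letter grade, since that scale has not yet been determined.

```python
def final_numeric_grade(assignment_scores, project_score):
    """Combine four assignment scores and one project score, each out of 100."""
    if len(assignment_scores) != 4:
        raise ValueError("expected exactly 4 assignment scores")
    # Each assignment is worth 15% (4 x 15% = 60%); the final project is worth 40%.
    # Integer weights are applied before dividing to avoid rounding surprises.
    return (15 * sum(assignment_scores) + 40 * project_score) / 100


# Hypothetical scores, for illustration only.
print(final_numeric_grade([90, 80, 100, 70], 85))
```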

Assignment Policy

Do not cheat. You may discuss assignments with your classmates, but write or program your assignment alone. Do not ask for or offer to share code or written assignments. If you discuss an assignment with a classmate, or on an online forum, include the name of the classmate or the URL of the forum on your assignment or in the documentation of your code. The first instance of cheating results in an automatic zero for the assignment (or final project). A second instance of cheating results in a zero (F) for the course. The Computer Science Department will be notified in writing of all instances of cheating. On a second instance, a report will be submitted to the Office of Academic Integrity.

Assignments will be posted to the website (here) after class on the date they are assigned.

All assignments will be scored out of 100 points.

There are 4 assignments. Each assignment will have a theoretical (pen-and-paper) component. Assignments may also include an implementation (coding) component.

Assignments will be due by 11:59pm on their due date. Assignments should be delivered electronically, via email.

No late assignments will be accepted. If you need an extension, let me know as early as possible. I will do my best to be reasonable to you and fair to the rest of the class. No extensions will be granted within 24 hours of the assignment deadline.

Coding Assignments

If an assignment has a programming component, it can be written in C++, Java or Python.

In general, grading will be 65% Implementation (compilation, passing tests, implementation details) and 35% Documentation and Style. This may be adjusted for some assignments. Always read the assignment for the grading breakdown.

Detailed requirements will accompany each assignment. The instructions and requirements on a particular assignment always take precedence over the general guidelines on the course website.

Coding assignments should be submitted electronically. Submitting multiple times is fine. The latest submission received on time will be graded. If you submit an assignment late after having submitted it on time, you must let me know, via email, that you would like the late submission graded instead.

README guidelines

Each coding assignment will require a README file as a component of its documentation. A README file should provide a high-level description of your assignment, or project.

A successful README file will include the following:

Written Assignments

Written assignments should also be delivered electronically, via email or Google Docs.

Electronic copies must be in one of the following formats: .pdf, Microsoft Word .doc, Google Docs.

Points for each question will be described in each assignment.

Incomplete Policy

In extenuating circumstances, students may be given an Incomplete if material has not been completed by the end of the semester. When an incomplete is granted, the student and instructor will specify, in writing, a timeframe for all outstanding material to be submitted. If no other timeframe has been specified in writing, the deadline for all outstanding material to be submitted to resolve an incomplete will be one month following the last meeting of the class. This semester, that would make the deadline: TBD. An incomplete that is not resolved by the deadline will become an F.

Final Project

The Final Project will be an original research project. Possible project ideas will be presented in class. Part of the project will be a short (5-10 minute) presentation of your work.

The goal of the project is to carry out research incorporating Natural Language Processing, Machine Learning and Web Technologies. Acceptable project ideas will involve either a modification to an existing approach to a problem, or an entirely novel problem. Note: a successful project does not need to generate state-of-the-art results. Novelty, however, is expected. A short (4-page) report on the algorithm, dataset/problem, and evaluation is expected as part of the project.