This is the course website for the 2017/2018 academic year. See the current iteration.

Natural Language Processing

Task 04: Sentiment Analysis

Deadline

26th April 2018

Submission

Please submit the solution (runnable source files) by email to the lecturer's email address. Use "NLP: Task 04" as the subject.

Acknowledgment

This homework is adapted from Chris Manning and Dan Jurafsky's Coursera NLP class from 2012.

Description

In this assignment, you will perform sentiment analysis on movie review data taken from IMDB, classifying entire reviews as either positive or negative.

Files

The assignment consists of the following files (bold are to be implemented):

File                        Description
data/imdb1                  Training data.
data/poldata.README.2.0     Data description.
data/english.stop           English stop-word list.
python/NaiveBayes.py        Main runnable script that prepares the data and trains and evaluates the model.

Goal

You will implement a Naive Bayes model with Laplace smoothing, following the pseudocode in the book Introduction to Information Retrieval by Manning et al., page 260. The classifier uses words as features, sums the log probability scores of the tokens in a review, and makes a binary decision (positive or negative). In addition, you will use stop-word filtering: removing common words like "the", "a", "it" from the training and test sets (the list is provided).
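The idea can be illustrated with a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing and optional stop-word filtering. It is not the structure of the provided NaiveBayes.py: the class name `NaiveBayesSketch` and its methods are hypothetical, and skipping tokens never seen in training is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch(object):
    """Hypothetical multinomial Naive Bayes with add-one smoothing."""

    def __init__(self, stop_words=None):
        self.stop_words = set(stop_words or [])   # empty set = no filtering
        self.doc_counts = Counter()                # documents seen per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        self.vocab = set()                         # all tokens seen in training

    def _filter(self, words):
        # Drop stop words; a no-op when no stop-word list was given.
        return [w for w in words if w not in self.stop_words]

    def add_example(self, label, words):
        words = self._filter(words)
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, words):
        total_docs = float(sum(self.doc_counts.values()))
        best_label, best_score = None, float("-inf")
        for label in sorted(self.doc_counts):
            # Log prior of the class.
            score = math.log(self.doc_counts[label] / total_docs)
            # Laplace smoothing: add 1 to each count, |V| to the denominator.
            denom = float(sum(self.word_counts[label].values()) + len(self.vocab))
            for w in self._filter(words):
                if w in self.vocab:  # assumption: ignore unseen tokens
                    score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only.
nb = NaiveBayesSketch(stop_words=["the"])
nb.add_example("pos", ["great", "fun", "the", "film"])
nb.add_example("neg", ["boring", "dull", "the", "film"])
```

Summing logarithms instead of multiplying probabilities avoids floating-point underflow on long reviews, and the smoothing keeps a word that appears in only one class from driving the other class's probability to zero.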

The code comes set up for 10-fold cross-validation training and testing. When training on a review, use its label (positive or negative). During testing, use the label only to compute the accuracy. First, train and evaluate your model using the provided cross-validation mechanism. Next, evaluate the model again with stop words removed and compare the results for the given dataset.
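For orientation, k-fold cross-validation can be sketched as follows. The provided script already implements its own fold assignment, which may differ; here the function `cross_validate`, the round-robin split by index, and the toy lookup "model" are all assumptions made for illustration.

```python
def cross_validate(examples, train_fn, classify_fn, num_folds=10):
    """Return the accuracy of each fold.

    examples    -- list of (label, words) pairs
    train_fn    -- builds a model from a list of examples
    classify_fn -- maps (model, words) to a predicted label
    """
    accuracies = []
    for fold in range(num_folds):
        # Assumption: example i belongs to fold i % num_folds.
        train = [ex for i, ex in enumerate(examples) if i % num_folds != fold]
        test = [ex for i, ex in enumerate(examples) if i % num_folds == fold]
        if not test:
            continue  # fewer examples than folds
        model = train_fn(train)
        # The gold label is used only here, to score the prediction.
        correct = sum(1 for label, words in test
                      if classify_fn(model, words) == label)
        accuracies.append(correct / float(len(test)))
    return accuracies


# Toy stand-in for a real classifier: remember the label of each word.
def train_lookup(train):
    return dict((w, label) for label, words in train for w in words)


def classify_lookup(model, words):
    return model.get(words[0])


toy_examples = [("pos", ["good"]), ("neg", ["bad"])] * 10
fold_accuracies = cross_validate(toy_examples, train_lookup, classify_lookup)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

Running the same loop once with and once without stop-word filtering gives two mean accuracies that can be compared directly.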

You should make changes in at least these functions:

To run the code, pass the dataset location as the required positional argument. The optional -f flag turns on stop-word filtering:

$ python2 NaiveBayes.py [-f] ../data/imdb1
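A command line of this shape could be parsed as in the sketch below. The provided script may handle its arguments differently; `parse_args`, `filter_stop_words`, and `data_dir` are hypothetical names, and `argparse` is used only as one standard-library option.

```python
import argparse


def parse_args(argv=None):
    """Parse '[-f] DATA_DIR'; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Naive Bayes sentiment classifier (sketch)")
    parser.add_argument("-f", action="store_true", dest="filter_stop_words",
                        help="turn on stop-word filtering")
    parser.add_argument("data_dir", help="dataset location, e.g. ../data/imdb1")
    return parser.parse_args(argv)


with_filter = parse_args(["-f", "../data/imdb1"])
without_filter = parse_args(["../data/imdb1"])
```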

Grading

Your model should reach an accuracy of at least 80%. The grading of the assignment is divided as follows:

Download files