26th April 2018
Please submit the solution (runnable source files) by email to the lecturer's email address. Use "NLP: Task 04" as the subject.
This homework is adapted from Chris Manning and Dan Jurafsky's Coursera NLP class from 2012.
In this assignment, you will perform sentiment analysis on movie review data taken from IMDB, classifying entire reviews as either positive or negative.
The assignment consists of the following files (bold are to be implemented):
File | Description |
---|---|
data/imdb1 | Training data. |
data/poldata.README.2.0 | Data description. |
data/english.stop | English stop-word list. |
**python/NaiveBayes.py** | Main runnable script that prepares the data and trains and evaluates the model. |
You will be implementing a Naive Bayes model (following the pseudocode in the book Introduction to Information Retrieval by Manning et al., page 260) using Laplace smoothing. The classifier will use words as features, add the log probability score for each token, and make a binary decision. In addition, you will use stop-word filtering: removing common words like "the", "a", "it" from the train and test sets (the list is provided).
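To make the description above concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing and optional stop-word filtering. The class name `NaiveBayesSketch` and its methods are hypothetical and do not mirror the function names in the provided `NaiveBayes.py` skeleton. Skipping test tokens that were never seen in training is an assumption based on the book's pseudocode, which only scores tokens from the training vocabulary.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch(object):
    """Illustrative multinomial Naive Bayes with add-one smoothing."""

    def __init__(self, stop_words=None):
        self.stop_words = set(stop_words or [])  # empty set means no filtering
        self.doc_counts = Counter()              # class -> number of training reviews
        self.word_counts = defaultdict(Counter)  # class -> token -> count
        self.vocab = set()                       # all tokens seen in training

    def _filter(self, words):
        # Drop stop words; a no-op when no stop-word list was given.
        return [w for w in words if w not in self.stop_words]

    def add_example(self, klass, words):
        # Training step: update the counts for one labelled review.
        words = self._filter(words)
        self.doc_counts[klass] += 1
        self.word_counts[klass].update(words)
        self.vocab.update(words)

    def classify(self, words):
        # Return the class with the highest log prior + summed log likelihoods.
        words = self._filter(words)
        total_docs = sum(self.doc_counts.values())
        best_klass, best_score = None, float("-inf")
        for klass in sorted(self.doc_counts):
            counts = self.word_counts[klass]
            # Laplace smoothing: add 1 to every count, |V| to the denominator.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[klass] / float(total_docs))
            for w in words:
                if w not in self.vocab:
                    continue  # assumption: ignore tokens unseen in training
                score += math.log((counts[w] + 1.0) / denom)
            if score > best_score:
                best_klass, best_score = klass, score
        return best_klass


# Toy usage with made-up reviews (not taken from the IMDB data).
nb = NaiveBayesSketch(stop_words=["the", "a", "it"])
nb.add_example("pos", ["a", "great", "fun", "film"])
nb.add_example("neg", ["it", "was", "boring", "the", "end"])
print(nb.classify(["the", "film", "was", "great"]))  # prints "pos"
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why the scores are added rather than multiplied.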
The code comes set up for 10-fold cross-validation training and testing. During training, use each review's label (positive or negative) to update your model. During testing, use the label only to compute the accuracy. First, train and evaluate your model using the provided cross-validation mechanism. Next, evaluate the model again with stop words removed and compare the results for the given dataset.
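The general idea behind k-fold cross-validation can be sketched as follows. This is only an illustration: the function names (`cross_validate`, `train_lookup`) and the round-robin fold assignment are assumptions, and the provided script may split the data differently. Each fold is held out once for testing while the remaining folds are used for training, and the labels of the held-out reviews are used only to score the predictions.

```python
def cross_validate(examples, train_fn, num_folds=10):
    """Return the mean accuracy over num_folds held-out folds.

    examples: list of (label, words) pairs.
    train_fn: hypothetical callable that takes training examples and
              returns a predict(words) -> label function.
    """
    accuracies = []
    for fold in range(num_folds):
        # Round-robin split (an assumption): example i goes to fold i % num_folds.
        test = [ex for i, ex in enumerate(examples) if i % num_folds == fold]
        train = [ex for i, ex in enumerate(examples) if i % num_folds != fold]
        if not test:
            continue  # skip empty folds when there are fewer examples than folds
        predict = train_fn(train)
        # Labels of held-out reviews are used only to count correct predictions.
        correct = sum(1 for label, words in test if predict(words) == label)
        accuracies.append(correct / float(len(test)))
    return sum(accuracies) / len(accuracies)


def train_lookup(train):
    # Toy stand-in for a real classifier: remember the label of each first word.
    table = dict((words[0], label) for label, words in train)
    return lambda words: table.get(words[0])


toy = [("pos", ["good"])] * 10 + [("neg", ["bad"])] * 10
print(cross_validate(toy, train_lookup))  # prints 1.0
```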
You should make changes in at least these functions:
To run the code, pass the dataset location as the required positional argument. The optional -f flag turns on stop-word filtering:
$ python2 NaiveBayes.py [-f] ../data/imdb1
Your model should achieve an accuracy of at least 80%. The grading of the assignment is divided as follows: