25th March 2018
Please submit the solution (runnable source files) by email to the lecturer's email address. Use "NLP: Task 02" as the subject.
This homework is adapted from Chris Manning and Dan Jurafsky's Coursera NLP class from 2012.
In this assignment you will learn how to generate random sentences using n-gram language models. You will also implement Laplace smoothing and compare how natural the generated sentences seem with and without this smoothing technique.
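For reference, the quantity behind the Laplace-smoothed (add-one) bigram model is P(w_i | w_{i-1}) = (count(w_{i-1} w_i) + 1) / (count(w_{i-1}) + V), where V is the vocabulary size. Below is a minimal sketch of this estimate in Python; the names `bigram_counts`, `unigram_counts` and `vocab_size` are illustrative and not part of the provided skeleton:

```python
import math

def laplace_bigram_logprob(prev_word, word, bigram_counts, unigram_counts, vocab_size):
    # Add-one smoothing: every bigram gets a pseudo-count of 1, so
    # bigrams never seen in training still receive a small non-zero
    # probability instead of zero.
    numerator = bigram_counts.get((prev_word, word), 0) + 1
    denominator = unigram_counts.get(prev_word, 0) + vocab_size
    return math.log(float(numerator) / denominator)
```

Summing such log-probabilities over all bigrams in a sentence gives the sentence's log-probability, which is what the scoring and generation code needs.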
The assignment consists of the following files (those in bold are to be implemented):
| File | Description |
|---|---|
| data/holbrook-tagged.dat | Training corpus |
| data/simple.dat | A small part of the training corpus (useful for faster debugging of your solution) |
| python/Generate.py | Main runnable script that prepares the data, trains the models and generates sentences. |
| python/UnigramLanguageModel.py | Already implemented unigram model (this is how your solution should look). |
| **python/BigramLanguageModel.py** | Bigram language model. |
| **python/TrigramLanguageModel.py** | Trigram language model. |
| **python/LaplaceUnigramLanguageModel.py** | Unigram language model using Laplace smoothing. |
| **python/LaplaceBigramLanguageModel.py** | Bigram language model using Laplace smoothing. |
| python/HolbrookCorpus.py | Loading and representation of the corpus. |
| python/Datum.py | One unit (token) of the corpus. |
| python/Sentence.py | List of datums. |
| python/WeightedChoice.py | A function to pick a word from a probability distribution (see the sketch below the table). |
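The weighted pick used during generation can be implemented by walking the cumulative weights until a uniform random draw falls into a word's slot. A minimal sketch of that idea follows; the actual interface of WeightedChoice.py in the skeleton may differ:

```python
import random

def weighted_choice(words_with_weights):
    # `words_with_weights` is a list of (word, weight) pairs; weights
    # need not sum to 1, only be non-negative.
    total = sum(weight for _, weight in words_with_weights)
    threshold = random.uniform(0, total)
    cumulative = 0.0
    for word, weight in words_with_weights:
        cumulative += weight
        if cumulative >= threshold:
            return word
    # Guard against floating-point underrun at the end of the walk.
    return words_with_weights[-1][0]
```

For example, `weighted_choice([("the", 0.5), ("cat", 0.3), ("sat", 0.2)])` returns "the" roughly half the time.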
Your task is to implement the four models shown in bold above.
Each implemented model is worth 25% of the grade (subject to slight variation).
A 10% bonus is available for additionally computing the perplexity of the models. Split the data into training and test sets (for this purpose only) and compute the average perplexity over the test sentences.
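Per-sentence perplexity is the exponentiated negative mean log-probability per token: PP(s) = exp(-log P(s) / N), where N is the number of tokens in s. A minimal sketch, assuming each model exposes a `score(sentence)` method that returns the natural-log probability of the whole sentence (adjust accordingly if your models log in a different base):

```python
import math

def average_perplexity(model, test_sentences):
    # Average the per-sentence perplexities over the held-out test set.
    perplexities = []
    for sentence in test_sentences:
        logprob = model.score(sentence)
        # Perplexity is exp of the negative mean log-probability per token.
        perplexities.append(math.exp(-logprob / len(sentence)))
    return sum(perplexities) / len(perplexities)
```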