
Natural Language Processing

Task 02: N-grams

Deadline

25th March 2018

Submission

Please email the solution (runnable source files) to the lecturer. Use "NLP: Task 02" as the subject.

Acknowledgment

This homework is adapted from Chris Manning and Dan Jurafsky's Coursera NLP class from 2012.

Description

In this assignment you will learn how to generate random sentences using n-gram language models. You will also implement Laplace smoothing and compare how natural the generated sentences seem with and without it.
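To make the modeling part concrete: a bigram model counts how often each word follows another and turns those counts into conditional probabilities, while Laplace (add-one) smoothing pretends every possible bigram was seen one extra time so that unseen word pairs do not get zero probability. Below is only a minimal sketch of that idea; the class and method names are placeholders, not the interface of the skeleton files, so follow the structure of UnigramLanguageModel.py for your actual solution:

    import math
    from collections import defaultdict

    class SketchLaplaceBigramModel:
        """Illustrative add-one (Laplace) smoothed bigram model.
        Names and structure are placeholders, not the assignment skeleton."""

        def __init__(self, corpus):
            # corpus: iterable of sentences, each a list of word tokens
            self.bigram_counts = defaultdict(int)
            self.context_counts = defaultdict(int)
            vocab = set()
            for sentence in corpus:
                tokens = ['<s>'] + sentence + ['</s>']
                vocab.update(tokens)
                for prev, curr in zip(tokens, tokens[1:]):
                    self.bigram_counts[(prev, curr)] += 1
                    self.context_counts[prev] += 1
            self.vocab_size = len(vocab)

        def score(self, sentence):
            """Log-probability of a tokenized sentence under the model."""
            tokens = ['<s>'] + sentence + ['</s>']
            logprob = 0.0
            for prev, curr in zip(tokens, tokens[1:]):
                # Add-one smoothing: one extra count for every bigram,
                # and the vocabulary size added to the denominator.
                numerator = self.bigram_counts[(prev, curr)] + 1
                denominator = self.context_counts[prev] + self.vocab_size
                logprob += math.log(numerator) - math.log(denominator)
            return logprob

The practical effect of the smoothing is that a sentence containing a word pair never seen in training still receives a small nonzero probability instead of breaking the scoring.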

Files

The assignment consists of the following files (files marked with * are to be implemented):

data/holbrook-tagged.dat: Training corpus
data/simple.dat: A small part of the training corpus (useful for faster debugging of your solution)
python/Generate.py: Main runnable script that prepares the data, trains the models and generates sentences.
python/UnigramLanguageModel.py: Already implemented unigram model (this is how your solution should look).
python/BigramLanguageModel.py *: Bigram language model.
python/TrigramLanguageModel.py *: Trigram language model.
python/LaplaceUnigramLanguageModel.py *: Unigram language model using Laplace smoothing.
python/LaplaceBigramLanguageModel.py *: Bigram language model using Laplace smoothing.
python/HolbrookCorpus.py: Loading and representation of the corpus.
python/Datum.py: One unit (token) of the corpus.
python/Sentence.py: List of datums.
python/WeightedChoice.py: A function to pick a word based on a probability distribution (a minimal sampler sketch follows this list).
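WeightedChoice.py is provided with the assignment; the sketch below only illustrates what such a sampler does (the function name and interface here are assumptions, not the provided code). During generation, the model produces a distribution over possible next words and the sampler picks one of them:

    import random

    def weighted_choice_sketch(distribution):
        """Pick a key from a {word: weight} mapping proportionally to its weight.
        Illustration only; the provided WeightedChoice.py may differ."""
        total = sum(distribution.values())
        threshold = random.uniform(0, total)
        running = 0.0
        for word, weight in distribution.items():
            running += weight
            if running >= threshold:
                return word
        return word  # fallback for floating-point rounding at the upper edge

For example, weighted_choice_sketch({'the': 0.5, 'a': 0.3, 'cat': 0.2}) returns 'the' about half the time.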

Goal

Your task is to implement the four models marked with * above.

Grading

25% for each model you implement (may be subject to slight variations).

10% bonus for additionally computing the perplexity of your models. Split the data into training and test sets (for this purpose only) and compute the average perplexity over the test sentences.
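One reasonable reading of "average perplexity over sentences" is the mean of per-sentence perplexities on the held-out test set. A minimal sketch, assuming a model object whose score() returns a natural-log probability of a tokenized sentence (as in the bigram sketch above; the helper names are placeholders):

    import math

    def sentence_perplexity(model, sentence):
        """Perplexity of one tokenized sentence: exp of the negative
        average log-probability per token (end marker included)."""
        logprob = model.score(sentence)
        n_tokens = len(sentence) + 1
        return math.exp(-logprob / n_tokens)

    def average_perplexity(model, test_sentences):
        """Mean per-sentence perplexity over a held-out test set."""
        values = [sentence_perplexity(model, s) for s in test_sentences]
        return sum(values) / len(values)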

Download files