25th March 2018
Please submit the solution (runnable source files) by email to the lecturer's email address. Use "NLP: Task 02" as the subject.
This homework is adapted from Chris Manning and Dan Jurafsky's Coursera NLP class from 2012.
In this assignment you will learn how to generate random sentences using n-gram language models. You will also implement Laplace smoothing and compare how natural the generated sentences seem with and without this smoothing technique.
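For reference, the quantity behind the Laplace-smoothed (add-one) bigram model is P(w_i | w_{i-1}) = (count(w_{i-1} w_i) + 1) / (count(w_{i-1}) + V), where V is the vocabulary size. Below is a minimal sketch of this estimate in Python; the names `bigram_counts`, `unigram_counts` and `vocab_size` are illustrative and not part of the provided skeleton:

```python
import math

def laplace_bigram_logprob(prev_word, word, bigram_counts, unigram_counts, vocab_size):
    # Add-one smoothing: every bigram gets a pseudo-count of 1, so
    # bigrams never seen in training still receive a small non-zero
    # probability instead of zero.
    numerator = bigram_counts.get((prev_word, word), 0) + 1
    denominator = unigram_counts.get(prev_word, 0) + vocab_size
    return math.log(float(numerator) / denominator)
```

Summing such log-probabilities over all bigrams in a sentence gives the sentence's log-probability, which is what the scoring and generation code needs.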
The assignment consists of the following files (those in bold are to be implemented):
| File | Description |
|---|---|
| data/holbrook-tagged.dat | Training corpus |
| data/simple.dat | A small part of the training corpus (useful for faster debugging of your solution) |
| python/Generate.py | Main runnable script that prepares the data, trains the models and generates sentences. |
| python/UnigramLanguageModel.py | Already implemented unigram model (this is how your solution should look). |
| **python/BigramLanguageModel.py** | Bigram language model. |
| **python/TrigramLanguageModel.py** | Trigram language model. |
| **python/LaplaceUnigramLanguageModel.py** | Unigram language model using Laplace smoothing. |
| **python/LaplaceBigramLanguageModel.py** | Bigram language model using Laplace smoothing. |
| python/HolbrookCorpus.py | Loading and representation of the corpus. |
| python/Datum.py | One unit (token) of the corpus. |
| python/Sentence.py | List of datums. |
| python/WeightedChoice.py | A function to pick a word from a probability distribution (see the sketch below the table). |
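The weighted pick used during generation can be implemented by walking the cumulative weights until a uniform random draw falls into a word's slot. A minimal sketch of that idea follows; the actual interface of WeightedChoice.py in the skeleton may differ:

```python
import random

def weighted_choice(words_with_weights):
    # `words_with_weights` is a list of (word, weight) pairs; weights
    # need not sum to 1, only be non-negative.
    total = sum(weight for _, weight in words_with_weights)
    threshold = random.uniform(0, total)
    cumulative = 0.0
    for word, weight in words_with_weights:
        cumulative += weight
        if cumulative >= threshold:
            return word
    # Guard against floating-point underrun at the end of the walk.
    return words_with_weights[-1][0]
```

For example, `weighted_choice([("the", 0.5), ("cat", 0.3), ("sat", 0.2)])` returns "the" roughly half the time.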
Your task is to implement the four models shown in bold above.
Each implemented model is worth 25% of the grade (subject to slight variation).
A 10% bonus is available for additionally computing the perplexity of the models. Split the data into training and test sets (for this purpose only) and compute the average perplexity over the test sentences.
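Per-sentence perplexity is the exponentiated negative mean log-probability per token: PP(s) = exp(-log P(s) / N), where N is the number of tokens in s. A minimal sketch, assuming each model exposes a `score(sentence)` method that returns the natural-log probability of the whole sentence (adjust accordingly if your models log in a different base):

```python
import math

def average_perplexity(model, test_sentences):
    # Average the per-sentence perplexities over the held-out test set.
    perplexities = []
    for sentence in test_sentences:
        logprob = model.score(sentence)
        # Perplexity is exp of the negative mean log-probability per token.
        perplexities.append(math.exp(-logprob / len(sentence)))
    return sum(perplexities) / len(perplexities)
```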