1st April 2018
Please submit the solution (runnable source files) by email to the lecturer's email address. Use "NLP: Task 03" as the subject.
This homework is adapted from Chris Manning and Dan Jurafsky's Coursera NLP class from 2012.
The aim of this assignment is to implement a spell checker using the noisy channel model. In particular, you will be given the edit model (the likelihood term of the noisy channel), and your task is to implement the language model (the prior term). At test time, you will be given a sentence containing exactly one typing error. The correction with the highest probability under the noisy channel model, using your language model, is then selected. Your models will be evaluated for accuracy.
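The noisy channel selection step can be sketched as follows: for each candidate correction, add the edit model's log-likelihood to the language model's log-prior and keep the argmax. The `edit_probabilities` and `score` method names below are illustrative assumptions, not the assignment's actual API:

```python
import math

def correct_sentence(sentence, edit_model, language_model):
    """Pick the candidate correction with the highest noisy-channel score.

    Assumed (hypothetical) interface: edit_model.edit_probabilities(sentence)
    yields (candidate_sentence, log_edit_prob) pairs, and
    language_model.score(candidate) returns a log-probability.
    """
    best, best_score = None, -math.inf
    for candidate, log_edit_prob in edit_model.edit_probabilities(sentence):
        # Noisy channel in log space: log P(x|w) + log P(w)
        score = log_edit_prob + language_model.score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Working in log space avoids numerical underflow when multiplying many small probabilities.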
The assignment consists of the following files (bold are to be implemented):
File | Description |
---|---|
data/holbrook-tagged-train.dat | The training corpus. |
data/holbrook-tagged-dev.dat | The development corpus. |
data/count_1edit.txt | Table listing counts of edits x\|w (derived from Wikipedia). |
python/SpellCorrect.py | Main runnable script that prepares the data and trains and evaluates the models. |
python/UniformLanguageModel.py | Already implemented uniform probability model (this is the basic layout of a language model). |
python/UnigramLanguageModel.py | Unigram probability model. |
python/LaplaceUnigramLanguageModel.py | A unigram model with add-one smoothing. Treat out-of-vocabulary items as a word which was seen zero times in training. |
python/LaplaceBigramLanguageModel.py | A bigram model with add-one smoothing. |
python/StupidBackoffLanguageModel.py | An unsmoothed bigram model combined with backoff to an add-one smoothed unigram model. |
python/CustomLanguageModel.py | A language model of your choice. |
python/HolbrookCorpus.py | Loading and representation of the corpus. |
python/Datum.py | One unit (token) of the corpus. |
python/Sentence.py | List of datums. |
python/EditModel.py | The edit model. |
python/SpellingResult.py | A spelling result. |
Your task is to implement the models shown in bold above. In the custom language model, you should implement a model of your choice, such as interpolated Kneser-Ney, Good-Turing smoothing, trigrams, or anything else you come up with, provided that you train the model on the supplied training data. The custom language model must perform at least as well as the stupid backoff model.
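As a rough guide to what the stupid backoff model involves, the sketch below combines an unsmoothed bigram estimate with a fixed-discount backoff to an add-one smoothed unigram. The class name, constructor signature, and the choice of `alpha = 0.4` are illustrative assumptions; the assignment's own classes and corpus representation will differ:

```python
import math
from collections import defaultdict

class StupidBackoffLM:
    """Sketch of stupid backoff: use the unsmoothed bigram estimate when the
    bigram was seen in training, otherwise back off (with a fixed discount
    alpha) to an add-one smoothed unigram estimate."""

    def __init__(self, corpus, alpha=0.4):
        # corpus: iterable of token lists, one list per sentence (assumed format)
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.total = 0
        self.alpha = alpha
        for sentence in corpus:
            for i, w in enumerate(sentence):
                self.unigrams[w] += 1
                self.total += 1
                if i > 0:
                    self.bigrams[(sentence[i - 1], w)] += 1

    def score(self, sentence):
        """Log-probability of a token list under the model."""
        logp = 0.0
        vocab = len(self.unigrams) + 1  # +1 for the unseen-word class
        for i, w in enumerate(sentence):
            prev = sentence[i - 1] if i > 0 else None
            big = self.bigrams.get((prev, w), 0)
            if prev is not None and big > 0:
                # unsmoothed bigram: count(prev, w) / count(prev)
                logp += math.log(big) - math.log(self.unigrams[prev])
            else:
                # back off to add-one smoothed unigram, discounted by alpha
                logp += math.log(self.alpha)
                logp += math.log(self.unigrams.get(w, 0) + 1)
                logp -= math.log(self.total + vocab)
        return logp
```

Note that stupid backoff scores are not normalized probabilities, which is acceptable here because only the ranking of candidate corrections matters.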
The evaluation is based on the implementation and accuracy of the models on development data. The expected log-probability performance is as follows: