This is the course website for the 2017/2018 academic year. See the current iteration.

Natural Language Processing

Task 03: Spelling Correction

Deadline

1st April 2018

Submission

Please submit the solution (runnable source files) by email to the lecturer's email address. Use "NLP: Task 03" as the subject.

Acknowledgment

This homework is adapted from Chris Manning and Dan Jurafsky's Coursera NLP class from 2012.

Description

The aim of this assignment is to implement a spell checker based on the noisy channel model. You are given the edit model (the likelihood term of the noisy channel), and your task is to implement the language model (the prior term). At test time, each sentence contains exactly one typing error. The correction with the highest probability under the noisy channel model, computed with your language model, is selected. Your models will be evaluated on accuracy.
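The selection step described above can be sketched as follows. This is a minimal illustration and not the code in SpellCorrect.py: the function name pick_correction, the candidate format (a token list paired with an edit log-probability), and the toy probabilities are all assumptions made for the example.

```python
import math

def pick_correction(candidates, lm_score):
    """Return the candidate sentence maximizing log P(x|w) + log P(w).

    candidates: list of (sentence_tokens, edit_log_prob) pairs (hypothetical format).
    lm_score:   function mapping a token list to its log-probability (the prior).
    """
    best_sentence, best_total = None, -math.inf
    for sentence, edit_log_prob in candidates:
        total = edit_log_prob + lm_score(sentence)  # likelihood + prior, in log space
        if total > best_total:
            best_sentence, best_total = sentence, total
    return best_sentence

# Toy unigram prior with made-up probabilities, for illustration only.
toy_log_prob = {
    "the": math.log(0.5),
    "cat": math.log(0.3),
    "sat": math.log(0.19),
    "cot": math.log(0.01),
}

def toy_lm_score(sentence):
    return sum(toy_log_prob[word] for word in sentence)

candidates = [
    (["the", "cot", "sat"], math.log(0.9)),  # keep the word as typed
    (["the", "cat", "sat"], math.log(0.1)),  # one-character substitution
]
chosen = pick_correction(candidates, toy_lm_score)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. In this toy setup the language model prior outweighs the edit model's preference for the typed word.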

Files

The assignment consists of the following files (files in bold are to be implemented):

File                                    Description
data/holbrook-tagged-train.dat          The training corpus.
data/holbrook-tagged-dev.dat            The development corpus.
data/count_1edit.txt                    Table listing counts of edits x|w (Wikipedia).
python/SpellCorrect.py                  Main runnable script that prepares the data and trains and evaluates the models.
python/UniformLanguageModel.py          Already implemented uniform probability model (this is the basic layout of a language model).
python/UnigramLanguageModel.py          Unigram probability model.
python/LaplaceUnigramLanguageModel.py   A unigram model with add-one smoothing. Treat out-of-vocabulary items as a word which was seen zero times in training.
python/LaplaceBigramLanguageModel.py    A bigram model with add-one smoothing.
python/StupidBackoffLanguageModel.py    An unsmoothed bigram model combined with backoff to an add-one smoothed unigram model.
python/CustomLanguageModel.py           A language model of your choice.
python/HolbrookCorpus.py                Loading and representation of the corpus.
python/Datum.py                         One unit (token) of the corpus.
python/Sentence.py                      List of datums.
python/EditModel.py                     The edit model.
python/SpellingResult.py                A spelling result.
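As an illustration of add-one (Laplace) smoothing for unigrams, here is a minimal sketch. It does not use the interface of the supplied files: the class name AddOneUnigram, the plain token-list input, and the choice of the training vocabulary size V as the smoothing denominator term are assumptions. A word seen c times gets probability (c + 1) / (N + V), so an out-of-vocabulary word (c = 0) gets 1 / (N + V).

```python
import math
from collections import Counter

class AddOneUnigram:
    """Hypothetical add-one smoothed unigram model over lists of tokens."""

    def __init__(self, sentences):
        # Count every token in the training sentences.
        self.counts = Counter(word for sentence in sentences for word in sentence)
        self.total = sum(self.counts.values())   # N: number of training tokens
        self.vocab_size = len(self.counts)       # V: number of distinct tokens

    def score(self, sentence):
        """Return the log-probability of a token list."""
        denominator = self.total + self.vocab_size
        # Counter returns 0 for unseen words, so they are scored as 1 / (N + V).
        return sum(math.log((self.counts[word] + 1) / denominator)
                   for word in sentence)

model = AddOneUnigram([["a", "b", "a"], ["b", "c"]])
seen = model.score(["a"])      # log(3 / 8)
unseen = model.score(["z"])    # log(1 / 8)
```

Some implementations use V + 1 in the denominator to reserve mass for a single unknown-word type; either choice keeps unseen words from receiving zero probability.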

Goal

Your task is to implement the models shown in bold above. For the custom language model, implement a model of your choice, such as interpolated Kneser-Ney, Good-Turing smoothing, trigrams, or anything else you come up with, provided that you train it on the supplied training data. The custom language model must perform at least as well as the stupid backoff model.
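For orientation, the stupid backoff baseline that the custom model has to match can be sketched as below. This is a simplified illustration under stated assumptions, not a reference solution: the class name StupidBackoffSketch and the token-list input are hypothetical, the backoff factor 0.4 is the value commonly cited for stupid backoff rather than something this assignment specifies, and the first word of a sentence is left unscored because no sentence-start marker is assumed.

```python
import math
from collections import Counter

class StupidBackoffSketch:
    """Hypothetical bigram model backing off to an add-one smoothed unigram."""

    def __init__(self, sentences, backoff=0.4):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            self.unigrams.update(sentence)
            self.bigrams.update(zip(sentence, sentence[1:]))
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams)
        self.backoff = backoff  # assumed value, not given in the assignment

    def _unigram_prob(self, word):
        # Add-one smoothed unigram estimate: (c + 1) / (N + V).
        return (self.unigrams[word] + 1) / (self.total + self.vocab_size)

    def score(self, sentence):
        """Return a log score (not a normalized log-probability)."""
        total = 0.0
        for prev, word in zip(sentence, sentence[1:]):
            pair_count = self.bigrams[(prev, word)]
            if pair_count > 0:
                # Seen bigram: unsmoothed relative frequency.
                total += math.log(pair_count / self.unigrams[prev])
            else:
                # Unseen bigram: discounted smoothed unigram score.
                total += math.log(self.backoff * self._unigram_prob(word))
        return total

sb = StupidBackoffSketch([["a", "b", "a"], ["b", "c"]])
seen_bigram = sb.score(["a", "b"])     # log(1 / 2)
unseen_bigram = sb.score(["a", "c"])   # log(0.4 * 2 / 8)
```

Stupid backoff scores are not true probabilities because the backed-off mass is not renormalized. That is acceptable here, since the spell checker only compares scores across candidate corrections.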

Grading

The evaluation is based on the implementation and accuracy of the models on development data. The expected log-probability performance is as follows:

Download files