1st April 2018
Please submit the solution (runnable source files) by email to the lecturer's email address. Use "NLP: Task 03" as the subject.
This homework is adapted from Chris Manning and Dan Jurafsky's Coursera NLP class from 2012.
The aim of this assignment is to implement a spell checker using the noisy channel model. In particular, you will be given the edit model (the likelihood term of the noisy channel), and your task is to implement the language model (the prior term). At test time, you will be given a sentence containing exactly one typing error. The correction with the highest probability under the noisy channel model, using your language model, is then selected. Your models will be evaluated for accuracy.
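The noisy channel selection step can be sketched as follows: for each candidate correction, add the edit model's log-likelihood to the language model's log-prior and keep the argmax. The `edit_probabilities` and `score` method names below are illustrative assumptions, not the assignment's actual API:

```python
import math

def correct_sentence(sentence, edit_model, language_model):
    """Pick the candidate correction with the highest noisy-channel score.

    Assumed (hypothetical) interface: edit_model.edit_probabilities(sentence)
    yields (candidate_sentence, log_edit_prob) pairs, and
    language_model.score(candidate) returns a log-probability.
    """
    best, best_score = None, -math.inf
    for candidate, log_edit_prob in edit_model.edit_probabilities(sentence):
        # Noisy channel in log space: log P(x|w) + log P(w)
        score = log_edit_prob + language_model.score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Working in log space avoids numerical underflow when multiplying many small probabilities.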
The assignment consists of the following files (bold are to be implemented):
File | Description |
---|---|
data/holbrook-tagged-train.dat | The training corpus. |
data/holbrook-tagged-dev.dat | The development corpus. |
data/count_1edit.txt | Table listing counts of edits x\|w (derived from Wikipedia). |
python/SpellCorrect.py | Main runnable script that prepares the data and trains and evaluates the models. |
python/UniformLanguageModel.py | Already implemented uniform probability model (this is the basic layout of a language model). |
python/UnigramLanguageModel.py | Unigram probability model. |
python/LaplaceUnigramLanguageModel.py | A unigram model with add-one smoothing. Treat out-of-vocabulary items as a word which was seen zero times in training. |
python/LaplaceBigramLanguageModel.py | A bigram model with add-one smoothing. |
python/StupidBackoffLanguageModel.py | An unsmoothed bigram model combined with backoff to an add-one smoothed unigram model. |
python/CustomLanguageModel.py | A language model of your choice. |
python/HolbrookCorpus.py | Loading and representation of the corpus. |
python/Datum.py | One unit (token) of the corpus. |
python/Sentence.py | List of datums. |
python/EditModel.py | The edit model. |
python/SpellingResult.py | A spelling result. |
Your task is to implement the models shown in bold above. In the custom language model, you should implement a model of your choice, such as interpolated Kneser-Ney, Good-Turing smoothing, trigrams, or anything else you come up with, provided that you train the model on the supplied training data. The custom language model must perform at least as well as the stupid backoff model.
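As a rough guide to what the stupid backoff model involves, the sketch below combines an unsmoothed bigram estimate with a fixed-discount backoff to an add-one smoothed unigram. The class name, constructor signature, and the choice of `alpha = 0.4` are illustrative assumptions; the assignment's own classes and corpus representation will differ:

```python
import math
from collections import defaultdict

class StupidBackoffLM:
    """Sketch of stupid backoff: use the unsmoothed bigram estimate when the
    bigram was seen in training, otherwise back off (with a fixed discount
    alpha) to an add-one smoothed unigram estimate."""

    def __init__(self, corpus, alpha=0.4):
        # corpus: iterable of token lists, one list per sentence (assumed format)
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.total = 0
        self.alpha = alpha
        for sentence in corpus:
            for i, w in enumerate(sentence):
                self.unigrams[w] += 1
                self.total += 1
                if i > 0:
                    self.bigrams[(sentence[i - 1], w)] += 1

    def score(self, sentence):
        """Log-probability of a token list under the model."""
        logp = 0.0
        vocab = len(self.unigrams) + 1  # +1 for the unseen-word class
        for i, w in enumerate(sentence):
            prev = sentence[i - 1] if i > 0 else None
            big = self.bigrams.get((prev, w), 0)
            if prev is not None and big > 0:
                # unsmoothed bigram: count(prev, w) / count(prev)
                logp += math.log(big) - math.log(self.unigrams[prev])
            else:
                # back off to add-one smoothed unigram, discounted by alpha
                logp += math.log(self.alpha)
                logp += math.log(self.unigrams.get(w, 0) + 1)
                logp -= math.log(self.total + vocab)
        return logp
```

Note that stupid backoff scores are not normalized probabilities, which is acceptable here because only the ranking of candidate corrections matters.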
The evaluation is based on the implementation and accuracy of the models on development data. The expected log-probability performance is as follows: