
Natural Language Processing

Task 06: Word Embeddings and POS Tagging

Deadline

28th May 2018

Submission

Please submit your solution by email to the lecturer. Use "NLP: Task 06" as the subject.

Acknowledgment

This assignment is adapted from the PyTorch tutorials by Robert Guthrie.

Description

In this assignment you will work with the PyTorch framework for neural networks. There are two separate tasks, which can be solved independently. In the first, you will learn how to train word embedding models that use dense distributed representations. In the second, you will study the design of a simple POS tagging model from an example that uses LSTM networks, and implement your own augmentation of this model.

Files (word embeddings)

The first task of the assignment consists of the following files (those marked with * are to be implemented):

File         Description
trigrams.py  Neural trigrams (example)
cbow.py *    CBOW model

Goal (word embeddings)

In the first task you will follow the PyTorch Word Embeddings tutorial. Many tutorials are available to get you acquainted with PyTorch, and you will often need the developer documentation: torch, torch.nn. Review the n-gram model in the word embeddings tutorial and try to implement the CBOW model.
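As a starting point, below is a minimal sketch of what cbow.py could look like, assuming the same setup as the tutorial's n-gram example (one embedding layer, a linear projection to the vocabulary, and log-softmax output). The class name, context size, and dimensions are illustrative, not prescribed by the assignment:

    import torch.nn as nn
    import torch.nn.functional as F

    CONTEXT_SIZE = 2  # words taken on each side of the target


    class CBOW(nn.Module):
        # Predict a target word from the sum of its context word embeddings.

        def __init__(self, vocab_size, embedding_dim):
            super(CBOW, self).__init__()
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)
            self.linear = nn.Linear(embedding_dim, vocab_size)

        def forward(self, context_idxs):
            # context_idxs: LongTensor of shape (2 * CONTEXT_SIZE,)
            embeds = self.embeddings(context_idxs)    # (2 * CONTEXT_SIZE, embedding_dim)
            summed = embeds.sum(dim=0, keepdim=True)  # (1, embedding_dim)
            return F.log_softmax(self.linear(summed), dim=1)  # (1, vocab_size)

Training then mirrors the n-gram example from the tutorial: build (context, target) index pairs from the text, feed each context through the model, and minimize nn.NLLLoss between the returned log-probabilities and the target index, e.g. with torch.optim.SGD.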

Files (POS tagging)

The second task of the assignment consists of the following files:

File                  Description
data/train.txt        Training data
data/train-small.txt  Smaller training data (for debugging)
tagger.py             Already implemented POS tagger
lstm.py               Already implemented POS tagging LSTM model and helper functions

Goal (POS tagging)

The second task is to design and implement a sequence model for simple POS tagging. The task is derived from the PyTorch Sequence Models / LSTM tutorial, but uses the chunking data from the CoNLL-2000 shared task, which contains POS-tagged sentences. First, review the example model in lstm.py and tagger.py and try to run the latter. The accuracy of the model is evaluated (although only on the training data; see the comments in the source code). Your task (different from the one in the tutorial) is to improve this model by enriching the word embeddings. Hint:
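One natural enrichment (an assumption here, following the character-level features exercise at the end of the PyTorch tutorial) is to concatenate each word embedding with the final hidden state of a character-level LSTM run over the word's characters, so that morphological cues such as a -ly or -ed suffix can inform the tag. A sketch under that assumption follows; the class and parameter names are illustrative and do not correspond to identifiers in lstm.py:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class CharAugmentedTagger(nn.Module):
        # LSTM tagger whose word representations are enriched with a
        # character-level LSTM over each word's characters.

        def __init__(self, word_emb_dim, char_emb_dim, char_hidden_dim,
                     hidden_dim, vocab_size, charset_size, tagset_size):
            super(CharAugmentedTagger, self).__init__()
            self.word_embeddings = nn.Embedding(vocab_size, word_emb_dim)
            self.char_embeddings = nn.Embedding(charset_size, char_emb_dim)
            self.char_lstm = nn.LSTM(char_emb_dim, char_hidden_dim)
            # The sequence LSTM sees the word embedding concatenated with
            # the final hidden state of the character LSTM.
            self.lstm = nn.LSTM(word_emb_dim + char_hidden_dim, hidden_dim)
            self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

        def forward(self, word_idxs, char_idxs_per_word):
            # word_idxs: LongTensor (seq_len,); char_idxs_per_word: list of
            # LongTensors, one per word, holding character indices.
            word_reprs = []
            for word_idx, char_idxs in zip(word_idxs, char_idxs_per_word):
                word_emb = self.word_embeddings(word_idx)    # (word_emb_dim,)
                char_embs = self.char_embeddings(char_idxs)  # (n_chars, char_emb_dim)
                _, (char_h, _) = self.char_lstm(char_embs.unsqueeze(1))
                word_reprs.append(torch.cat([word_emb, char_h.view(-1)]))
            lstm_in = torch.stack(word_reprs).unsqueeze(1)   # (seq_len, 1, features)
            lstm_out, _ = self.lstm(lstm_in)
            tag_logits = self.hidden2tag(lstm_out.view(len(word_reprs), -1))
            return F.log_softmax(tag_logits, dim=1)          # (seq_len, tagset_size)

Note that torch.stack keeps the sequence dimension first, matching the (seq_len, batch, features) layout that nn.LSTM expects by default, so the rest of the training loop can stay the same as in the provided tagger.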

Then, tune the hyperparameters to achieve an accuracy of at least 0.5 (exact) and 0.95 (total). You should not need more than 30 epochs. Again, this is training accuracy and the required minimum is very low, so the numbers do not reflect practical performance; the point is purely didactic. You are encouraged to use a test set and to handle out-of-vocabulary words in your projects.
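For orientation, these are the kinds of knobs to experiment with; the names and values below are illustrative guesses and need not match the identifiers actually used in lstm.py or tagger.py:

    # Illustrative hyperparameter settings (hypothetical names/values):
    WORD_EMBEDDING_DIM = 64  # size of the word embeddings
    CHAR_EMBEDDING_DIM = 16  # size of the character embeddings
    CHAR_HIDDEN_DIM = 32     # hidden size of the character-level LSTM
    HIDDEN_DIM = 64          # hidden size of the sequence LSTM
    LEARNING_RATE = 0.1      # e.g. for torch.optim.SGD
    NUM_EPOCHS = 30          # the assignment expects 30 at most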

Grading

Download files