Natural Language Processing

Projects

Overview

To turn in a successful project, you are required to:

pick an interesting NLP / computational linguistics problem of your choice
pick a solution and implement it
evaluate your solution using an appropriate corpus
write a report describing your solution

You may work individually or in a team of up to four people.

The expected size of the project is about the size of a programming assignment plus the report.

Deadlines

21st May: project definition
3 days before the end of the term period (29th May or 26th June): project submission

Report

The report is a 5-15 page document in the form of a scientific paper. It is the most important factor in your grading. The document should contain:

title page with the name of the author(s)
abstract (1-2 paragraphs) summarizing the project aim, methods, and results
introduction: a description of the task, the source of the corpus (dataset), and a description of the corpus (train/test set sizes, whether it is pure text or has attributes, and a description of those attributes)
description of the methods used and why you chose them
technical implementation details: which programming language and libraries were used and why, which problems you ran into, and how you solved them
experimental evaluation: how you measured the performance of your solution, which parameter values your model used and how the results compare across different values, comparisons with existing solutions on the same or similar data, and tables and plots together with their interpretation and your findings
conclusion: whether the project was successful or not, why, (hypothetical) future work

The report does not have to be overly complex or long. Describe each point concisely; length alone will not be considered during grading.
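
As an illustration of the experimental evaluation item above, the following is a minimal sketch of how per-label precision, recall, and F1 could be computed for a single-label classification or tagging task. It is only one possible evaluation method, not a required one. The function names `per_label_prf` and `macro_f1` and the toy label lists are hypothetical and chosen for this example; a real project would use the gold labels of its own test set.

```python
from collections import Counter


def per_label_prf(gold, predicted):
    """Return {label: (precision, recall, f1)} for two equal-length label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was expected but missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical toy data standing in for a real test set.
gold = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]
predicted = ["NOUN", "NOUN", "NOUN", "ADJ", "VERB", "VERB"]

scores = per_label_prf(gold, predicted)
for label, (p, r, f) in scores.items():
    print(f"{label}\tP={p:.2f}\tR={r:.2f}\tF1={f:.2f}")
print(f"macro-F1={macro_f1(scores):.2f}")
```

Running the same function over the outputs of several parameter settings gives the rows of a comparison table for the report. Macro-averaging is used here because it weights rare labels equally; whether that suits a given task is a judgment the report should justify.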

Grading

appropriate choice of problem: 20%
appropriate choice of methods used: 30%
quality of dataset, implementation, and report: 50%

Submission

Please submit the solution (implementation, report, and data, or a way to acquire the data) by email to the lecturer. Use "NLP: Project" as the subject.

Notes

the corpus may be in any natural language
the report may be written in English or Slovak
you may use existing libraries (for ML / NLP algorithms etc.) in your implementation, but you have to describe which parts you used, which parameters you set, and what role they play in the solution
if unsure about what problem to pick, contact the lecturer
interesting datasets can be found at https://github.com/niderhoff/nlp-datasets

Example Problems

automated annotation of documents (e.g. genre, domain, date of writing, age of author)
named entity recognition / relation extraction in interesting domains, such as the movie industry
POS tagging / parsing for the Slovak language
using information extraction to build knowledge graphs
using word sense similarity to remove redundancy in knowledge graphs