## Project 1: Multilayer perceptron

### Overview

**Task**: implement a general multilayer perceptron classifier (supporting at least one hidden layer), trained by the backpropagation algorithm. Employ this model on a task of classifying points on a plane into three categories. Use a validation technique to select the best performing model, then perform final testing.

**Deadline**: March 31st, 23:59 CEST

### Specifics

#### Model

- Multi-Layer Perceptron, having at least one *non-linear* hidden layer
- (Stochastic) Gradient Descent via Back-Propagation (online, “true” or mini-batch)
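As a starting point, the core of such a model can be sketched as below. This is a minimal illustration, not the required implementation: the tanh hidden layer, softmax output, cross-entropy loss and single-sample (online) updates are all choices you are expected to experiment with.

```python
import numpy as np

def init_mlp(n_in, n_hid, n_out, rng):
    """Small random weights for a 1-hidden-layer MLP (bias folded in as an extra column)."""
    return {
        "W1": rng.normal(scale=0.1, size=(n_hid, n_in + 1)),
        "W2": rng.normal(scale=0.1, size=(n_out, n_hid + 1)),
    }

def forward(params, x):
    """x: (n_in,) input. Returns input-with-bias, hidden activations and softmax output."""
    a = np.append(x, 1.0)                       # append constant bias input
    h = np.tanh(params["W1"] @ a)               # non-linear hidden layer
    hb = np.append(h, 1.0)
    z = params["W2"] @ hb
    y = np.exp(z - z.max()); y /= y.sum()       # numerically stable softmax
    return a, h, hb, y

def backprop_step(params, x, target, lr=0.1):
    """One online SGD step with cross-entropy loss; target is a one-hot vector."""
    a, h, hb, y = forward(params, x)
    d_out = y - target                                      # softmax + cross-entropy gradient
    d_hid = (params["W2"][:, :-1].T @ d_out) * (1 - h**2)   # back through tanh
    params["W2"] -= lr * np.outer(d_out, hb)
    params["W1"] -= lr * np.outer(d_hid, a)
```

The bias is handled by appending a constant 1 to each layer's input, so no separate bias vectors are needed.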

#### Data

- one-line header, then one sample per line
- points in a 2D plane (2 real-valued inputs)
- three output classes (`A`, `B` and `C`)
- train set – `2d.trn.dat`, 8000 samples – training data (estimation and validation)
- test set – `2d.tst.dat`, 2000 samples – testing data
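Loading can be a one-liner with `np.loadtxt`. The sketch below assumes each data line holds the two coordinates followed by the class letter; check the actual header of `2d.trn.dat` and adjust the column layout if it differs.

```python
import numpy as np

def load_2d_data(path):
    """Read a '2d.*.dat' file: skip the one-line header; assume columns x, y, class letter."""
    raw = np.loadtxt(path, skiprows=1, dtype=str)
    X = raw[:, :2].astype(float)                               # 2 real-valued inputs
    labels = np.array([ord(c) - ord("A") for c in raw[:, 2]])  # 'A'/'B'/'C' -> 0/1/2
    return X, labels
```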

#### Training

Split the training data set into a bigger *estimation* subset and a smaller *validation* subset.^{1} Use this split to perform *model selection*, i.e. find the best-performing combination of *hyper-parameters* (*model architecture*: number of hidden layers, neuron counts, …; *training parameters*: learning rate, …):

- train the model on the *estimation* subset
- test the model on the *validation* subset (not the *test* set! don’t touch that yet!)
- remember the hyper-parameters of the best performing model
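The split itself can look like the following sketch; the 80/20 ratio and the fixed seed are illustrative choices, not prescribed by the assignment.

```python
import numpy as np

def split_estimation_validation(X, y, val_fraction=0.2, seed=0):
    """Shuffle the training set, then hold out `val_fraction` as the validation subset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    est, val = idx[n_val:], idx[:n_val]
    return X[est], y[est], X[val], y[val]
```

Shuffling before splitting matters if the file happens to list samples grouped by class.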

(*Sanity check*: a properly working network should reach a classification accuracy of \(\geq 95\%\))

Also try experimenting with some of the following:

- input preprocessing (e.g. normalization/rescaling)
- activation functions (*logsig*, *tanh*, *softmax*, …)
- output encoding (*one-hot encoding*, *ordinal*)
- training length and/or *early-stopping*
- learning rate schedule
- weight initialization type (*uniform*, *Gaussian*, *sparse*, *orthogonal*) and scale
- momentum type (*none*, *classic*, *Nesterov’s accelerated gradient*) and strength
- regularization:
  - implicit (weight decay, …)
  - explicit (\(L_1\), \(L_2\), …)
- regression error metric used for training (*square error*, *categorical cross-entropy*/*log-loss*, …)
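Two of the simplest items above, input normalization and one-hot output encoding, could be sketched like this (z-scoring is just one reasonable rescaling choice):

```python
import numpy as np

def normalize(X, mean=None, std=None):
    """Z-score the inputs; statistics should come from the estimation subset only,
    then be reused for validation and test data."""
    if mean is None:
        mean, std = X.mean(axis=0), X.std(axis=0)
    return (X - mean) / std, mean, std

def one_hot(labels, n_classes=3):
    """Encode integer class labels 0..n_classes-1 as one-hot rows."""
    return np.eye(n_classes)[labels]
```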

#### Testing

Using the best performing set of hyper-parameters (on the validation set), train a *new* model on the full *training set*, then perform final testing on the *test set*. Report classification accuracy, regression error and calculate a confusion matrix.
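Computing the confusion matrix is straightforward; a minimal sketch, with rows = actual classes and columns = predicted classes as the report requires, plus a helper for the column-normalized (100% per column) view:

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes=3):
    """Counts: rows = actual classes, columns = predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

def column_percentages(cm):
    """Normalize so that each column sums to 100%."""
    return 100.0 * cm / cm.sum(axis=0, keepdims=True)
```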

#### Bonus

Train the model using a more sophisticated method, such as:

- Scaled Conjugate Gradient [2 pt]
- A newer method (published after 2010): Adagrad, RMSprop, Adam, … [1 pt]
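For orientation, an Adam update for a single weight array can be sketched as follows; the default hyper-parameters shown are the commonly cited ones from the original paper, and in your network you would keep one `state` dict per weight matrix.

```python
import numpy as np

def adam_update(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step for one weight array.
    state holds the first/second moment estimates and the step count."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad       # first-moment estimate
    state["v"] = b2 * state["v"] + (1 - b2) * grad**2    # second-moment estimate
    m_hat = state["m"] / (1 - b1 ** state["t"])          # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```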

### Submission

Submit your code and report for this project as a single archive (e.g. `.zip`).

#### Code

Projects should be written in Python; use of previously finished labs is strongly encouraged. All the “interesting” bits should be identifiable in the code (especially all the relevant equations). You’ll probably need no additional libraries other than the standard `numpy`/`scipy`/`matplotlib` combo. Don’t reinvent the wheel: use `np.loadtxt`/`np.savetxt`/`np.load`/`np.save` and `plt.savefig` where necessary.

Model selection should not be performed by hand; rather, the project should include a runnable program^{2} that goes through the various combinations of hyper-parameters, selects the best-performing model, runs the final test and produces the final outputs.
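The selection loop itself can be as simple as the sketch below; `train_eval` stands for your own training routine, which is assumed to return validation accuracy for a given parameter combination.

```python
import itertools

def grid_search(train_eval, grid):
    """Exhaustively try every hyper-parameter combination.
    grid: dict mapping parameter name -> list of candidate values.
    train_eval: callable taking those parameters, returning validation accuracy."""
    best_acc, best_params = -1.0, None
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        acc = train_eval(**params)
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_params, best_acc
```

With a small grid this exhaustive search is perfectly adequate; remember to log each combination's estimation and validation error for the report table.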

(If you’d desperately prefer to use another language, write us an e-mail.)

#### Report

Create a report – briefly describing model selection, training and testing – in `.pdf` format. The report should be sufficiently detailed that one can read the description, reimplement your project and, using the provided parameters, arrive at the *same results* (reproducibility). (Assume prior knowledge of neural network algorithms, so for example don’t explain how backpropagation works. But do include details such as whether training was online/mini-batch/batch, and whether and with what strength momentum was used.)

- for each examined model (hyper-parameter combination), report estimation and validation error **[table]**
- for the best model:
  - error vs. time (at least one instance) **[plot]**
  - outputs in 2D **[plot]**
  - confusion matrix **[table]**
    - rows = actual classes
    - columns = predicted classes
    - sum of each column = 100%
- correct submissions with highest (testing) accuracies will be awarded bonus points
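The “outputs in 2D” plot can be produced with `plt.scatter` and saved with `plt.savefig`, for example along these lines (the colours, marker size and file name are arbitrary choices; the `Agg` backend just lets the script run without a display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripted runs
import matplotlib.pyplot as plt

def plot_outputs_2d(X, predicted, path="outputs_2d.png"):
    """Scatter the test points coloured by predicted class (0/1/2 -> A/B/C)."""
    for cls, colour in zip(range(3), ["tab:red", "tab:green", "tab:blue"]):
        mask = predicted == cls
        plt.scatter(X[mask, 0], X[mask, 1], s=5, c=colour, label="ABC"[cls])
    plt.legend()
    plt.savefig(path)
    plt.close()
```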