Project 2b: Visualizing Multi-Dimensional Data

Overview

Implement a self-organizing map (wiki, slides) with rectangular¹ topology and use it to visualize the Seeds data set from the UCI Machine Learning Repository in two dimensions.
[The data file and additional information are available on the original dataset homepage. Alternatively, use a plain-text version with 210 lines (samples) and 8 columns (7 features + class).]
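A minimal sketch of online SOM training on a rectangular grid may help as a starting point. The assignment does not prescribe any of the following, so treat them as assumptions: the Gaussian neighbourhood function, the linear decay of the learning rate and neighbourhood radius, initialising neurons from randomly chosen samples, and every name and default value (`train_som`, `lr0`, `sigma0`, `epochs`). Feature scaling is left out. Plain lists are used only to keep the sketch dependency-free.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(weights, x):
    """Return the (row, col) grid position of the neuron closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sq_dist(w, x)
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map; weights[r][c] is the weight vector of a neuron."""
    rng = random.Random(seed)
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0  # assumed initial neighbourhood radius
    # Assumption: start every neuron at a randomly chosen sample (copied).
    weights = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # present the samples in a new random order each epoch
        for i in order:
            x = data[i]
            decay = 1.0 - step / total_steps  # goes linearly from 1 towards 0
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 0.5)  # keep the radius from vanishing
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Neighbourhood strength depends on distance in the grid,
                    # not on distance in feature space.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(len(w)):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights
```

For the Seeds data, `data` would be the 210 rows with the class column removed, for example `train_som(features, 20, 15)`; the map size here is only borrowed from the example at the end of this page.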

Deadline: April 25th, 2019

Report

In your report, include:

how quantization error² decreases throughout training [plot]
how the average adjustment³ of neuron positions changes during training [plot]
which neurons are activated by which classes of input [diagram]
- single rectangular graph showing class membership for relevant neurons
- sanity check: the classes should not collide much in a properly trained SOM
how the value of each of the seven attributes changes across the map [heatmap]
- one rectangular heatmap for each attribute
the U-matrix, distances between adjacent neurons (jointly for both dimensions; examples below show distances in individual map dimensions) [heatmap]
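The two heatmap items above can be sketched as follows. This assumes the trained map is stored as a nested list `weights[r][c]` of weight vectors (a hypothetical layout, not required by the assignment) and takes the U-matrix value of a neuron to be the mean distance to its up, down, left, and right grid neighbours, which is one common convention among several.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def component_plane(weights, k):
    """Value of attribute k at every grid position (one heatmap per attribute)."""
    return [[w[k] for w in row] for row in weights]


def u_matrix(weights):
    """Mean weight-space distance from each neuron to its grid neighbours."""
    rows, cols = len(weights), len(weights[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:  # skip positions off the grid
                    dists.append(euclid(weights[r][c], weights[nr][nc]))
            # A 1x1 map has no neighbours; report 0 in that case.
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u
```

Both functions return a rows x cols grid of numbers that can be passed to whatever heatmap routine you use for plotting.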

Bonus

Examine whether a self-organizing map can be used as a successful classifier.

split the dataset (210 samples) into a training (150 samples) and a testing set (60 samples)
- make sure that the classes are equally represented in both of them
train the map on the features (but not classes) of the training set
assign a class to each neuron of the map
- most prevalent class of inputs corresponding to that neuron
test: for each test input, find the best-matching neuron and output its class

(Feel free to implement and describe a more sophisticated scheme for classifying.)
Investigate how to select the map parameters to maximize the testing accuracy.
Report classification accuracy and the confusion matrix.
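The basic scheme above can be sketched as follows, assuming an already trained map stored as a nested list `weights[r][c]`. All function names are hypothetical. One choice goes beyond the description: a test input whose best-matching neuron received no training samples has no class to output, so this sketch searches only among labelled neurons. That fallback is an assumption, not part of the assignment.

```python
import random
from collections import Counter, defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def stratified_split(labels, test_per_class, seed=0):
    """Return (train_idx, test_idx) with test_per_class test samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        test.extend(idx[:test_per_class])
        train.extend(idx[test_per_class:])
    return train, test


def label_neurons(weights, samples, labels):
    """Map each neuron hit by training data to the most common class among its hits."""
    positions = [(r, c) for r in range(len(weights)) for c in range(len(weights[0]))]
    hits = defaultdict(Counter)
    for x, y in zip(samples, labels):
        bmu = min(positions, key=lambda p: sq_dist(weights[p[0]][p[1]], x))
        hits[bmu][y] += 1
    return {pos: counts.most_common(1)[0][0] for pos, counts in hits.items()}


def predict(weights, neuron_class, x):
    """Class of the closest labelled neuron (assumed fallback for unlabelled ones)."""
    bmu = min(neuron_class, key=lambda p: sq_dist(weights[p[0]][p[1]], x))
    return neuron_class[bmu]


def confusion_matrix(true, pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {cls: i for i, cls in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(true, pred):
        m[index[t]][index[p]] += 1
    return m


def accuracy(m):
    """Fraction of samples on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total
```

Assuming the three classes contain 70 samples each, `stratified_split(labels, 20)` yields the required 150 training and 60 testing samples with every class equally represented in both.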

Example

Diagrams from a self-organizing map of size 20x15 (not perfectly) trained on the Iris dataset:

¹ Or hexagonal, if you prefer; it yields nicer results.
² Quantization error: the average distance from each data point \(x_i\) to its best-matching neuron \(c_j\), where \(n\) is the number of data points: \[ E = \frac{1}{n} \sum_i \min_j \| x_i - c_j \| \]
³ Average adjustment of the neurons at time \(t\), where \(\Delta c_j(t) = c_j(t) - c_j(t-1)\) and \(k\) is the number of neurons: \[ A(t) = \frac{1}{k} \sum_j \| \Delta c_j(t) \| \]
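The two training metrics defined in these notes can be computed directly from their formulas. The sketch below assumes the neurons are given as a flat list of weight vectors and that a copy of that list is kept from the previous time step; the function names are hypothetical. Whether a time step means one sample presentation or one full epoch is left open here, so pick one and state it in the report.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def quantization_error(data, neurons):
    """E: mean distance from each data point to its closest neuron."""
    return sum(min(euclid(x, c) for c in neurons) for x in data) / len(data)


def average_adjustment(prev_neurons, curr_neurons):
    """A(t): mean distance each neuron moved between two consecutive time steps."""
    moved = sum(euclid(p, c) for p, c in zip(prev_neurons, curr_neurons))
    return moved / len(curr_neurons)
```

Calling both functions once per time step and storing the results gives the two curves requested in the report.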