Project 2b: Visualizing Multi-Dimensional Data
Overview
Implement a self-organizing map (wiki, slides) with rectangular [1] topology and use it to visualize the Seeds data set from the UCI Machine Learning Repository in two dimensions.
[The data file and additional information can be found on the original dataset homepage. Alternatively: a simple text file with 210 lines (samples) and 8 columns (7 features + 1 class label).]
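A minimal training loop for such a map might look as follows. This is only a sketch, not a prescribed implementation: the function name, the exponential decay schedules, and the Gaussian neighborhood are illustrative choices, and the hyperparameter defaults (learning rate, radius, epochs) are assumptions you will want to tune.

```python
import numpy as np

def train_som(X, rows=10, cols=10, epochs=50, lr0=0.5, sigma0=3.0, seed=0):
    """Train a SOM with rectangular topology on data X of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # initialize neuron weights uniformly within the data range
    W = rng.uniform(X.min(0), X.max(0), size=(rows, cols, d))
    # grid coordinates of each neuron, used by the neighborhood function
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(epochs):
        frac = t / max(epochs - 1, 1)
        lr = lr0 * (0.01 / lr0) ** frac          # decaying learning rate
        sigma = sigma0 * (0.5 / sigma0) ** frac  # shrinking neighborhood radius
        for x in X[rng.permutation(n)]:
            dists = np.linalg.norm(W - x, axis=-1)              # distance to every neuron
            bmu = np.unravel_index(np.argmin(dists), dists.shape)  # best-matching unit
            # Gaussian neighborhood over grid distance to the BMU
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1) / (2 * sigma ** 2))
            W += lr * g[..., None] * (x - W)                    # pull neurons toward x
    return W
```

After training, `W[i, j]` holds the weight vector of the neuron at grid position (i, j), which is what the report diagrams below are built from.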
Deadline: April 25th, 2019
Report
In your report, include:
- how the quantization error [2] decreases throughout training [plot]
- how the average adjustment [3] of neuron positions changes during training [plot]
- which neurons are activated by which classes of input [diagram]
- single rectangular graph showing class membership for relevant neurons
- sanity check: the classes should not overlap much in a properly trained SOM
- how the value of each of the seven attributes changes across the map [heatmap]
- one rectangular heatmap for each attribute
- the U-matrix: distances between adjacent neurons, computed jointly over both map dimensions (the examples below show distances along the individual dimensions) [heatmap]
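Two of the quantities above can be computed directly from the trained weights. A sketch, assuming the weights are a NumPy array `W` of shape (rows, cols, features) and the data `X` has shape (n_samples, features); the function names are illustrative:

```python
import numpy as np

def quantization_error(X, W):
    """Average distance from each sample to its best-matching neuron."""
    flat = W.reshape(-1, W.shape[-1])                       # (rows*cols, d)
    d = np.linalg.norm(X[:, None, :] - flat[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def u_matrix(W):
    """Mean distance of each neuron to its 4-connected grid neighbours."""
    rows, cols, _ = W.shape
    U = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            nbrs = [W[i2, j2]
                    for i2, j2 in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= i2 < rows and 0 <= j2 < cols]
            U[i, j] = np.mean([np.linalg.norm(W[i, j] - w) for w in nbrs])
    return U
```

Plotting `u_matrix(W)` as an image gives the joint U-matrix heatmap; tracking `quantization_error(X, W)` after each epoch gives the first plot.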
Bonus
Examine whether a self-organizing map can be used as a successful classifier.
- split the dataset (210 samples) into a training (150 samples) and a testing set (60 samples)
- make sure that the classes are equally represented in both of them
- train the map on the features (but not classes) of the training set
- assign a class to each neuron of the map
- i.e., the most prevalent class among the training inputs for which that neuron is the best match
- test: for each test input, find the best-matching neuron and output its class
(Feel free to implement and describe a more sophisticated scheme for classifying.)
Investigate how to select the map parameters to maximize the testing accuracy.
Report classification accuracy and the confusion matrix.
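The labeling and testing steps above can be sketched as follows. This assumes a trained weight array `W` as before; the function names and the `-1` placeholder for neurons that win no training input are illustrative choices (a more sophisticated scheme could, e.g., copy the label of the nearest labeled neuron instead).

```python
import numpy as np
from collections import Counter

def label_neurons(X_train, y_train, W):
    """Assign each neuron the most prevalent class among the inputs it wins."""
    flat = W.reshape(-1, W.shape[-1])
    bmus = np.linalg.norm(X_train[:, None] - flat[None], axis=-1).argmin(axis=1)
    labels = np.full(len(flat), -1)          # -1 marks neurons that win nothing
    for k in np.unique(bmus):
        labels[k] = Counter(y_train[bmus == k]).most_common(1)[0][0]
    return labels

def predict(X_test, W, labels):
    """For each test input, output the class of its best-matching neuron."""
    flat = W.reshape(-1, W.shape[-1])
    bmus = np.linalg.norm(X_test[:, None] - flat[None], axis=-1).argmin(axis=1)
    return labels[bmus]
```

Accuracy is then simply the fraction of test samples for which `predict` returns the true class, and the confusion matrix can be tallied from the (true, predicted) pairs.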
Example
Diagrams from a self-organizing map of size 20×15, trained (not perfectly) on the Iris dataset:
[1] Or hexagonal, if you want to; it yields nicer results.
[2] Quantization error = the average distance of a data point \(x_i\) to its best-matching neuron \(c_j\): \[ E = \frac{1}{n} \sum_i \min_j \| x_i - c_j \| \]
[3] Average adjustment of the neurons at time \(t\); with \(\Delta c_j(t) = c_j(t) - c_j(t-1)\): \[ A(t) = \frac{1}{k} \sum_j \| \Delta c_j(t) \| \]
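The adjustment formula above translates directly to code, given two snapshots of the weight array taken at consecutive time steps. A sketch (the function name is illustrative):

```python
import numpy as np

def average_adjustment(W_prev, W_curr):
    """A(t) = mean over the k neurons of || c_j(t) - c_j(t-1) ||."""
    delta = (W_curr - W_prev).reshape(-1, W_curr.shape[-1])
    return np.linalg.norm(delta, axis=1).mean()
```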