Project 2b: Visualizing Multi-Dimensional Data

Overview

Implement a self-organizing map (wiki, slides) with rectangular[1] topology and use it to visualize the Seeds data set from the UCI Machine Learning Repository in two dimensions.
[The data file and additional information can be found on the original dataset homepage. Alternatively, use a simple text file with 210 lines (samples) and 8 columns (7 features + class).]
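
As a starting point, the following is a minimal sketch of online SOM training on a rectangular grid, using NumPy. The function name `train_som`, the Gaussian neighbourhood, the exponential decay of learning rate and radius, and all default parameter values are assumptions chosen for illustration; the assignment does not prescribe them.

```python
import numpy as np

def train_som(data, rows=10, cols=10, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a rectangular SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0  # assumed initial neighbourhood radius

    # Initialise each neuron with a randomly chosen training sample.
    picks = rng.integers(0, n, size=rows * cols)
    weights = data[picks].reshape(rows, cols, d).astype(float)

    # Grid coordinates of every neuron, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )

    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(n):
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * np.exp(-3.0 * frac)        # assumed decay schedule
            sigma = sigma0 * np.exp(-3.0 * frac)  # assumed decay schedule

            # Best-matching unit: neuron whose weight vector is closest to x.
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)

            # Gaussian neighbourhood measured on the grid, not in feature space.
            grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))

            # Pull every neuron towards x, weighted by its neighbourhood value.
            weights += lr * h[:, :, None] * (x - weights)
            step += 1
    return weights
```

With the 7 feature columns loaded into an array `features` of shape (210, 7), a call such as `train_som(features, rows=20, cols=15)` returns the trained weight grid. Because the features have different ranges, rescaling each column before training is probably worthwhile.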

Deadline: April 25th, 2019

Report

In your report, include:

Bonus

Examine whether a self-organizing map can be used as a successful classifier.

  1. split the dataset (210 samples) into a training (150 samples) and a testing set (60 samples)
    • make sure that the classes are equally represented in both of them
  2. train the map on the features (but not classes) of the training set
  3. assign a class to each neuron of the map
    • most prevalent class of inputs corresponding to that neuron
  4. test: for each test input, find the best-matching neuron and output its class
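
Steps 1, 3 and 4 above could be sketched as follows, assuming the trained map is available as a NumPy array `weights` of shape (k, d) with one row per neuron (a grid of weights can be flattened with `reshape(-1, d)`). All function names are hypothetical. One design decision is an assumption rather than part of the task: neurons that never win a training input receive no class, so prediction searches only among labelled neurons.

```python
import numpy as np
from collections import Counter

def stratified_split(labels, n_test_per_class, seed=0):
    """Return (train_idx, test_idx) with n_test_per_class test samples per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    test_idx = []
    for c in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == c))
        test_idx.extend(int(i) for i in members[:n_test_per_class])
    test_idx = np.array(sorted(test_idx))
    train_idx = np.setdiff1d(np.arange(len(labels)), test_idx)
    return train_idx, test_idx

def best_matching_units(weights, X):
    """Index of the closest neuron (row of weights) for each row of X."""
    d = np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def label_neurons(weights, X_train, y_train):
    """Map neuron index -> most prevalent class among the inputs it wins."""
    y_train = np.asarray(y_train)
    bmus = best_matching_units(weights, X_train)
    return {
        int(j): Counter(y_train[bmus == j].tolist()).most_common(1)[0][0]
        for j in np.unique(bmus)
    }

def predict(weights, neuron_labels, X):
    """Class of the best-matching labelled neuron for each row of X."""
    labelled = np.array(sorted(neuron_labels))  # assumption: skip unlabelled neurons
    nearest = best_matching_units(weights[labelled], X)
    return np.array([neuron_labels[int(labelled[i])] for i in nearest])
```

For the 150/60 split with three equally sized classes, `stratified_split(labels, 20)` keeps 20 samples of each class for testing and leaves 50 of each for training.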

(Feel free to implement and describe a more sophisticated scheme for classifying.)
Investigate how to select the map parameters to maximize the testing accuracy.
Report classification accuracy and the confusion matrix.
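
Accuracy and the confusion matrix can be computed from the true and predicted test labels, for example as in this short sketch. The function names are hypothetical, and the layout (rows are true classes, columns are predicted classes) is one common convention, assumed here.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

def accuracy(cm):
    """Fraction of correctly classified samples (diagonal over total)."""
    return np.trace(cm) / cm.sum()
```

Repeating this evaluation for several map sizes and training parameters gives a simple way to compare settings by testing accuracy.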

Example

Diagrams from an (imperfectly trained) self-organizing map of size 20x15, trained on the Iris dataset:


  1. Or hexagonal, if you want to; yields nicer results.

  2. quantization error = average distance of a data point \(x_i\) to its best-matching neuron \(c_j\): \[ E = \frac{1}{n} \sum_i \min_j \| x_i - c_j \| \]

  3. average amount of adjustment of the neurons at time \(t\), averaged over all \(k\) neurons – let \(\Delta c_j(t) = c_j(t) - c_j(t-1)\): \[ A(t) = \frac{1}{k} \sum_j \| \Delta c_j(t) \| \]
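
Both quantities defined in notes 2 and 3 translate directly into NumPy. The sketch below assumes the Euclidean norm, a data array `X` of shape (n, d), and neuron weights stored with the feature dimension last; the function names are hypothetical.

```python
import numpy as np

def quantization_error(X, weights):
    """E: mean distance of each data point to its best-matching neuron."""
    W = weights.reshape(-1, weights.shape[-1])  # flatten the grid to (k, d)
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return d.min(axis=1).mean()

def average_adjustment(weights_prev, weights_curr):
    """A(t): mean norm of the weight change of the neurons between two steps."""
    delta = weights_curr - weights_prev
    delta = delta.reshape(-1, delta.shape[-1])  # one row per neuron
    return np.linalg.norm(delta, axis=1).mean()
```

To track \(A(t)\) during training, keep a copy of the weights from the previous step (for example `weights.copy()`) and compare it with the current weights after each update or each epoch.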