Tutorial

Dimensionality Reduction

In this blog post we will learn how to apply dimensionality reduction to datasets.

This can be used to visualise word embeddings or any other data with more than two or three dimensions.

Dimensionality reduction for 2d and 3d visualisation

Two algorithms in particular, t-SNE and PCA, are easy to use for this because they are already implemented in sklearn.

First, a dataset has to be loaded; in this case, let's use the simple digits dataset from sklearn and normalize it by its maximum value:

In [1]:
%matplotlib notebook

from sklearn.datasets import load_digits

digits = load_digits()
print("Data:")
print(digits.data)
print("Maximum Value:")
print(digits.data.max())
print("Normalized Data:")
print(digits.data/digits.data.max())


data = (digits.data/digits.data.max())[:500]
labels = digits.target[:500]

print(labels)
Data:
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
Maximum Value:
16.0
Normalized Data:
[[ 0.      0.      0.3125 ...,  0.      0.      0.    ]
 [ 0.      0.      0.     ...,  0.625   0.      0.    ]
 [ 0.      0.      0.     ...,  1.      0.5625  0.    ]
 ..., 
 [ 0.      0.      0.0625 ...,  0.375   0.      0.    ]
 [ 0.      0.      0.125  ...,  0.75    0.      0.    ]
 [ 0.      0.      0.625  ...,  0.75    0.0625  0.    ]]
[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 9 5 5 6 5 0
 9 8 9 8 4 1 7 7 3 5 1 0 0 2 2 7 8 2 0 1 2 6 3 3 7 3 3 4 6 6 6 4 9 1 5 0 9
 5 2 8 2 0 0 1 7 6 3 2 1 7 4 6 3 1 3 9 1 7 6 8 4 3 1 4 0 5 3 6 9 6 1 7 5 4
 4 7 2 8 2 2 5 7 9 5 4 8 8 4 9 0 8 9 8 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
 8 9 0 1 2 3 4 5 6 7 8 9 0 9 5 5 6 5 0 9 8 9 8 4 1 7 7 3 5 1 0 0 2 2 7 8 2
 0 1 2 6 3 3 7 3 3 4 6 6 6 4 9 1 5 0 9 5 2 8 2 0 0 1 7 6 3 2 1 7 3 1 3 9 1
 7 6 8 4 3 1 4 0 5 3 6 9 6 1 7 5 4 4 7 2 8 2 2 5 5 4 8 8 4 9 0 8 9 8 0 1 2
 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 9 5 5 6 5 0 9 8 9
 8 4 1 7 7 3 5 1 0 0 2 2 7 8 2 0 1 2 6 3 3 7 3 3 4 6 6 6 4 9 1 5 0 9 5 2 8
 2 0 0 1 7 6 3 2 1 7 4 6 3 1 3 9 1 7 6 8 4 3 1 4 0 5 3 6 9 6 1 7 5 4 4 7 2
 8 2 2 5 7 9 5 4 8 8 4 9 0 8 9 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
 1 2 3 4 5 6 7 8 9 0 9 5 5 6 5 0 9 8 9 8 4 1 7 7 3 5 1 0 0 2 2 7 8 2 0 1 2
 6 3 3 7 3 3 4 6 6 6 4 9 1 5 0 9 5 2 8 2 0 0 1 7 6 3 2 1 7 4 6 3 1 3 9 1 7
 6 8 4 3 1 4 0 5 3 6 9 6 1 7 5 4 4 7 2]

Now we reduce the data to two and three dimensions with t-SNE:

In [2]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# note: t-SNE is used here despite the variable names; with this
# perplexity it can take a few seconds on 500 samples
twod_pca_data = TSNE(n_components=2, perplexity=100.0).fit_transform(data)
threed_pca_data = TSNE(n_components=3, perplexity=100.0).fit_transform(data)
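PCA, which was imported alongside TSNE above, can be used as a drop-in alternative with the same fit_transform interface; a minimal, self-contained sketch (the variable names here are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
data = (digits.data / digits.data.max())[:500]

# PCA exposes the same fit_transform interface as TSNE and is much
# faster, though it only captures linear structure in the data
twod_pca = PCA(n_components=2).fit_transform(data)
threed_pca = PCA(n_components=3).fit_transform(data)
print(twod_pca.shape, threed_pca.shape)
```

The resulting arrays can be passed to the same plotting code shown below, which makes it easy to compare the two projections side by side.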

2D visualisation

Now we create a simple visualisation of the two-dimensional data, where every class gets its own color.

In [4]:
import matplotlib.pyplot as plt

for label in set(digits.target_names):
    data_for_label = twod_pca_data[labels == label]
    plt.scatter(data_for_label[:, 0], data_for_label[:, 1], label=str(label))
plt.legend()
plt.tight_layout()
plt.show()

3D visualisation

We can do the same, using the data that has been reduced to three dimensions, to generate a 3D plot.

This is a bit more complicated code-wise, but once you have done it, it is very easy to replicate.

In [5]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection

fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d')
for label in set(digits.target_names):
    data_for_label = threed_pca_data[labels == label]
    ax.scatter(data_for_label[:, 0], data_for_label[:, 1],
               data_for_label[:, 2], label=str(label), s=300)
plt.legend()
plt.tight_layout()
plt.show()
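One caveat: t-SNE is stochastic, so the scatter plots above will look slightly different on every run. Passing `random_state` (a standard sklearn parameter) pins the random initialisation and makes the layout reproducible; a minimal, self-contained sketch:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
data = (digits.data / digits.data.max())[:500]

# fixing random_state pins the random initialisation, so repeated
# runs produce the same layout
embedding = TSNE(n_components=2, perplexity=100.0,
                 random_state=0).fit_transform(data)
print(embedding.shape)
```

This matters when you want to tweak plot styling without the point cloud jumping around between runs.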