8 Unsupervised Learning

  .:rtemis 0.79: Welcome, egenn
  [x86_64-apple-darwin15.6.0 (64-bit): Defaulting to 4/4 available cores]
  Online documentation & vignettes: https://rtemis.netlify.com

Unsupervised learning aims to learn relationships within a dataset without focusing on a particular outcome. You will often hear of unsupervised learning being performed on unlabeled data. To be clear, that means the labels are not used to guide learning, whether or not they are available. You might, for example, perform unsupervised learning ahead of supervised learning, as we shall see later. Unsupervised learning includes a number of approaches, most of which can be divided into two categories:

  • Clustering: Cases are grouped together based on a derived measure of similarity or a distance metric.
  • Dimensionality Reduction / Matrix decomposition: Variables are combined / projected into a lower-dimensional space.

In rtemis, clustering algorithms begin with u. and decomposition / dimensionality reduction algorithms begin with d. (We use u., for unsupervised, because c. would clash with R's built-in c() function.)

8.1 Decomposition / Dimensionality Reduction

Use decomSelect() to get a listing of available decomposition algorithms:

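Called with no arguments, it prints the listing that follows:

decomSelect()
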
.:decomSelect
rtemis supports the following decomposition algorithms:

    Name                                   Description
     CUR                      CUR Matrix Approximation
   H2OAE                               H2O Autoencoder
 H2OGLRM                H2O Generalized Low-Rank Model
     ICA                Independent Component Analysis
  ISOMAP                                        ISOMAP
    KPCA           Kernel Principal Component Analysis
     LLE                      Locally Linear Embedding
     MDS                      Multidimensional Scaling
     NMF             Non-negative Matrix Factorization
     PCA                  Principal Component Analysis
    SPCA           Sparse Principal Component Analysis
     SVD                  Singular Value Decomposition
    TSNE   t-distributed Stochastic Neighbor Embedding
    UMAP Uniform Manifold Approximation and Projection

We can further divide decomposition algorithms into linear (e.g. PCA, ICA, NMF) and nonlinear dimensionality reduction, also known as manifold learning (e.g. LLE, t-SNE).

8.1.1 Linear Dimensionality Reduction

As a simple example, let's look at the famous iris dataset. Note that we use it to demonstrate usage only; it is not a good example for assessing the effectiveness of decomposition algorithms, as it consists of just 4 variables.
First, we select all variables from the iris dataset, excluding the group names, i.e. the labels:
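A minimal sketch (the name x is our choice; Species, the fifth column, holds the labels):

x <- iris[, 1:4]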

Now, let’s try a few different algorithms, projecting to two dimensions and visualizing with mplot3.xy. Notice that in these examples we use the real labels to color points:

8.1.1.1 Principal Component Analysis (PCA)

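A call along these lines produces the log below (a sketch, assuming d.PCA takes the data as its first argument and k for the number of dimensions, and that the projected data are stored in a projections.train element of the returned object):

# Project the 4 iris features onto the first 2 principal components
iris.PCA <- d.PCA(x, k = 2)
# Plot the projection, coloring points by the true species labels
mplot3.xy(iris.PCA$projections.train[, 1], iris.PCA$projections.train[, 2],
          group = iris$Species)
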
[2019-08-02 17:08:55 d.PCA] Hello, egenn 
[2019-08-02 17:08:55 d.PCA] ||| Input has dimensions 150 rows by 4 columns, 
[2019-08-02 17:08:55 d.PCA]     interpreted as 150 cases with 4 features. 
[2019-08-02 17:08:55 d.PCA] Performing Principal Component Analysis... 

[2019-08-02 17:08:55 d.PCA] Run completed in 5.8e-04 minutes (Real: 0.04; User: 0.01; System: 1e-03) 

8.1.1.2 Independent Component Analysis (ICA)

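Similarly for ICA (same interface assumptions as above):

iris.ICA <- d.ICA(x, k = 2)
mplot3.xy(iris.ICA$projections.train[, 1], iris.ICA$projections.train[, 2],
          group = iris$Species)
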
[2019-08-02 17:08:56 d.ICA] Hello, egenn 
[2019-08-02 17:08:56 d.ICA] ||| Input has dimensions 150 rows by 4 columns, 
[2019-08-02 17:08:56 d.ICA]     interpreted as 150 cases with 4 features. 
[2019-08-02 17:08:56 d.ICA] Running Independent Component Analysis... 

[2019-08-02 17:08:56 d.ICA] Run completed in 2.7e-04 minutes (Real: 0.02; User: 0.01; System: 2e-03) 

8.1.1.3 Non-negative Matrix Factorization (NMF)

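NMF requires non-negative input, which the iris measurements satisfy (a sketch, same assumptions as above):

iris.NMF <- d.NMF(x, k = 2)
mplot3.xy(iris.NMF$projections.train[, 1], iris.NMF$projections.train[, 2],
          group = iris$Species)
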
[2019-08-02 17:08:57 d.NMF] Hello, egenn 
[2019-08-02 17:08:59 d.NMF] ||| Input has dimensions 150 rows by 4 columns, 
[2019-08-02 17:08:59 d.NMF]     interpreted as 150 cases with 4 features. 
[2019-08-02 17:08:59 d.NMF] Running Non-negative Matrix Factorization... 

[2019-08-02 17:09:02 d.NMF] Run completed in 0.09 minutes (Real: 5.64; User: 3.08; System: 0.18) 

8.1.2 Non-linear dimensionality reduction

8.1.2.1 Isomap

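A sketch, under the same assumptions:

iris.ISOMAP <- d.ISOMAP(x, k = 2)
mplot3.xy(iris.ISOMAP$projections.train[, 1], iris.ISOMAP$projections.train[, 2],
          group = iris$Species)
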
[2019-08-02 17:09:03 d.ISOMAP] Hello, egenn 
[2019-08-02 17:09:05 d.ISOMAP] ||| Input has dimensions 150 rows by 4 columns, 
[2019-08-02 17:09:05 d.ISOMAP]     interpreted as 150 cases with 4 features. 
[2019-08-02 17:09:05 d.ISOMAP] Running Isomap... 

[2019-08-02 17:09:05 d.ISOMAP] Run completed in 0.03 minutes (Real: 2.05; User: 1.23; System: 0.10) 

8.1.2.2 t-distributed Stochastic Neighbor Embedding (t-SNE)

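The log below reports perplexity = 10, so the call presumably looked something like the following (the perplexity argument name is our assumption). Note also that t-SNE ran on 149 cases rather than 150: iris contains one duplicated row, and t-SNE implementations typically require unique inputs:

iris.TSNE <- d.TSNE(x, k = 2, perplexity = 10)
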
[2019-08-02 17:09:05 d.TSNE] Hello, egenn 
[2019-08-02 17:09:05 d.TSNE] Running t-distributed Stochastic Neighbor Embedding 
[2019-08-02 17:09:05 d.TSNE] ||| Input has dimensions 149 rows by 4 columns, 
[2019-08-02 17:09:05 d.TSNE]     interpreted as 149 cases with 4 features. 
[2019-08-02 17:09:05 d.TSNE] Running t-SNE... 
Performing PCA
Read the 149 x 4 data matrix successfully!
OpenMP is working. 1 threads.
Using no_dims = 2, perplexity = 10.000000, and theta = 0.000000
Computing input similarities...
Symmetrizing...
Done in 0.01 seconds!
Learning embedding...
Iteration 50: error is 55.938665 (50 iterations in 0.01 seconds)
Iteration 100: error is 54.077677 (50 iterations in 0.01 seconds)
Iteration 150: error is 55.196859 (50 iterations in 0.01 seconds)
Iteration 200: error is 52.165526 (50 iterations in 0.01 seconds)
Iteration 250: error is 53.116539 (50 iterations in 0.01 seconds)
Iteration 300: error is 1.298310 (50 iterations in 0.01 seconds)
Iteration 350: error is 0.450253 (50 iterations in 0.01 seconds)
Iteration 400: error is 0.399577 (50 iterations in 0.02 seconds)
Iteration 450: error is 0.334860 (50 iterations in 0.02 seconds)
Iteration 500: error is 0.320333 (50 iterations in 0.02 seconds)
Iteration 550: error is 0.316245 (50 iterations in 0.02 seconds)
Iteration 600: error is 0.313185 (50 iterations in 0.02 seconds)
Iteration 650: error is 0.311040 (50 iterations in 0.02 seconds)
Iteration 700: error is 0.309425 (50 iterations in 0.02 seconds)
Iteration 750: error is 0.308175 (50 iterations in 0.02 seconds)
Iteration 800: error is 0.307157 (50 iterations in 0.02 seconds)
Iteration 850: error is 0.306294 (50 iterations in 0.01 seconds)
Iteration 900: error is 0.305547 (50 iterations in 0.01 seconds)
Iteration 950: error is 0.304907 (50 iterations in 0.02 seconds)
Iteration 1000: error is 0.304334 (50 iterations in 0.02 seconds)
Fitting performed in 0.29 seconds.

[2019-08-02 17:09:05 d.TSNE] Run completed in 0.01 minutes (Real: 0.40; User: 0.31; System: 0.01) 

8.2 Clustering

Use clustSelect() to get a listing of available clustering algorithms:

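Called with no arguments, it prints the listing that follows:

clustSelect()
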
.:clustSelect
rtemis supports the following clustering algorithms:

      Name                                             Description
    CMEANS                                Fuzzy C-means Clustering
       EMC                     Expectation Maximization Clustering
    HARDCL                               Hard Competitive Learning
    HOPACH Hierarchical Ordered Partitioning And Collapsing Hybrid
 H2OKMEANS                                  H2O K-Means Clustering
    KMEANS                                      K-Means Clustering
      NGAS                                   Neural Gas Clustering
       PAM                             Partitioning Around Medoids
      PAMK           Partitioning Around Medoids with k estimation
      SPEC                                     Spectral Clustering

Let’s cluster iris. We shall also use an NMF decomposition, as above, to project to 2 dimensions for visualization.

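A sketch of the decomposition step (same assumptions as in 8.1.1.3):

iris.NMF <- d.NMF(x, k = 2)
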
[2019-08-02 17:09:06 d.NMF] Hello, egenn 
[2019-08-02 17:09:06 d.NMF] ||| Input has dimensions 150 rows by 4 columns, 
[2019-08-02 17:09:06 d.NMF]     interpreted as 150 cases with 4 features. 
[2019-08-02 17:09:06 d.NMF] Running Non-negative Matrix Factorization... 

[2019-08-02 17:09:10 d.NMF] Run completed in 0.08 minutes (Real: 4.56; User: 2.18; System: 0.08) 

8.2.1 K-Means

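A sketch, assuming u.KMEANS takes the data as its first argument and k for the number of clusters, and that cluster assignments are stored in a clusters.train element:

iris.KMEANS <- u.KMEANS(x, k = 3)
# Visualize the cluster assignments on the NMF projection from above
mplot3.xy(iris.NMF$projections.train[, 1], iris.NMF$projections.train[, 2],
          group = iris.KMEANS$clusters.train)
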
[2019-08-02 17:09:10 u.KMEANS] Hello, egenn 
[2019-08-02 17:09:11 u.KMEANS] Performing K-means Clustering with k = 3... 

[2019-08-02 17:09:11 u.KMEANS] Run completed in 0.01 minutes (Real: 0.32; User: 0.24; System: 0.02) 

8.2.2 Partitioning Around Medoids with k estimation (PAMK)

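PAMK estimates the optimal number of clusters itself, so no k need be supplied (a sketch):

iris.PAMK <- u.PAMK(x)
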
[2019-08-02 17:09:11 u.PAMK] Hello, egenn 
[2019-08-02 17:09:12 u.PAMK] Partitioning Around Medoids... 
[2019-08-02 17:09:13 u.PAMK] Estimated optimal number of clusters: 3 

[2019-08-02 17:09:13 u.PAMK] Run completed in 0.03 minutes (Real: 1.51; User: 0.90; System: 0.06) 

8.2.3 Neural Gas

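A sketch, same assumptions as for u.KMEANS:

iris.NGAS <- u.NGAS(x, k = 3)
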
[2019-08-02 17:09:13 u.NGAS] Hello, egenn 
[2019-08-02 17:09:13 u.NGAS] Performing Neural Gas clustering with k = 3... 

[2019-08-02 17:09:13 u.NGAS] Run completed in 9.5e-04 minutes (Real: 0.06; User: 0.04; System: 1e-03) 

8.2.4 Hard Competitive Learning

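A sketch:

iris.HARDCL <- u.HARDCL(x, k = 3)
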
[2019-08-02 17:09:14 u.HARDCL] Hello, egenn 
[2019-08-02 17:09:14 u.HARDCL] Running Hard Competitive Learning with k = 3... 

[2019-08-02 17:09:14 u.HARDCL] Run completed in 6e-04 minutes (Real: 0.04; User: 0.03; System: 0) 

8.2.5 Hierarchical Ordered Partitioning And Collapsing Hybrid (HOPACH)

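Like PAMK, HOPACH estimates the number of clusters itself (a sketch):

iris.HOPACH <- u.HOPACH(x)
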
[2019-08-02 17:09:14 depCheck] Dependencies check passed 
[2019-08-02 17:09:14 u.HOPACH] Running HOPACH clustering... 
Searching for main clusters... 
Level  1 
Identified 3  main clusters in level 1 with MSS = 0.5533997 
Running down without collapsing from Level 1 
Level 2 
Level 3 
Level 4 
Level 5 
Level 6 
[2019-08-02 17:09:17 u.HOPACH] HOPACH identified 3 clusters (sizes: 50, 44, 56) 

[2019-08-02 17:09:17 u.HOPACH] Run completed in 0.05 minutes (Real: 3.00; User: 1.29; System: 0.04)