9 Supervised Learning

  .:rtemis 0.79: Welcome, egenn
  [x86_64-apple-darwin15.6.0 (64-bit): Defaulting to 4/4 available cores]
  Online documentation & vignettes: https://rtemis.netlify.com

All rtemis learners train a model, after optional tuning of hyperparameters by grid search where applicable, and validate it if a test set is provided. Use modSelect() to get a list of all available algorithms:
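
For example, calling it with no arguments should print the list of supported algorithms (listing not shown here; it will depend on your rtemis version):

modSelect()  # list available supervised learning algorithms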

9.1 Data Input for Supervised Learning

All rtemis supervised learning functions begin with “s.” for “supervised”. They accept the same first four arguments:
x, y, x.test, y.test
but are flexible, allowing you to also provide combined (x, y) and (x.test, y.test) data frames.

9.1.1 Scenario 1: (x.train, y.train, x.test, y.test)

In the most straightforward case, provide each input individually (see the example after this list):

  • x: Training set features
  • y: Training set outcome
  • x.test: Testing set features (Optional)
  • y.test: Testing set outcome (Optional)
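
A minimal sketch of such a call, using hypothetical synthetic data with the same dimensions as the output below (the original data-generating code is not shown):

set.seed(2019)                                              # assumed seed
x      <- as.data.frame(matrix(rnorm(147 * 10), 147, 10))   # training features
y      <- x[[3]] + x[[5]]^2 + rnorm(147)                    # hypothetical training outcome
x.test <- as.data.frame(matrix(rnorm(53 * 10), 53, 10))     # testing features
y.test <- x.test[[3]] + x.test[[5]]^2 + rnorm(53)           # hypothetical testing outcome
mod.glm <- s.GLM(x, y, x.test = x.test, y.test = y.test)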
[2019-08-21 08:18:15 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 147 x 10 
    Training outcome: 147 x 1 
    Testing features: 53 x 10 
     Testing outcome: 53 x 1 

[2019-08-21 08:18:18 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 0.84 (91.88%)
   RMSE = 0.92 (71.51%)
    MAE = 0.75 (69.80%)
      r = 0.96 (p = 5.9e-81)
    rho = 0.95 (p = 0)
   R sq = 0.92

[[ GLM Regression Testing Summary ]]
    MSE = 1.22 (89.03%)
   RMSE = 1.10 (66.88%)
    MAE = 0.90 (66.66%)
      r = 0.94 (p = 2.5e-26)
    rho = 0.95 (p = 0)
   R sq = 0.89

[2019-08-21 08:18:18 s.GLM] Run completed in 0.04 minutes (Real: 2.38; User: 1.20; System: 0.10) 

9.1.2 Scenario 2: (x.train, x.test)

You can provide the training and testing sets as a single data.frame each, where the last column is the outcome (see the example after this list):

  • x: data.frame(x.train, y.train)
  • y: data.frame(x.test, y.test)
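
A sketch using the same hypothetical data as above, combined into single data frames (outcome as the last column):

dat.train <- data.frame(x, y)            # training features + outcome
dat.test  <- data.frame(x.test, y.test)  # testing features + outcome
mod.glm2  <- s.GLM(dat.train, dat.test)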
[2019-08-21 08:18:18 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 147 x 10 
    Training outcome: 147 x 1 
    Testing features: 53 x 10 
     Testing outcome: 53 x 1 

[2019-08-21 08:18:18 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 0.84 (91.88%)
   RMSE = 0.92 (71.51%)
    MAE = 0.75 (69.80%)
      r = 0.96 (p = 5.9e-81)
    rho = 0.95 (p = 0)
   R sq = 0.92

[[ GLM Regression Testing Summary ]]
    MSE = 1.22 (89.03%)
   RMSE = 1.10 (66.88%)
    MAE = 0.90 (66.66%)
      r = 0.94 (p = 2.5e-26)
    rho = 0.95 (p = 0)
   R sq = 0.89

[2019-08-21 08:18:19 s.GLM] Run completed in 0.01 minutes (Real: 0.39; User: 0.09; System: 0.01) 

The dataPrepare function checks the data dimensions, determines whether the data was provided as separate feature and outcome sets or as combined data frames, and ensures that the correct number of cases and features was provided.

In either scenario, Regression will be performed if the outcome is numeric and Classification if the outcome is a factor.

9.2 Generalized Linear Model (GLM)
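
A GLM is fit with s.GLM, as in Scenario 1 above; a minimal sketch reusing the same hypothetical data:

mod.glm <- s.GLM(x, y, x.test = x.test, y.test = y.test)  # sketch; defaults assumed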

[2019-08-21 08:18:19 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 147 x 10 
    Training outcome: 147 x 1 
    Testing features: 53 x 10 
     Testing outcome: 53 x 1 

[2019-08-21 08:18:19 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 0.84 (91.88%)
   RMSE = 0.92 (71.51%)
    MAE = 0.75 (69.80%)
      r = 0.96 (p = 5.9e-81)
    rho = 0.95 (p = 0)
   R sq = 0.92

[[ GLM Regression Testing Summary ]]
    MSE = 1.22 (89.03%)
   RMSE = 1.10 (66.88%)
    MAE = 0.90 (66.66%)
      r = 0.94 (p = 2.5e-26)
    rho = 0.95 (p = 0)
   R sq = 0.89

[2019-08-21 08:18:19 s.GLM] Run completed in 2.9e-03 minutes (Real: 0.17; User: 0.09; System: 0.01) 

Note: If there are factor features, s.GLM checks that the test set contains no factor levels that are absent from the training set, since such levels would cause predict to fail. This problem may arise when you are running multiple cross-validated experiments.

9.3 Elastic Net (Regularized GLM)

Regularization helps prevent overfitting and allows training a linear model on a dataset with more features than cases (p >> n).
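
A sketch, assuming hypothetical wide synthetic data matching the dimensions in the output below (the original data-generating code is not shown); setting alpha = 0 requests ridge regression:

set.seed(2019)                                                   # assumed seed
x      <- as.data.frame(matrix(rnorm(374 * 1000), 374, 1000))    # wide training set
y      <- 20 * x[[1]] - 15 * x[[2]] + rnorm(374, sd = 30)        # hypothetical outcome
x.test <- as.data.frame(matrix(rnorm(126 * 1000), 126, 1000))    # wide testing set
y.test <- 20 * x.test[[1]] - 15 * x.test[[2]] + rnorm(126, sd = 30)
mod.ridge <- s.GLMNET(x, y, x.test = x.test, y.test = y.test, alpha = 0)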

[2019-08-21 08:18:22 s.GLMNET] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 374 x 1000 
    Training outcome: 374 x 1 
    Testing features: 126 x 1000 
     Testing outcome: 126 x 1 

[2019-08-21 08:18:23 gridSearchLearn] Running grid search... 

[[ Resampling Parameters ]]
    n.resamples: 5 
      resampler: kfold 
   stratify.var: y 
   strat.n.bins: 4 
[2019-08-21 08:18:23 resample] Created 5 independent folds 

[[ Search parameters ]]
    grid.params:  
                 alpha: 0 
   fixed.params:  
                             .gs: TRUE 
                 which.cv.lambda: lambda.1se 
[2019-08-21 08:18:23 gridSearchLearn] Tuning Elastic Net by exhaustive grid search: 
[2019-08-21 08:18:23 gridSearchLearn] 5 resamples; 5 models total; running on 4 cores (x86_64-apple-darwin15.6.0)
 

[[ Best parameters to minimize MSE ]]
   best.tune:  
              lambda: 197.615344869254 
               alpha: 0 

[2019-08-21 08:18:39 gridSearchLearn] Run completed in 0.28 minutes (Real: 16.54; User: 0.28; System: 0.21) 

[[ Parameters ]]
    alpha: 0 
   lambda: 197.615344869254 

[2019-08-21 08:18:39 s.GLMNET] Training elastic net model... 

[[ GLMNET Regression Training Summary ]]
    MSE = 439.92 (58.53%)
   RMSE = 20.97 (35.60%)
    MAE = 16.76 (35.85%)
      r = 0.96 (p = 4.4e-202)
    rho = 0.96 (p = 0)
   R sq = 0.59

[[ GLMNET Regression Testing Summary ]]
    MSE = 953.56 (16.68%)
   RMSE = 30.88 (8.72%)
    MAE = 24.96 (8.57%)
      r = 0.52 (p = 6.6e-10)
    rho = 0.47 (p = 3.7e-08)
   R sq = 0.17

[2019-08-21 08:18:40 s.GLMNET] Run completed in 0.29 minutes (Real: 17.47; User: 0.87; System: 0.25) 
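
Setting alpha = 1 gives the lasso; a minimal sketch using the same data:

mod.lasso <- s.GLMNET(x, y, x.test = x.test, y.test = y.test, alpha = 1)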

[2019-08-21 08:18:40 s.GLMNET] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 374 x 1000 
    Training outcome: 374 x 1 
    Testing features: 126 x 1000 
     Testing outcome: 126 x 1 

[2019-08-21 08:18:40 gridSearchLearn] Running grid search... 

[[ Resampling Parameters ]]
    n.resamples: 5 
      resampler: kfold 
   stratify.var: y 
   strat.n.bins: 4 
[2019-08-21 08:18:40 resample] Created 5 independent folds 

[[ Search parameters ]]
    grid.params:  
                 alpha: 1 
   fixed.params:  
                             .gs: TRUE 
                 which.cv.lambda: lambda.1se 
[2019-08-21 08:18:40 gridSearchLearn] Tuning Elastic Net by exhaustive grid search: 
[2019-08-21 08:18:40 gridSearchLearn] 5 resamples; 5 models total; running on 4 cores (x86_64-apple-darwin15.6.0)
 

[[ Best parameters to minimize MSE ]]
   best.tune:  
              lambda: 5.36519161089185 
               alpha: 1 

[2019-08-21 08:18:47 gridSearchLearn] Run completed in 0.12 minutes (Real: 6.94; User: 0.06; System: 0.07) 

[[ Parameters ]]
    alpha: 1 
   lambda: 5.36519161089185 

[2019-08-21 08:18:47 s.GLMNET] Training elastic net model... 

[[ GLMNET Regression Training Summary ]]
    MSE = 1000.98 (5.65%)
   RMSE = 31.64 (2.86%)
    MAE = 25.24 (3.37%)
      r = 0.39 (p = 4.8e-15)
    rho = 0.38 (p = 2e-14)
   R sq = 0.06

[[ GLMNET Regression Testing Summary ]]
    MSE = 1133.60 (0.94%)
   RMSE = 33.67 (0.47%)
    MAE = 27.30 (4.9e-03%)
      r = 0.10 (p = 0.26)
    rho = 0.11 (p = 0.22)
   R sq = 0.01

[2019-08-21 08:18:47 s.GLMNET] Run completed in 0.13 minutes (Real: 7.67; User: 0.63; System: 0.12) 

If you do not define alpha, it defaults to seq(0, 1, 0.2), which means that grid search will be used for tuning.
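
A minimal sketch, leaving alpha at its default so that it is tuned by grid search:

mod.enet <- s.GLMNET(x, y, x.test = x.test, y.test = y.test)  # alpha defaults to seq(0, 1, 0.2)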

[2019-08-21 08:18:48 s.GLMNET] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 374 x 1000 
    Training outcome: 374 x 1 
    Testing features: 126 x 1000 
     Testing outcome: 126 x 1 

[2019-08-21 08:18:48 gridSearchLearn] Running grid search... 

[[ Resampling Parameters ]]
    n.resamples: 5 
      resampler: kfold 
   stratify.var: y 
   strat.n.bins: 4 
[2019-08-21 08:18:48 resample] Created 5 independent folds 

[[ Search parameters ]]
    grid.params:  
                 alpha: 0, 0.2, 0.4, 0.6, 0.8, 1 
   fixed.params:  
                             .gs: TRUE 
                 which.cv.lambda: lambda.1se 
[2019-08-21 08:18:48 gridSearchLearn] Tuning Elastic Net by exhaustive grid search: 
[2019-08-21 08:18:48 gridSearchLearn] 5 resamples; 30 models total; running on 4 cores (x86_64-apple-darwin15.6.0)
 

[[ Best parameters to minimize MSE ]]
   best.tune:  
              lambda: 168.324073813456 
               alpha: 0 

[2019-08-21 08:19:17 gridSearchLearn] Run completed in 0.48 minutes (Real: 29.09; User: 0.54; System: 0.26) 

[[ Parameters ]]
    alpha: 0 
   lambda: 168.324073813456 

[2019-08-21 08:19:17 s.GLMNET] Training elastic net model... 

[[ GLMNET Regression Training Summary ]]
    MSE = 394.36 (62.83%)
   RMSE = 19.86 (39.03%)
    MAE = 15.86 (39.29%)
      r = 0.96 (p = 9.2e-209)
    rho = 0.96 (p = 0)
   R sq = 0.63

[[ GLMNET Regression Testing Summary ]]
    MSE = 937.60 (18.07%)
   RMSE = 30.62 (9.49%)
    MAE = 24.81 (9.12%)
      r = 0.52 (p = 4.8e-10)
    rho = 0.48 (p = 2.8e-08)
   R sq = 0.18

[2019-08-21 08:19:17 s.GLMNET] Run completed in 0.49 minutes (Real: 29.66; User: 0.98; System: 0.30) 

Many real-world relationships are nonlinear. A large number of regression approaches exist to model such relationships.
Let’s create some new synthetic data:
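
A sketch of what such data might look like (the original code is not shown; the dimensions match the output in the following sections):

set.seed(2019)                                          # assumed seed
x <- as.data.frame(matrix(rnorm(400 * 10), 400, 10))    # 400 cases, 10 features
y <- 0.5 * x[[1]]^3 + rnorm(400)                        # outcome depends on the cube of one feature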

In this example, y depends on the cube of x.

9.4 Polynomial regression

Probably the simplest way to model nonlinear relationships is to include polynomial terms in a linear model.
In rtemis, s.POLY is an alias for s.GLM(polynomial = TRUE), which adds polynomial terms using R’s poly() function:
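
A minimal sketch of a call on the data above (defaults assumed):

mod.poly <- s.POLY(x, y)  # polynomial regression via s.GLM(polynomial = TRUE)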

[2019-08-21 08:19:17 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 400 x 10 
    Training outcome: 400 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2019-08-21 08:19:17 s.GLM] Training GLM... 

[[ POLY Regression Training Summary ]]
    MSE = 0.89 (77.33%)
   RMSE = 0.94 (52.38%)
    MAE = 0.76 (48.34%)
      r = 0.88 (p = 2.6e-130)
    rho = 0.82 (p = 0)
   R sq = 0.77

[2019-08-21 08:19:18 s.GLM] Run completed in 3.1e-03 minutes (Real: 0.19; User: 0.12; System: 0.03) 

9.5 Generalized Additive Model (GAM)

Generalized Additive Models provide a very efficient way of fitting curves of any shape.
GAMs in rtemis can be fit with s.GAM (which uses mgcv::gam):
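
A minimal sketch of a call on the same data (defaults assumed):

mod.gam <- s.GAM(x, y)  # fits a GAM using mgcv::gam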

[2019-08-21 08:19:18 s.GAM.default] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 400 x 10 
    Training outcome: 400 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2019-08-21 08:19:18 s.GAM.default] Training GAM... 

[[ GAM Regression Training Summary ]]
    MSE = 0.91 (76.75%)
   RMSE = 0.96 (51.78%)
    MAE = 0.77 (47.83%)
      r = 0.88 (p = 3.9e-128)
    rho = 0.82 (p = 0)
   R sq = 0.77

[2019-08-21 08:19:19 s.GAM.default] Run completed in 0.01 minutes (Real: 0.89; User: 0.70; System: 0.06) 

9.6 Regularized GAM

Adding regularization to GAMs results in a very powerful nonparametric regression tool that can be applied to wide datasets (where a regular GAM would run out of degrees of freedom).
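
A sketch, assuming a new, wider hypothetical dataset matching the dimensions in the output below (the original data-generating code is not shown):

set.seed(2019)                                          # assumed seed
x <- as.data.frame(matrix(rnorm(500 * 50), 500, 50))    # 500 cases, 50 features
y <- 0.3 * x[[1]]^3 + x[[5]] + rnorm(500)               # hypothetical nonlinear outcome
mod.gamsel <- s.GAMSEL(x, y)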

[2019-08-21 08:19:19 s.GAMSEL] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 50 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2019-08-21 08:19:20 s.GAMSEL] Training GAMSEL... 

[[ GAMSEL Regression Training Summary ]]
    MSE = 2.55 (93.65%)
   RMSE = 1.60 (74.80%)
    MAE = 1.17 (71.84%)
      r = 0.97 (p = 6.4e-307)
    rho = 0.94 (p = 0)
   R sq = 0.94

[2019-08-21 08:19:20 s.GAMSEL] Run completed in 0.02 minutes (Real: 0.97; User: 0.69; System: 0.09) 

9.7 Projection Pursuit Regression (PPR)

Projection Pursuit Regression is an extension of (generalized) additive models.
Where a linear model is a linear combination of a set of predictors,
and an additive model is a linear combination of nonlinear transformations of a set of predictors,
a projection pursuit model is a linear combination of nonlinear transformations of linear combinations of predictors.
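
A minimal sketch of a call on the same data (the parameters listed in the output below are assumed to be defaults):

mod.ppr <- s.PPR(x, y)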

[2019-08-21 08:19:21 s.PPR] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 50 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[[ Parameters ]]
      nterms: 4 
    optlevel: 3 
   sm.method: supsmu 
        bass: 0 
        span: 0 
          df: 5 
      gcvpen: 1 

[2019-08-21 08:19:21 s.PPR] Running Projection Pursuit Regression... 

[[ PPR Regression Training Summary ]]
    MSE = 3.11 (92.26%)
   RMSE = 1.76 (72.18%)
    MAE = 1.31 (68.65%)
      r = 0.96 (p = 7e-279)
    rho = 0.93 (p = 0)
   R sq = 0.92

[2019-08-21 08:19:21 s.PPR] Run completed in 0.01 minutes (Real: 0.49; User: 0.31; System: 0.02) 

9.8 Support Vector Machine (SVM)
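
A minimal sketch of a call on the same data (the output below shows SVM regression with the default radial kernel):

mod.svm <- s.SVM(x, y)  # sketch; defaults assumed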

[2019-08-21 08:19:22 s.SVM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 50 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2019-08-21 08:19:22 s.SVM] Training SVM Regression with radial kernel... 

[[ SVM Regression Training Summary ]]
    MSE = 10.08 (74.90%)
   RMSE = 3.17 (49.90%)
    MAE = 1.33 (68.04%)
      r = 0.91 (p = 3.1e-188)
    rho = 0.98 (p = 0)
   R sq = 0.75

[2019-08-21 08:19:22 s.SVM] Run completed in 0.01 minutes (Real: 0.69; User: 0.38; System: 0.02) 

9.9 Classification and Regression Trees (CART)
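
A minimal sketch of a call on the same data (defaults assumed):

mod.cart <- s.CART(x, y)  # single decision tree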

[2019-08-21 08:19:23 s.CART] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 50 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2019-08-21 08:19:23 s.CART] Training CART... 

[[ CART Regression Training Summary ]]
    MSE = 8.04 (79.98%)
   RMSE = 2.84 (55.25%)
    MAE = 2.22 (46.85%)
      r = 0.89 (p = 4.7e-176)
    rho = 0.77 (p = 3.4e-101)
   R sq = 0.80

[2019-08-21 08:19:23 s.CART] Run completed in 3.8e-03 minutes (Real: 0.23; User: 0.14; System: 0.02) 

9.10 Random Forest

Multiple Random Forest implementations are included in rtemis. ranger provides an efficient implementation that is well-suited for general use.
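
A minimal sketch of a call on the same data (the 1000 trees shown in the output below appear to be the default):

mod.rf <- s.RANGER(x, y)  # Random Forest via the ranger package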

[2019-08-21 08:19:24 s.RANGER] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 50 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[[ Parameters ]]
   n.trees: 1000 
      mtry: NULL 

[2019-08-21 08:19:24 s.RANGER] Training Random Forest (ranger) Regression with 1000 trees... 

[[ RANGER Regression Training Summary ]]
    MSE = 2.74 (93.18%)
   RMSE = 1.66 (73.88%)
    MAE = 0.97 (76.72%)
      r = 0.98 (p = 0)
    rho = 0.98 (p = 0)
   R sq = 0.93

[2019-08-21 08:19:26 s.RANGER] Run completed in 0.05 minutes (Real: 2.86; User: 5.57; System: 0.07) 

9.11 Gradient Boosting

Gradient Boosting is, on average, the most powerful learning algorithm for structured data. rtemis includes multiple implementations of boosting, along with support for boosting any learner; see the chapter on Boosting.
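
A minimal sketch of a call on the same data (the hyperparameters listed in the output below may be defaults or explicitly set; the original call is not shown):

mod.gbm <- s.GBM(x, y)  # gradient boosting via the gbm package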

[2019-08-21 08:19:27 s.GBM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 50 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available
[2019-08-21 08:19:27 s.GBM] Distribution set to gaussian 

[2019-08-21 08:19:27 s.GBM] Running Gradient Boosting Regression with a gaussian loss function 

[2019-08-21 08:19:27 gridSearchLearn] Running grid search... 

[[ Resampling Parameters ]]
    n.resamples: 5 
      resampler: kfold 
   stratify.var: y 
   strat.n.bins: 4 
[2019-08-21 08:19:27 resample] Created 5 independent folds 

[[ Search parameters ]]
    grid.params:  
                 interaction.depth: 2 
                         shrinkage: 0.01 
                      bag.fraction: 0.9 
                    n.minobsinnode: 5 
   fixed.params:  
                           n.trees: 2000 
                         max.trees: 5000 
                     n.tree.window: 0 
                 gbm.select.smooth: TRUE 
                       n.new.trees: 500 
                         min.trees: 50 
                    failsafe.trees: 1000 
                               ipw: TRUE 
                          ipw.type: 2 
                          upsample: FALSE 
                     resample.seed: NULL 
                            relInf: FALSE 
                   plot.tune.error: FALSE 
                               .gs: TRUE 
[2019-08-21 08:19:27 gridSearchLearn] Tuning Gradient Boosting Machine by exhaustive grid search: 
[2019-08-21 08:19:27 gridSearchLearn] 5 resamples; 5 models total; running on 4 cores (x86_64-apple-darwin15.6.0)
 

[[ Best parameters to minimize MSE ]]
   best.tune:  
                        n.trees: 2836 
              interaction.depth: 2 
                      shrinkage: 0.01 
                   bag.fraction: 0.9 
                 n.minobsinnode: 5 

[2019-08-21 08:19:36 gridSearchLearn] Run completed in 0.14 minutes (Real: 8.30; User: 0.02; System: 0.04) 

[[ Parameters ]]
             n.trees: 2836 
   interaction.depth: 2 
           shrinkage: 0.01 
        bag.fraction: 0.9 
      n.minobsinnode: 5 
             weights: NULL 
[2019-08-21 08:19:36 s.GBM] Training GBM on full training set... 

[[ GBM Regression Training Summary ]]
    MSE = 1.46 (96.36%)
   RMSE = 1.21 (80.93%)
    MAE = 0.90 (78.46%)
      r = 0.98 (p = 0)
    rho = 0.97 (p = 0)
   R sq = 0.96
[2019-08-21 08:19:39 s.GBM] Calculating relative influence of variables... 

[2019-08-21 08:19:39 s.GBM] Run completed in 0.20 minutes (Real: 12.05; User: 3.05; System: 0.12)