14 Bagging

Bagging, short for bootstrap aggregating, is a core ML ensemble technique that reduces the variance of a base learner. In bagging, the training set is resampled with replacement (bootstrapped), a model is trained on each resample, and the models' predictions are aggregated (averaged for regression, majority-voted for classification) to give the final estimate. Random Forest, which combines bagging with random feature selection at each split, is its most popular application. rtemis allows you to easily bag any learner, but don't try bagging a linear model: linear fits are stable across resamples, so their average is just another linear fit with little variance left to reduce.
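
To make the procedure concrete, here is a hand-rolled version of the idea in base R. This is a toy sketch using rpart and the built-in mtcars data, not the rtemis implementation:

  library(rpart)
  set.seed(2019)
  n <- nrow(mtcars)
  # Fit 20 trees, one per bootstrap resample, predicting on all cases
  preds <- replicate(20, {
    idx <- sample(n, n, replace = TRUE)           # bootstrap resample
    fit <- rpart(mpg ~ ., data = mtcars[idx, ])   # one tree per resample
    predict(fit, mtcars)
  })
  bagged.pred <- rowMeans(preds)                  # aggregate by averaging

Loading rtemis prints a session banner:

  library(rtemis)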

  .:rtemis 0.79: Welcome, egenn
  [x86_64-apple-darwin15.6.0 (64-bit): Defaulting to 4/4 available cores]
  Online documentation & vignettes: https://rtemis.netlify.com

14.1 Regression

First, create some synthetic data:
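
A minimal sketch of one way to produce such data, matching the 500 x 50 dimensions and roughly 75/25 split in the input summaries below. rnormmat and resample are rtemis functions; the seed values and the exact resample call are assumptions:

  x <- rnormmat(500, 50, seed = 2019)    # 500 cases x 50 random normal features
  w <- rnorm(50)                         # true coefficients
  y <- x %*% w + rnorm(500)              # linear signal plus noise
  dat <- data.frame(x, y)
  res <- resample(dat, seed = 2019)      # training-set indices per resample
  dat.train <- dat[res$Subsample_1, ]
  dat.test <- dat[-res$Subsample_1, ]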

14.1.1 Single CART

Let’s start by training a single CART of depth 3:
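
A sketch of the call behind the log below, using the rtemis convention of passing the training and testing data frames (outcome in the last column) directly:

  mod.cart <- s.CART(dat.train, dat.test, maxdepth = 3)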

[2019-08-02 17:27:33 s.CART] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 373 x 50 
    Training outcome: 373 x 1 
    Testing features: 127 x 50 
     Testing outcome: 127 x 1 

[2019-08-02 17:27:34 s.CART] Training CART... 

[[ CART Regression Training Summary ]]
    MSE = 7.54 (67.32%)
   RMSE = 2.75 (42.84%)
    MAE = 2.17 (43.35%)
      r = 0.82 (p = 3.9e-92)
    rho = 0.81 (p = 1.3e-88)
   R sq = 0.67

[[ CART Regression Testing Summary ]]
    MSE = 7.83 (64.11%)
   RMSE = 2.80 (40.09%)
    MAE = 2.17 (40.52%)
      r = 0.80 (p = 7.3e-30)
    rho = 0.78 (p = 2e-27)
   R sq = 0.64

[2019-08-02 17:27:34 s.CART] Run completed in 0.02 minutes (Real: 1.24; User: 1.03; System: 0.08) 

14.1.2 Bagged CARTs

Let's bag 20 CARTs of the same depth:
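
The corresponding call might look as follows. The mod and mod.params names follow the [[ Parameters ]] block in the log; k as the name of the resample-count argument is an assumption:

  mod.cart.bag <- bag(dat.train, dat.test,
                      mod = "cart",
                      mod.params = list(maxdepth = 3),
                      k = 20)   # number of bootstrap resamples (assumed argument name)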

[2019-08-02 17:27:35 bag] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 373 x 50 
    Training outcome: 373 x 1 
    Testing features: 127 x 50 
     Testing outcome: 127 x 1 

[[ Parameters ]]
          mod: cart 
   mod.params:  
               maxdepth: 3 
[2019-08-02 17:27:35 bag] Bagging 20 Classification and Regression Trees... 

[2019-08-02 17:27:35 resLearn] Training Classification and Regression Trees on 20 bootstrap resamples... 
[2019-08-02 17:27:35 resLearn] Parallelizing by forking on 4 cores... 

[[ Regression Training Summary ]]
    MSE = 6.07 (73.72%)
   RMSE = 2.46 (48.73%)
    MAE = 1.92 (50.09%)
      r = 0.87 (p = 5e-117)
    rho = 0.88 (p = 2e-121)
   R sq = 0.74

[[ Regression Testing Summary ]]
    MSE = 6.75 (69.06%)
   RMSE = 2.60 (44.38%)
    MAE = 1.98 (45.56%)
      r = 0.84 (p = 3e-35)
    rho = 0.84 (p = 4.2e-35)
   R sq = 0.69

[2019-08-02 17:27:37 bag] Run completed in 0.03 minutes (Real: 1.91; User: 0.18; System: 0.12) 

We make two important observations:

  • Both training and testing errors are reduced
  • Generalizability is increased: testing R-squared rises from 0.64 for the single tree to 0.69 for the bagged ensemble

14.2 Classification

We'll use the Sonar dataset:
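
One way to load and split it; Sonar (208 cases, 60 features, binary outcome Class: M = metal, R = rock) ships with the mlbench package, and the seed is an assumption:

  data(Sonar, package = "mlbench")
  res <- resample(Sonar, seed = 2019)    # stratified on the Class column
  sonar.train <- Sonar[res$Subsample_1, ]
  sonar.test <- Sonar[-res$Subsample_1, ]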

14.2.1 Single CART
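
A sketch of the call behind the log below; the log shows no tree parameters, so defaults are assumed:

  mod.cart <- s.CART(sonar.train, sonar.test)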

[2019-08-02 17:27:40 s.CART] Hello, egenn 

[2019-08-02 17:27:40 dataPrepare] Imbalanced classes: using Inverse Probability Weighting 

[[ Classification Input Summary ]]
   Training features: 155 x 60 
    Training outcome: 155 x 1 
    Testing features: 53 x 60 
     Testing outcome: 53 x 1 

[2019-08-02 17:27:40 s.CART] Training CART... 

[[ CART Classification Training Summary ]]
                   Reference 
        Estimated  M   R   
                M  83   0
                R   0  72

                   Overall  
      Sensitivity  1      
      Specificity  1      
Balanced Accuracy  1      
              PPV  1      
              NPV  1      
               F1  1      
         Accuracy  1      
              AUC  1      

  Positive Class:  M 

[[ CART Classification Testing Summary ]]
                   Reference 
        Estimated  M   R   
                M  21   7
                R   7  18

                   Overall  
      Sensitivity  0.7500 
      Specificity  0.7200 
Balanced Accuracy  0.7350 
              PPV  0.7500 
              NPV  0.7200 
               F1  0.7500 
         Accuracy  0.7358 
              AUC  0.7350 

  Positive Class:  M 

[2019-08-02 17:27:40 s.CART] Run completed in 1.8e-03 minutes (Real: 0.11; User: 0.08; System: 0.01) 

14.2.2 Bagged CARTs
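
The corresponding call, mirroring the regression example; maxdepth = 10 comes from the [[ Parameters ]] block below, while k = 20 for the resample count is again an assumption:

  mod.cart.bag <- bag(sonar.train, sonar.test,
                      mod = "cart",
                      mod.params = list(maxdepth = 10),
                      k = 20)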

[2019-08-02 17:27:41 bag] Hello, egenn 

[2019-08-02 17:27:41 dataPrepare] Imbalanced classes: using Inverse Probability Weighting 

[[ Classification Input Summary ]]
   Training features: 155 x 60 
    Training outcome: 155 x 1 
    Testing features: 53 x 60 
     Testing outcome: 53 x 1 

[[ Parameters ]]
          mod: cart 
   mod.params:  
               maxdepth: 10 
[2019-08-02 17:27:41 bag] Bagging 20 Classification and Regression Trees... 

[2019-08-02 17:27:41 resLearn] Training Classification and Regression Trees on 20 bootstrap resamples... 
[2019-08-02 17:27:41 resLearn] Parallelizing by forking on 4 cores... 

[[ Classification Training Summary ]]
                   Reference 
        Estimated  M   R   
                M  82   0
                R   1  72

                   Overall  
      Sensitivity  0.9880 
      Specificity  1.0000 
Balanced Accuracy  0.9940 
              PPV  1.0000 
              NPV  0.9863 
               F1  0.9939 
         Accuracy  0.9935 

  Positive Class:  M 

[[ Classification Testing Summary ]]
                   Reference 
        Estimated  M   R   
                M  23   4
                R   5  21

                   Overall  
      Sensitivity  0.8214 
      Specificity  0.8400 
Balanced Accuracy  0.8307 
              PPV  0.8519 
              NPV  0.8077 
               F1  0.8364 
         Accuracy  0.8302 

  Positive Class:  M 

[2019-08-02 17:27:43 bag] Run completed in 0.03 minutes (Real: 2.06; User: 0.32; System: 0.26)