15 Boosting

  .:rtemis 0.79: Welcome, egenn
  [x86_64-apple-darwin15.6.0 (64-bit): Defaulting to 4/4 available cores]
  Online documentation & vignettes: https://rtemis.netlify.com

Boosting is one of the most powerful techniques in supervised learning. rtemis allows you to easily apply boosting to any learner for regression. (As with bagging, however, there is no point in boosting an ordinary least squares model: a sum of linear fits is itself a linear fit, and the first OLS fit already minimizes the squared error, leaving no residual structure for subsequent learners to capture.)

Let’s create some synthetic data:
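The data-generating code is not shown in this chapter; a minimal sketch consistent with the dimensions reported below (374 training cases, 50 features) might look like the following, where the variable names and seed are illustrative assumptions, not the original code:

```r
library(rtemis)

# Random feature matrix: 500 cases x 50 features
x <- rnormmat(500, 50, seed = 2019)
# Linear outcome plus Gaussian noise
w <- rnorm(50)
y <- c(x %*% w + rnorm(500))

# Subsample into training and validation sets
# (resample() defaults to ~75% training, hence ~374 training cases)
res <- resample(y, seed = 2019)
x.train <- x[res$Subsample_1, ]
y.train <- y[res$Subsample_1]
x.valid <- x[-res$Subsample_1, ]
y.valid <- y[-res$Subsample_1]
```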

15.1 Boost CART stumps

Boosting works best when you train a long series of weak learners. Let’s start by boosting the simplest possible trees: those with a depth of 1, a.k.a. stumps.
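The call that produced the output below is not shown; assuming `boost()` takes the base learner via `mod` and its hyperparameters via `mod.params` (both names are echoed in the Parameters block below, as are `max.iter` and `learning.rate`), it would look roughly like this:

```r
# Boost depth-1 CART trees (stumps)
# x.train/y.train and x.valid/y.valid are the (hypothetical) split from above
boost.stump <- boost(x.train, y.train,
                     x.valid = x.valid, y.valid = y.valid,
                     mod = "cart",
                     mod.params = list(maxdepth = 1),
                     max.iter = 50,
                     learning.rate = 0.1)
```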

[2019-08-02 17:28:51 boost] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 374 x 50 
    Training outcome: 374 x 1 
    Testing features: Not available
     Testing outcome: Not available

[[ Parameters ]]
               mod: CART 
        mod.params:  
                    maxdepth: 1 
              init: -0.182762669446564 
          max.iter: 50 
     learning.rate: 0.1 
         tolerance: 0 
   tolerance.valid: 1e-05 
[2019-08-02 17:28:53 boost] [ Boosting Classification and Regression Trees... ] 
[2019-08-02 17:28:53 boost] Iteration #5: Training MSE = 49.08; Validation MSE = 52.02 
[2019-08-02 17:28:53 boost] Iteration #10: Training MSE = 45.91; Validation MSE = 49.65 
[2019-08-02 17:28:54 boost] Iteration #15: Training MSE = 43.30; Validation MSE = 47.54 
[2019-08-02 17:28:54 boost] Iteration #20: Training MSE = 40.92; Validation MSE = 45.75 
[2019-08-02 17:28:54 boost] Iteration #25: Training MSE = 38.78; Validation MSE = 44.10 
[2019-08-02 17:28:54 boost] Iteration #30: Training MSE = 36.85; Validation MSE = 42.97 
[2019-08-02 17:28:55 boost] Iteration #35: Training MSE = 35.08; Validation MSE = 41.76 
[2019-08-02 17:28:55 boost] Iteration #40: Training MSE = 33.45; Validation MSE = 40.69 
[2019-08-02 17:28:55 boost] Iteration #45: Training MSE = 31.93; Validation MSE = 39.59 
[2019-08-02 17:28:55 boost] Iteration #50: Training MSE = 30.53; Validation MSE = 38.68 
[2019-08-02 17:28:55 boost] Reached max iterations 


[[ Regression Training Summary ]]
    MSE = 30.53 (42.94%)
   RMSE = 5.53 (24.46%)
    MAE = 4.36 (23.22%)
      r = 0.82 (p = 1.3e-91)
    rho = 0.77 (p = 0)
   R sq = 0.43

[2019-08-02 17:28:55 boost] Run completed in 0.06 minutes (Real: 3.90; User: 2.74; System: 0.26) 

We notice that the validation error is considerably higher than the training error, and its trajectory is also less smooth.

15.2 Boost CART stumps: step slower

To get better results out of boosting, it usually helps to decrease the learning rate and increase the number of iterations. From an optimization point of view, a lower learning rate does not simply mean taking more, smaller steps instead of fewer bigger ones; it makes the algorithm follow a different, more precise optimization path.
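Under the same assumed interface as above, the slower run below would correspond to changing just two arguments:

```r
# Same stumps, but with 10x the iterations at half the learning rate
boost.stump.slow <- boost(x.train, y.train,
                          x.valid = x.valid, y.valid = y.valid,
                          mod = "cart",
                          mod.params = list(maxdepth = 1),
                          max.iter = 500,        # was 50
                          learning.rate = 0.05)  # was 0.1
```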

[2019-08-02 17:29:01 boost] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 374 x 50 
    Training outcome: 374 x 1 
    Testing features: Not available
     Testing outcome: Not available

[[ Parameters ]]
               mod: CART 
        mod.params:  
                    maxdepth: 1 
              init: -0.182762669446564 
          max.iter: 500 
     learning.rate: 0.05 
         tolerance: 0 
   tolerance.valid: 1e-05 
[2019-08-02 17:29:01 boost] [ Boosting Classification and Regression Trees... ] 
[2019-08-02 17:29:05 boost] Iteration #100: Training MSE = 30.84; Validation MSE = 38.78 
[2019-08-02 17:29:10 boost] Iteration #200: Training MSE = 20.96; Validation MSE = 31.99 
[2019-08-02 17:29:14 boost] Iteration #300: Training MSE = 15.16; Validation MSE = 27.27 
[2019-08-02 17:29:18 boost] Iteration #400: Training MSE = 11.38; Validation MSE = 23.75 
[2019-08-02 17:29:22 boost] Iteration #500: Training MSE = 8.78; Validation MSE = 21.26 
[2019-08-02 17:29:22 boost] Reached max iterations 


[[ Regression Training Summary ]]
    MSE = 8.78 (83.58%)
   RMSE = 2.96 (59.48%)
    MAE = 2.36 (58.35%)
      r = 0.95 (p = 1.1e-195)
    rho = 0.94 (p = 0)
   R sq = 0.84

[2019-08-02 17:29:22 boost] Run completed in 0.36 minutes (Real: 21.44; User: 17.12; System: 1.59) 

15.3 Boost deep CARTs

Let’s see what can go wrong if your base learners are too strong:
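A sketch of the corresponding call (again assuming the `boost()` interface inferred from the Parameters block) would simply raise `maxdepth` so that each base learner can fit the training data almost perfectly on its own:

```r
# Deep trees are strong learners: each one can nearly memorize the data
boost.deep <- boost(x.train, y.train,
                    x.valid = x.valid, y.valid = y.valid,
                    mod = "cart",
                    mod.params = list(maxdepth = 20),
                    max.iter = 50,
                    learning.rate = 0.1)
```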

[2019-08-02 17:30:47 boost] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 374 x 50 
    Training outcome: 374 x 1 
    Testing features: Not available
     Testing outcome: Not available

[[ Parameters ]]
               mod: CART 
        mod.params:  
                    maxdepth: 20 
              init: -0.182762669446564 
          max.iter: 50 
     learning.rate: 0.1 
         tolerance: 0 
   tolerance.valid: 1e-05 
[2019-08-02 17:30:47 boost] [ Boosting Classification and Regression Trees... ] 
[2019-08-02 17:30:47 boost] Iteration #5: Training MSE = 25.97; Validation MSE = 44.30 
[2019-08-02 17:30:47 boost] Iteration #10: Training MSE = 12.59; Validation MSE = 36.50 
[2019-08-02 17:30:47 boost] Iteration #15: Training MSE = 6.06; Validation MSE = 33.27 
[2019-08-02 17:30:48 boost] Iteration #20: Training MSE = 2.96; Validation MSE = 31.19 
[2019-08-02 17:30:48 boost] Iteration #25: Training MSE = 1.46; Validation MSE = 30.16 
[2019-08-02 17:30:48 boost] Iteration #30: Training MSE = 0.72; Validation MSE = 29.47 
[2019-08-02 17:30:48 boost] Iteration #35: Training MSE = 0.35; Validation MSE = 28.75 
[2019-08-02 17:30:48 boost] Iteration #40: Training MSE = 0.17; Validation MSE = 28.29 
[2019-08-02 17:30:49 boost] Iteration #45: Training MSE = 0.09; Validation MSE = 28.09 
[2019-08-02 17:30:49 boost] Iteration #50: Training MSE = 0.04; Validation MSE = 27.86 
[2019-08-02 17:30:49 boost] Reached max iterations 


[[ Regression Training Summary ]]
    MSE = 0.04 (99.92%)
   RMSE = 0.20 (97.21%)
    MAE = 0.16 (97.10%)
      r = 1.00 (p = 0)
    rho = 1.00 (p = 0)
   R sq = 1.00

[2019-08-02 17:30:49 boost] Run completed in 0.04 minutes (Real: 2.47; User: 1.91; System: 0.18) 

We notice that the training error quickly approached zero while the validation error remained high, i.e. the strong base learners overfit the data.

15.4 Boost any learner

While decision trees are the most common base learners used in boosting, you can boost any algorithm:

15.4.1 Projection Pursuit Regression
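Swapping in a different base learner should only require changing `mod` (the output below reports `mod: PPR` with an empty `mod.params` list, i.e. default hyperparameters); a sketch under the same interface assumptions:

```r
# Boost Projection Pursuit Regression models with default settings
boost.ppr <- boost(x.train, y.train,
                   x.valid = x.valid, y.valid = y.valid,
                   mod = "ppr",
                   max.iter = 10,
                   learning.rate = 0.1)
```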

[2019-08-02 17:30:55 boost] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 374 x 50 
    Training outcome: 374 x 1 
    Testing features: Not available
     Testing outcome: Not available

[[ Parameters ]]
               mod: PPR 
        mod.params: (empty list) 
              init: -0.182762669446564 
          max.iter: 10 
     learning.rate: 0.1 
         tolerance: 0 
   tolerance.valid: 1e-05 
[2019-08-02 17:30:55 boost] [ Boosting Projection Pursuit Regression... ] 
[2019-08-02 17:30:56 boost] Iteration #5: Training MSE = 18.72; Validation MSE = 20.33 
[2019-08-02 17:30:57 boost] Iteration #10: Training MSE = 6.55; Validation MSE = 7.79 
[2019-08-02 17:30:57 boost] Reached max iterations 


[[ Regression Training Summary ]]
    MSE = 6.55 (87.76%)
   RMSE = 2.56 (65.01%)
    MAE = 1.99 (64.99%)
      r = 1.00 (p = 0)
    rho = 1.00 (p = 0)
   R sq = 0.88

[2019-08-02 17:30:57 boost] Run completed in 0.03 minutes (Real: 1.87; User: 1.60; System: 0.05) 

15.4.2 Multivariate Adaptive Regression Splines (MARS)
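Likewise for MARS (the output below shows `mod: MARS`, default `mod.params`, and 30 iterations); the assumed call:

```r
# Boost MARS models with default hyperparameters
boost.mars <- boost(x.train, y.train,
                    x.valid = x.valid, y.valid = y.valid,
                    mod = "mars",
                    max.iter = 30,
                    learning.rate = 0.1)
```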

[2019-08-02 17:30:57 boost] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 374 x 50 
    Training outcome: 374 x 1 
    Testing features: Not available
     Testing outcome: Not available

[[ Parameters ]]
               mod: MARS 
        mod.params: (empty list) 
              init: -0.182762669446564 
          max.iter: 30 
     learning.rate: 0.1 
         tolerance: 0 
   tolerance.valid: 1e-05 
[2019-08-02 17:30:57 boost] [ Boosting Multivariate Adaptive Regression Splines... ] 
[2019-08-02 17:30:59 boost] Iteration #5: Training MSE = 25.85; Validation MSE = 31.00 
[2019-08-02 17:31:01 boost] Iteration #10: Training MSE = 18.45; Validation MSE = 24.42 
[2019-08-02 17:31:02 boost] Iteration #15: Training MSE = 13.70; Validation MSE = 19.16 
[2019-08-02 17:31:03 boost] Iteration #20: Training MSE = 11.65; Validation MSE = 17.89 
[2019-08-02 17:31:04 boost] Iteration #25: Training MSE = 9.82; Validation MSE = 16.40 
[2019-08-02 17:31:04 boost] Iteration #30: Training MSE = 8.89; Validation MSE = 16.03 
[2019-08-02 17:31:04 boost] Reached max iterations 


[[ Regression Training Summary ]]
    MSE = 8.89 (83.39%)
   RMSE = 2.98 (59.25%)
    MAE = 2.35 (58.65%)
      r = 0.95 (p = 1e-192)
    rho = 0.94 (p = 0)
   R sq = 0.83

[2019-08-02 17:31:04 boost] Run completed in 0.12 minutes (Real: 7.12; User: 6.14; System: 0.23)