17 Handling Imbalanced Data

  .:rtemis 0.79: Welcome, egenn
  [x86_64-apple-darwin15.6.0 (64-bit): Defaulting to 4/4 available cores]
  Online documentation & vignettes: https://rtemis.netlify.com

In classification problems, it is common for outcome classes to appear with different frequencies. This is called imbalanced data. Consider, for example, a binary classification problem where the positive class (the ‘events’) appears with 5% probability. Naively applying a learning algorithm without accounting for this class imbalance may lead to a model that always predicts the majority class, which trivially achieves 95% accuracy while never detecting an event.

To handle imbalanced data, we make considerations during model training and assessment.

17.1 Model Training

There are a few different ways to address the problem of imbalanced data during training. We’ll consider the three main ones; a compact sketch of the corresponding learner arguments follows the list:

  • Inverse Probability Weighting
    We up-weight cases of the minority class and down-weight cases of the majority class, in inverse proportion to class frequency. This is called Inverse Probability Weighting (IPW) and is enabled by default in rtemis for all classification learning algorithms that support case weights: the logical argument ipw, which is TRUE by default, controls whether IPW is used.

  • Upsampling the minority class
    We randomly resample (with replacement) cases from the minority class until it reaches the size of the majority class. The effect is similar to up-weighting with IPW. The logical argument upsample, available in all rtemis learners that support classification, controls whether the minority class is upsampled. (If it is set to TRUE, the ipw argument becomes irrelevant, since the sample is now balanced.)

  • Downsampling the majority class
    Conversely, we randomly subsample the majority class to reach the size of the minority class. The logical argument downsample controls this behavior.
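
As a quick reference, here is a hedged sketch of how these options map to learner arguments (x and y stand for any training features and outcome; normally you would use only one of the three corrections per model):

  mod.ipw  <- s.GLM(x, y)                      # IPW: ipw = TRUE is the default
  mod.none <- s.GLM(x, y, ipw = FALSE)         # no imbalance correction
  mod.up   <- s.GLM(x, y, upsample = TRUE)     # upsample the minority class (ipw then irrelevant)
  mod.down <- s.GLM(x, y, downsample = TRUE)   # downsample the majority class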

17.2 Classification model performance metrics

During model selection as well as model assessment, it is crucial to use metrics that take class imbalance into consideration.
The following metrics address the issue in different ways and are reported by the modError function for all classification problems (a short worked example follows the list):

  • Balanced Accuracy: the mean per-class Sensitivity, \[\frac{1}{K}\sum_{i=1}^{K} Sensitivity_i\] where K is the number of classes. In the binary case, this is equal to the mean of Sensitivity and Specificity.

  • F1: the harmonic mean of Sensitivity (a.k.a. Recall) and Positive Predictive Value (a.k.a. Precision) \[F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\]

  • AUC: the Area Under the ROC Curve, i.e. the area under the curve of True Positive Rate vs. False Positive Rate (equivalently, Sensitivity vs. 1 - Specificity)
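
To make the definitions concrete, here is a minimal worked sketch that computes these metrics by hand from a 2 x 2 confusion matrix, using the counts from the GLM testing summary in section 17.4.1 below (modError reports all of these automatically):

  tp <- 746; fn <- 7; fp <- 25; tn <- 13                         # counts from the no-correction GLM test set
  sensitivity <- tp / (tp + fn)                                  # 0.9907 (Recall)
  specificity <- tn / (tn + fp)                                  # 0.3421
  balanced.accuracy <- (sensitivity + specificity) / 2           # 0.6664
  precision <- tp / (tp + fp)                                    # 0.9676 (PPV)
  f1 <- 2 * precision * sensitivity / (precision + sensitivity)  # 0.9790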

17.3 Example dataset

Let’s look at a very imbalanced dataset from the Penn ML Benchmarks repository
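
Assuming the data has already been read into a data.frame named dat (the loading step is omitted here), a minimal sketch of inspecting it with checkData:

  checkData(dat)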

  Dataset: dat 

  [[ Summary ]]
  3163 cases with 26 features: 
  * 0 continuous features 
  * 25 integer features 
  * 1 categorical feature, which is not ordered
  * 0 constant features 
  * 0 features include 'NA' values

  [[ Recommendations ]]
  * Everything looks good

Let’s see how many cases we have per class in our outcome:
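
A minimal sketch, assuming the outcome is the last column of dat:

  table(dat[[ncol(dat)]])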


   1    0 
3012  151 

17.3.1 Class Imbalance

We can quantify the class imbalance of the outcome using the classImbalance function in rtemis, which computes:

\[I = K\cdot\sum_{i=1}^K (n_i/N - 1/K)^2\]
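
where K is the number of classes, n_i the number of cases in class i, and N the total number of cases. A sketch of the call, again assuming the outcome is the last column of dat:

  classImbalance(dat[[ncol(dat)]])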

[1] 0.8181583

Let’s create some resamples to use for training and testing models:
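
A hedged sketch of one way to do this with resample; the default stratified subsampling with a 75% training proportion matches the 2372/791 split in the summaries below, but the element name Subsample_1 and the seed are assumptions:

  x <- dat[, -ncol(dat)]                    # features
  y <- dat[[ncol(dat)]]                     # outcome (assumed to be the last column)
  res <- resample(y, seed = 2019)           # stratified subsampling, default 75% training
  x.train <- x[res$Subsample_1, ];  y.train <- y[res$Subsample_1]
  x.test  <- x[-res$Subsample_1, ]; y.test  <- y[-res$Subsample_1]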

17.4 GLM

17.4.1 No imbalance correction

Let’s train a GLM without inverse probability weighting or upsampling. Since ipw is TRUE by default in all rtemis supervised learning functions that support it, we have to set it to FALSE:
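
A hedged sketch of the call, using the split objects from the resample sketch above (positional arguments assumed to be x, y, x.test, y.test):

  mod.glm.none <- s.GLM(x.train, y.train, x.test, y.test, ipw = FALSE)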

[2019-08-02 18:01:36 s.GLM] Hello, egenn 

[[ Classification Input Summary ]]
   Training features: 2372 x 25 
    Training outcome: 2372 x 1 
    Testing features: 791 x 25 
     Testing outcome: 791 x 1 

[2019-08-02 18:01:38 s.GLM] Training GLM... 
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
== : prediction from a rank-deficient fit may be misleading

[[ LOGISTIC Classification Training Summary ]]
                   Reference 
        Estimated  1     0   
                1  2241  80
                0    18  33

                   Overall  
      Sensitivity  0.9920 
      Specificity  0.2920 
Balanced Accuracy  0.6420 
              PPV  0.9655 
              NPV  0.6471 
               F1  0.9786 
         Accuracy  0.9587 
              AUC  0.9365 

  Positive Class:  1 
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
== : prediction from a rank-deficient fit may be misleading

[[ LOGISTIC Classification Testing Summary ]]
                   Reference 
        Estimated  1    0   
                1  746  25
                0    7  13

                   Overall  
      Sensitivity  0.9907 
      Specificity  0.3421 
Balanced Accuracy  0.6664 
              PPV  0.9676 
              NPV  0.6500 
               F1  0.9790 
         Accuracy  0.9595 
              AUC  0.9426 

  Positive Class:  1 

[2019-08-02 18:01:38 s.GLM] Run completed in 0.03 minutes (Real: 1.75; User: 1.29; System: 0.12) 

We get almost perfect Sensitivity, but very low Specificity.

17.4.2 IPW

Let’s turn IPW on:
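
A hedged sketch of the call; since ipw = TRUE is the default, it is simply omitted:

  mod.glm.ipw <- s.GLM(x.train, y.train, x.test, y.test)   # ipw = TRUE by default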

[2019-08-02 18:01:38 s.GLM] Hello, egenn 

[2019-08-02 18:01:38 dataPrepare] Imbalanced classes: using Inverse Probability Weighting 

[[ Classification Input Summary ]]
   Training features: 2372 x 25 
    Training outcome: 2372 x 1 
    Testing features: 791 x 25 
     Testing outcome: 791 x 1 

[2019-08-02 18:01:38 s.GLM] Training GLM... 
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
== : prediction from a rank-deficient fit may be misleading

[[ LOGISTIC Classification Training Summary ]]
                   Reference 
        Estimated  1     0    
                1  1952    8
                0   307  105

                   Overall  
      Sensitivity  0.8641 
      Specificity  0.9292 
Balanced Accuracy  0.8967 
              PPV  0.9959 
              NPV  0.2549 
               F1  0.9253 
         Accuracy  0.8672 
              AUC  0.9389 

  Positive Class:  1 
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
== : prediction from a rank-deficient fit may be misleading

[[ LOGISTIC Classification Testing Summary ]]
                   Reference 
        Estimated  1    0   
                1  653   4
                0  100  34

                   Overall  
      Sensitivity  0.8672 
      Specificity  0.8947 
Balanced Accuracy  0.8810 
              PPV  0.9939 
              NPV  0.2537 
               F1  0.9262 
         Accuracy  0.8685 
              AUC  0.9522 

  Positive Class:  1 

[2019-08-02 18:01:39 s.GLM] Run completed in 4.8e-03 minutes (Real: 0.29; User: 0.16; System: 0.04) 

This looks much better! Sensitivity dropped a little, but Specificity improved a lot, and the two are now very close.

17.4.3 Upsampling

Let’s try upsampling instead of IPW:
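
A hedged sketch of the call; with upsample = TRUE the ipw setting becomes irrelevant:

  mod.glm.up <- s.GLM(x.train, y.train, x.test, y.test, upsample = TRUE)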

[2019-08-02 18:01:40 s.GLM] Hello, egenn 

[2019-08-02 18:01:40 dataPrepare] Upsampling to create balanced set... 
[2019-08-02 18:01:40 dataPrepare] 1 is majority outcome with length = 2259 

[[ Classification Input Summary ]]
   Training features: 4518 x 25 
    Training outcome: 4518 x 1 
    Testing features: 791 x 25 
     Testing outcome: 791 x 1 

[2019-08-02 18:01:40 s.GLM] Training GLM... 
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
== : prediction from a rank-deficient fit may be misleading

[[ LOGISTIC Classification Training Summary ]]
                   Reference 
        Estimated  1     0     
                1  1706   160
                0   553  2099

                   Overall  
      Sensitivity  0.7552 
      Specificity  0.9292 
Balanced Accuracy  0.8422 
              PPV  0.9143 
              NPV  0.7915 
               F1  0.8272 
         Accuracy  0.8422 
              AUC  0.8422 

  Positive Class:  1 
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
== : prediction from a rank-deficient fit may be misleading

[[ LOGISTIC Classification Testing Summary ]]
                   Reference 
        Estimated  1    0   
                1  578   1
                0  175  37

                   Overall  
      Sensitivity  0.7676 
      Specificity  0.9737 
Balanced Accuracy  0.8706 
              PPV  0.9983 
              NPV  0.1745 
               F1  0.8679 
         Accuracy  0.7775 
              AUC  0.8706 

  Positive Class:  1 

[2019-08-02 18:01:41 s.GLM] Run completed in 0.03 minutes (Real: 1.70; User: 1.04; System: 0.15) 

In this example, upsampling the minority class gives almost perfect test Specificity at the cost of lower Sensitivity.

17.4.4 Downsampling
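
Finally, let’s downsample the majority class. A hedged sketch of the call, using the downsample argument described above:

  mod.glm.down <- s.GLM(x.train, y.train, x.test, y.test, downsample = TRUE)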

[2019-08-02 18:01:42 s.GLM] Hello, egenn 

[2019-08-02 18:01:42 dataPrepare] Downsampling to balance outcome classes... 
[2019-08-02 18:01:42 dataPrepare] 0 is the minority outcome with 113 cases 

[[ Classification Input Summary ]]
   Training features: 226 x 25 
    Training outcome: 226 x 1 
    Testing features: 791 x 25 
     Testing outcome: 791 x 1 

[2019-08-02 18:01:42 s.GLM] Training GLM... 
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
== : prediction from a rank-deficient fit may be misleading

[[ LOGISTIC Classification Training Summary ]]
                   Reference 
        Estimated  1    0    
                1  106    7
                0    7  106

                   Overall  
      Sensitivity  0.9381 
      Specificity  0.9381 
Balanced Accuracy  0.9381 
              PPV  0.9381 
              NPV  0.9381 
               F1  0.9381 
         Accuracy  0.9381 
              AUC  0.9738 

  Positive Class:  1 
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
== : prediction from a rank-deficient fit may be misleading

[[ LOGISTIC Classification Testing Summary ]]
                   Reference 
        Estimated  1    0   
                1  597   1
                0  156  37

                   Overall  
      Sensitivity  0.7928 
      Specificity  0.9737 
Balanced Accuracy  0.8833 
              PPV  0.9983 
              NPV  0.1917 
               F1  0.8838 
         Accuracy  0.8015 
              AUC  0.9219 

  Positive Class:  1 

[2019-08-02 18:01:42 s.GLM] Run completed in 2.9e-03 minutes (Real: 0.17; User: 0.10; System: 0.02) 

Downsampling gives similar results to upsampling in this case.

17.5 Random forest

Some algorithms support more than one way of handling imbalanced data. See this Tech Report for techniques specific to Random Forest; it describes the “Balanced Random Forest” and “Weighted Random Forest” approaches.

17.5.1 No imbalance correction

Again, let’s begin by training a model with no correction for imbalanced data:
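
A hedged sketch of the call, disabling IPW-based weighting with ipw = FALSE:

  mod.rf.none <- s.RANGER(x.train, y.train, x.test, y.test, ipw = FALSE)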

[2019-08-02 18:01:42 s.RANGER] Hello, egenn 

[[ Classification Input Summary ]]
   Training features: 2372 x 25 
    Training outcome: 2372 x 1 
    Testing features: 791 x 25 
     Testing outcome: 791 x 1 

[[ Parameters ]]
   n.trees: 1000 
      mtry: NULL 

[2019-08-02 18:01:42 s.RANGER] Training Random Forest (ranger) Classification with 1000 trees... 

[[ RANGER Classification Training Summary ]]
                   Reference 
        Estimated  1     0    
                1  2258    3
                0     1  110

                   Overall  
      Sensitivity  0.9996 
      Specificity  0.9735 
Balanced Accuracy  0.9865 
              PPV  0.9987 
              NPV  0.9910 
               F1  0.9991 
         Accuracy  0.9983 
              AUC  1.0000 

  Positive Class:  1 

[[ RANGER Classification Testing Summary ]]
                   Reference 
        Estimated  1    0   
                1  751  13
                0    2  25

                   Overall  
      Sensitivity  0.9973 
      Specificity  0.6579 
Balanced Accuracy  0.8276 
              PPV  0.9830 
              NPV  0.9259 
               F1  0.9901 
         Accuracy  0.9810 
              AUC  0.9911 

  Positive Class:  1 

[2019-08-02 18:01:44 s.RANGER] Run completed in 0.03 minutes (Real: 1.56; User: 2.98; System: 0.16) 

17.5.2 IPW: Case weights

Now, let’s use IPW. By default, s.RANGER uses IPW to define case weights (i.e. ipw.case.weights = TRUE):
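
A hedged sketch of the call, relying on the defaults:

  mod.rf.ipw <- s.RANGER(x.train, y.train, x.test, y.test)   # ipw and ipw.case.weights TRUE by default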

[2019-08-02 18:01:44 s.RANGER] Hello, egenn 

[2019-08-02 18:01:44 dataPrepare] Imbalanced classes: using Inverse Probability Weighting 

[[ Classification Input Summary ]]
   Training features: 2372 x 25 
    Training outcome: 2372 x 1 
    Testing features: 791 x 25 
     Testing outcome: 791 x 1 

[[ Parameters ]]
   n.trees: 1000 
      mtry: NULL 

[2019-08-02 18:01:44 s.RANGER] Training Random Forest (ranger) Classification with 1000 trees... 

[[ RANGER Classification Training Summary ]]
                   Reference 
        Estimated  1     0    
                1  2245    0
                0    14  113

                   Overall  
      Sensitivity  0.9938 
      Specificity  1.0000 
Balanced Accuracy  0.9969 
              PPV  1.0000 
              NPV  0.8898 
               F1  0.9969 
         Accuracy  0.9941 
              AUC  0.9998 

  Positive Class:  1 

[[ RANGER Classification Testing Summary ]]
                   Reference 
        Estimated  1    0   
                1  748   8
                0    5  30

                   Overall  
      Sensitivity  0.9934 
      Specificity  0.7895 
Balanced Accuracy  0.8914 
              PPV  0.9894 
              NPV  0.8571 
               F1  0.9914 
         Accuracy  0.9836 
              AUC  0.9890 

  Positive Class:  1 

[2019-08-02 18:01:46 s.RANGER] Run completed in 0.03 minutes (Real: 1.72; User: 3.50; System: 0.10) 

Again, IPW increases the Specificity.

17.5.3 IPW: Class weights

Alternatively, we can use IPW to define class weights:
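
A hedged sketch; the argument name ipw.class.weights is an assumption mirroring ipw.case.weights above, so check the s.RANGER documentation for the exact name:

  mod.rf.cw <- s.RANGER(x.train, y.train, x.test, y.test,
                        ipw.case.weights = FALSE,    # disable IPW case weights
                        ipw.class.weights = TRUE)    # hypothetical argument name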

[2019-08-02 18:01:46 s.RANGER] Hello, egenn 

[2019-08-02 18:01:46 dataPrepare] Imbalanced classes: using Inverse Probability Weighting 

[[ Classification Input Summary ]]
   Training features: 2372 x 25 
    Training outcome: 2372 x 1 
    Testing features: 791 x 25 
     Testing outcome: 791 x 1 

[[ Parameters ]]
   n.trees: 1000 
      mtry: NULL 

[2019-08-02 18:01:46 s.RANGER] Training Random Forest (ranger) Classification with 1000 trees... 

[[ RANGER Classification Training Summary ]]
                   Reference 
        Estimated  1     0    
                1  2258    3
                0     1  110

                   Overall  
      Sensitivity  0.9996 
      Specificity  0.9735 
Balanced Accuracy  0.9865 
              PPV  0.9987 
              NPV  0.9910 
               F1  0.9991 
         Accuracy  0.9983 
              AUC  1.0000 

  Positive Class:  1 

[[ RANGER Classification Testing Summary ]]
                   Reference 
        Estimated  1    0   
                1  751  15
                0    2  23

                   Overall  
      Sensitivity  0.9973 
      Specificity  0.6053 
Balanced Accuracy  0.8013 
              PPV  0.9804 
              NPV  0.9200 
               F1  0.9888 
         Accuracy  0.9785 
              AUC  0.9925 

  Positive Class:  1 

[2019-08-02 18:01:48 s.RANGER] Run completed in 0.02 minutes (Real: 1.33; User: 2.80; System: 0.13) 

17.5.4 Upsampling

Now try upsampling:
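
A hedged sketch of the call:

  mod.rf.up <- s.RANGER(x.train, y.train, x.test, y.test, upsample = TRUE)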

[2019-08-02 18:01:48 s.RANGER] Hello, egenn 

[2019-08-02 18:01:48 dataPrepare] Upsampling to create balanced set... 
[2019-08-02 18:01:48 dataPrepare] 1 is majority outcome with length = 2259 

[[ Classification Input Summary ]]
   Training features: 4518 x 25 
    Training outcome: 4518 x 1 
    Testing features: 791 x 25 
     Testing outcome: 791 x 1 

[[ Parameters ]]
   n.trees: 1000 
      mtry: NULL 

[2019-08-02 18:01:48 s.RANGER] Training Random Forest (ranger) Classification with 1000 trees... 

[[ RANGER Classification Training Summary ]]
                   Reference 
        Estimated  1     0     
                1  2255     0
                0     4  2259

                   Overall  
      Sensitivity  0.9982 
      Specificity  1.0000 
Balanced Accuracy  0.9991 
              PPV  1.0000 
              NPV  0.9982 
               F1  0.9991 
         Accuracy  0.9991 
              AUC  1.0000 

  Positive Class:  1 

[[ RANGER Classification Testing Summary ]]
                   Reference 
        Estimated  1    0   
                1  749   9
                0    4  29

                   Overall  
      Sensitivity  0.9947 
      Specificity  0.7632 
Balanced Accuracy  0.8789 
              PPV  0.9881 
              NPV  0.8788 
               F1  0.9914 
         Accuracy  0.9836 
              AUC  0.9895 

  Positive Class:  1 

[2019-08-02 18:01:51 s.RANGER] Run completed in 0.05 minutes (Real: 3.28; User: 7.14; System: 0.33) 

Coming soon: Model Calibration