7 Data Preprocessing

  .:rtemis 0.79: Welcome, egenn
  [x86_64-apple-darwin15.6.0 (64-bit): Defaulting to 4/4 available cores]
  Online documentation & vignettes: https://rtemis.netlify.com

After visualization comes data preprocessing. It is often said that most of the time in data science is spent cleaning and preprocessing data. Depending on the data, this is often true.

Let’s start with the Sonar dataset and add some missing values for this example.
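The code that introduced the missing values is not shown here. As a minimal sketch, assuming the mlbench package (which ships the Sonar data) is installed, one way to set this up in base R is to blank out five values in V1 and five in V2. The seed and the choice of rows are arbitrary assumptions, picked only to be consistent with the checkData report in section 7.1.

```r
# Load the Sonar data; assumes the mlbench package is installed
data(Sonar, package = "mlbench")

set.seed(2019)  # arbitrary seed, used only to make the sketch reproducible

# Draw 10 distinct rows so that no case ends up missing both values
rows <- sample(nrow(Sonar), 10)
Sonar[rows[1:5], "V1"] <- NA   # 5 missing values in V1
Sonar[rows[6:10], "V2"] <- NA  # 5 missing values in V2

# Count the missing values in the two affected columns
colSums(is.na(Sonar[, c("V1", "V2")]))
```

Using ten distinct rows means each affected case is missing exactly one value.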

7.1 Check data

To check your data, use the checkData function:

  Dataset: Sonar 

  [[ Summary ]]
  208 cases with 61 features: 
  * 60 continuous features 
  * 0 integer features 
  * 1 categorical feature, which is not ordered
  * 0 constant features 
  * 2 features include 'NA' values; 10 'NA' values total
    ** Max percent missing in a feature is 2.40 % (V1)
    ** Max percent missing in a case is 1.64 % (case #10)

  [[ Recommendations ]]
  * Consider imputing missing values or use complete cases only

The output summarizes useful information about your dataset, followed by recommendations.
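The missingness lines in the report can be cross-checked by hand. The sketch below computes the same kind of figures in base R. The helper name na_report is hypothetical and not part of rtemis, and a small toy data frame stands in for Sonar so the example runs without any extra packages.

```r
# Hypothetical helper (not part of rtemis): summarize missingness in a data frame
na_report <- function(x) {
  na_by_feature <- colSums(is.na(x))  # number of NAs in each column
  na_by_case <- rowSums(is.na(x))     # number of NAs in each row
  list(
    n_features_with_na = sum(na_by_feature > 0),
    n_na_total = sum(na_by_feature),
    max_pct_feature = 100 * max(na_by_feature) / nrow(x),
    max_pct_case = 100 * max(na_by_case) / ncol(x)
  )
}

# Toy stand-in for a dataset with a few missing values
toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, 2, 3, 4), c = letters[1:4])
na_report(toy)
```

For the Sonar report above, the same arithmetic gives 5 / 208 = 2.40 % for the most affected feature and 1 / 61 = 1.64 % for the most affected case. With rtemis loaded, the report itself would presumably come from a call such as checkData(Sonar).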

7.2 Preprocess

To clean / preprocess the data, use the preprocess function. In this case we want to impute missing data. By default, preprocess uses the missRanger package to predict missing values from the available data using random forests in an iterative procedure.

[2019-08-02 17:08:44 preprocess] Imputing missing values using missRanger... 

Missing value imputation by random forests

  Variables to impute:      V1, V2
  Variables used to impute: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, Class
iter 1: ..
iter 2: ..
iter 3: ..
iter 4: ..
[2019-08-02 17:08:51 preprocess] Done 
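A log like the one above would come from a call along the following lines. This is a sketch under stated assumptions, not necessarily the exact call used here: it assumes that rtemis, mlbench and missRanger are installed, and that imputation is switched on through an argument named impute. Check ?preprocess for the signature in your installed version. The rows set to NA are arbitrary.

```r
# Assumes the rtemis, mlbench and missRanger packages are installed
library(rtemis)
data(Sonar, package = "mlbench")

# Introduce a few missing values (arbitrary rows, for illustration only)
Sonar[1:5, "V1"] <- NA
Sonar[6:10, "V2"] <- NA

# Assumption: `impute = TRUE` requests imputation; verify with ?preprocess
Sonar.pre <- preprocess(Sonar, impute = TRUE)

# Base R check that no missing values remain
anyNA(Sonar.pre)
```

The result is stored under the name Sonar.pre, matching the dataset name in the report that follows.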

Let’s now check our preprocessed data:

  Dataset: Sonar.pre 

  [[ Summary ]]
  208 cases with 61 features: 
  * 60 continuous features 
  * 0 integer features 
  * 1 categorical feature, which is not ordered
  * 0 constant features 
  * 0 features include 'NA' values

  [[ Recommendations ]]
  * Everything looks good