# 11 elevate: Automatic tuning & testing

```
.:rtemis 0.79: Welcome, egenn
[x86_64-apple-darwin15.6.0 (64-bit): Defaulting to 4/4 available cores]
Online documentation & vignettes: https://rtemis.netlify.com
```

**rtemis** supports a large number of algorithms for supervised learning. Individual functions to access each algorithm begin with `s.`. These functions output a single trained model and may, optionally, perform internal resampling of the training set to tune hyperparameters before training a final model on the full training set. You can get a full list of supported algorithms by running `modSelect()`.

**elevate** is the main supervised learning function, which performs nested resampling to tune hyperparameters (*inner resampling*) and assess generalizability (*outer resampling*) using any rtemis learner. All supervised learning functions (`s.` functions and **elevate**) can accept either a feature matrix / data frame, `x`, and an outcome vector, `y`, separately, or a combined dataset `x`
alone, in which case the last column should be the outcome.

For classification, the outcome should be a factor where the first level is the ‘positive’ case.

This vignette will walk through the analysis of an example dataset using `elevate`.

## 11.1 Classification

Let’s use the Sonar dataset, available in the **mlbench** package.
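A minimal sketch of the setup; the exact call may differ from the original vignette. The last column of the Sonar data frame is the outcome, so it can be passed to `elevate` on its own:

```
library(rtemis)
data(Sonar, package = "mlbench")
# Last column, "Class", is the outcome factor
fit <- elevate(Sonar)
```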

```
[2019-08-02 17:14:27 elevate] Hello, egenn
[[ Classification Input Summary ]]
Training features: 208 x 60
Training outcome: 208 x 1
[2019-08-02 17:14:28 resLearn] Training Random Forest (ranger) on 10 stratified subsamples...
[[ elevate RANGER ]]
N repeats = 1
N resamples = 10
Resampler = strat.sub
Mean Balanced Accuracy of 10 test sets in each repeat = 0.83
```

```
[2019-08-02 17:14:31 elevate] Run completed in 0.08 minutes (Real: 4.52; User: 5.35; System: 0.26)
```

By default, **elevate** trains a random forest (via the **ranger** package, which uses all available CPU threads) on 10 stratified subsamples to assess generalizability, with an 80% training / 20% testing split.

### 11.1.1 Plot confusion matrix

The output of **elevate** is an object that includes methods for plotting. `fit$plot()` plots the confusion matrix of all aggregated test sets. It is really an alias for `fit$plotPredicted()`. The confusion matrix of the aggregated training sets can be plotted using `fit$plotFitted()`.
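For example, using `fit` from the run above:

```
fit$plot()        # confusion matrix of aggregated test sets
fit$plotFitted()  # confusion matrix of aggregated training sets
```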

### 11.1.2 Plot ROC

`fit$plotROC()` plots the ROC curve of the aggregated test sets. Similarly to `fit$plot()`, `fit$plotROC()` is an alias for `fit$plotROCpredicted()`, and `fit$plotROCfitted()` is also available.
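Both are called directly on the fitted object:

```
fit$plotROC()        # ROC of aggregated test sets
fit$plotROCfitted()  # ROC of aggregated training sets
```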

### 11.1.3 Plot variable importance

Finally, `fit$plotVarImp()` plots the variable importance of the predictors. Use the `plot.top` argument to limit the plot to that many top features.
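For example, to limit the plot to the 10 most important features (10 is an arbitrary choice here):

```
fit$plotVarImp(plot.top = 10)
```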

### 11.1.4 Describe

Each **elevate** object includes a very nifty `describe` function:

`Classification was performed using Random Forest (ranger). Model generalizability was assessed using 10 stratified subsamples. The mean Balanced Accuracy across all resamples was 0.83.`

## 11.2 Regression

### 11.2.1 Create synthetic data

We create an input matrix of random numbers drawn from a normal distribution using `rnormmat`, and a vector of random weights. We matrix-multiply the input matrix with the weights and add some noise to create our output. Finally, we replace some values with NA.
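A sketch of that setup, matching the dimensions reported by `checkData` below (400 cases, 20 features, 30 missing values); the seed is an arbitrary assumption:

```
x <- rnormmat(400, 20, seed = 2018)  # 400 x 20 matrix of N(0, 1) draws
w <- rnorm(20)                       # random weights
y <- x %*% w + rnorm(400)            # weighted sum plus noise
x[sample(length(x), 30)] <- NA       # replace 30 values with NA
```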

### 11.2.2 Scenario 1: checkData - preprocess - elevate

#### 11.2.2.1 Step 1: Check data with **checkData**

The first step of every analysis should be to get some information on our data and perform some basic checks.
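Assuming the features are stored in `x`:

```
checkData(x)
```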

```
Dataset: x
[[ Summary ]]
400 cases with 20 features:
* 20 continuous features
* 0 integer features
* 0 categorical features
* 0 constant features
* 14 features include 'NA' values; 30 'NA' values total
** Max percent missing in a feature is 1.25 % (V17)
** Max percent missing in a case is 10 % (case #125)
[[ Recommendations ]]
* Consider imputing missing values or use complete cases only
```

#### 11.2.2.2 Step 2: Preprocess data with **preprocess**
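A minimal sketch; `impute = TRUE` is assumed to be the relevant argument, mirroring the `rtset.preprocess(impute = TRUE)` call used later in this chapter:

```
x <- preprocess(x, impute = TRUE)
```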

```
[2019-08-02 17:14:35 preprocess] Imputing missing values using missRanger...
Missing value imputation by random forests
Variables to impute: V2, V3, V4, V5, V6, V7, V9, V10, V11, V13, V14, V17, V18, V19
Variables used to impute: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20
iter 1: ..............
[2019-08-02 17:14:38 preprocess] Done
```

Check the data again:

```
Dataset: x
[[ Summary ]]
400 cases with 20 features:
* 20 continuous features
* 0 integer features
* 0 categorical features
* 0 constant features
* 0 features include 'NA' values
[[ Recommendations ]]
* Everything looks good
```

#### 11.2.2.3 Step 3: Train and test a model using 10 stratified subsamples
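A minimal sketch of the call; the model name `'mars'` is an assumption matching the `[[ elevate MARS ]]` header in the output below:

```
fit <- elevate(x, y, mod = 'mars')
```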

```
[2019-08-02 17:14:38 elevate] Hello, egenn
[[ Regression Input Summary ]]
Training features: 400 x 20
Training outcome: 400 x 1
[2019-08-02 17:14:38 resLearn] Training Multivariate Adaptive Regression Splines on 10 stratified subsamples...
[[ elevate MARS ]]
N repeats = 1
N resamples = 10
Resampler = strat.sub
Mean MSE of 10 resamples in each repeat = 2.71
Mean MSE reduction in each repeat = 84.04%
```

```
[2019-08-02 17:14:41 elevate] Run completed in 0.05 minutes (Real: 2.91; User: 1.90; System: 0.10)
```

#### 11.2.2.4 Step 4: Plot true vs. predicted
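As with classification, `fit$plot()` is an alias for `fit$plotPredicted()`; for regression it plots true vs. predicted values of the aggregated test sets:

```
fit$plot()
```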

#### 11.2.2.5 Step 5: Describe

`Regression was performed using Multivariate Adaptive Regression Splines. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all resamples was 0.84.`

### 11.2.3 Scenario 2: elevate + preprocess

`elevate` allows you to automatically run `preprocess` on a dataset by specifying the `.preprocess` argument. In **rtemis**, arguments that add an extra step to the pipeline begin with a dot. `elevate`’s `.preprocess` accepts the same arguments as the `preprocess` function.

For cases like this, **rtemis** provides helper functions with autocomplete functionality, so you can avoid having to look up the original function’s usage (in this case, `preprocess`).

We create a wide feature set and combine `x` and `y` to show how `elevate` can work directly on a single data frame where the last column is the outcome. For this example, we shall use projection pursuit regression.

```
x <- rnormmat(400, 100, seed = 2018)
w <- rnorm(100)
y <- x %*% w + rnorm(400)
x[sample(length(x), 60)] <- NA
dat <- data.frame(x, y)
fit <- elevate(dat, mod = 'ppr', .preprocess = rtset.preprocess(impute = TRUE))
```

```
[2019-08-02 17:25:26 elevate] Hello, egenn
[[ Regression Input Summary ]]
Training features: 400 x 100
Training outcome: 400 x 1
[2019-08-02 17:25:28] Imputing missing values using missRanger...
Missing value imputation by random forests
Variables to impute: X1, X5, X7, X9, X12, X13, X20, X21, X22, X24, X25, X27, X28, X30, X34, X35, X37, X40, X42, X44, X46, X47, X50, X52, X55, X56, X57, X58, X63, X65, X66, X67, X68, X70, X72, X76, X80, X82, X83, X86, X89, X90, X91, X93, X94, X98, X99, X100
Variables used to impute: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
iter 1: ................................................
[2019-08-02 17:25:46] Done
[2019-08-02 17:25:46 resLearn] Training Projection Pursuit Regression on 10 stratified subsamples...
[[ elevate PPR ]]
N repeats = 1
N resamples = 10
Resampler = strat.sub
Mean MSE of 10 resamples in each repeat = 3.58
Mean MSE reduction in each repeat = 96.48%
```

```
[2019-08-02 17:25:55 elevate] Run completed in 0.47 minutes (Real: 28.32; User: 57.64; System: 0.62)
```

Notice how each message includes the date and time, followed by the name of the function being executed. For example, note above how `preprocess.default` comes in to perform data imputation before model training.

`preprocess.default` signifies it is working on an object of class `data.frame`. There is also a similar `preprocess.data.table` that works on `data.table` objects. This is an example of S3 method dispatch, whereby R automatically chooses the appropriate function depending on input type.

`Regression was performed using Projection Pursuit Regression. Data was preprocessed by imputing missing values using missRanger. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all resamples was 0.96.`

### 11.2.4 Scenario 3: elevate + decompose

`elevate`

can also decompose a dataset ahead of modeling. We can direct `elevate`

to perform decomposition ahead of modeling using the `.decompose`

argument.

```
x <- rnormmat(400, 200)
w <- rnorm(200)
y <- x %*% w + rnorm(400)
dat <- data.frame(x, y)
fit <- elevate(dat, 'glm', .decompose = rtset.decompose(decom = "PCA", k = 10))
```

```
[2019-08-02 17:25:55 elevate] Hello, egenn
[[ Regression Input Summary ]]
Training features: 400 x 200
Training outcome: 400 x 1
[2019-08-02 17:25:55 d.PCA] Hello, egenn
[2019-08-02 17:25:55 d.PCA] ||| Input has dimensions 400 rows by 200 columns,
[2019-08-02 17:25:55 d.PCA] interpreted as 400 cases with 200 features.
[2019-08-02 17:25:55 d.PCA] Performing Principal Component Analysis...
[2019-08-02 17:25:55 d.PCA] Run completed in 2.9e-03 minutes (Real: 0.17; User: 0.15; System: 0.01)
[[ Regression Input Summary ]]
Training features: 400 x 10
Training outcome: 400 x 1
[2019-08-02 17:25:55 resLearn] Training Generalized Linear Model on 10 stratified subsamples...
[[ elevate GLM ]]
N repeats = 1
N resamples = 10
Resampler = strat.sub
Mean MSE of 10 resamples in each repeat = 185.36
Mean MSE reduction in each repeat = -1.88%
```

```
[2019-08-02 17:25:56 elevate] Run completed in 0.01 minutes (Real: 0.53; User: 0.44; System: 0.04)
```

`Regression was performed using Generalized Linear Model. Input was projected to 10 dimensions using Principal Component Analysis. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all resamples was -0.02.`