# 11 elevate: Automatic tuning & testing

```
.:rtemis 0.8.0: Welcome, egenn
[x86_64-apple-darwin17.0 (64-bit): Defaulting to 4/4 available cores]
Documentation & vignettes: https://rtemis.netlify.com
```

**rtemis** supports a large number of algorithms for supervised learning. Individual functions to access each algorithm begin with `s.`. These functions output a single trained model and may, optionally, perform internal resampling of the training set to tune hyperparameters before training a final model on the full training set. You can get a full list of supported algorithms by running `modSelect()`.

**elevate** is the main supervised learning function, which performs nested resampling to tune hyperparameters (*inner resampling*) and assess generalizability (*outer resampling*) using any rtemis learner. All supervised learning functions (`s.` functions and **elevate**) can accept either a feature matrix / data frame, `x`, and an outcome vector, `y`, separately, or a combined dataset `x` alone, in which case the last column should be the outcome.

For classification, the outcome should be a factor where the first level is the ‘positive’ case.

This vignette will walk through the analysis of an example dataset using **elevate**.

## 11.1 Classification

Let’s use the sonar dataset, available in the **mlbench** package.
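The call that produced the output below is missing from the extracted text; a minimal sketch, assuming the `Sonar` dataset from **mlbench** (208 cases, 60 features, factor outcome `Class` in the last column) and `elevate`'s defaults:

```r
library(rtemis)
# Sonar: 208 cases x 60 features, outcome "Class" in the last column
data(Sonar, package = "mlbench")
fit <- elevate(Sonar)
```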

```
[2020-06-23 08:20:09 elevate] Hello, egenn
[[ Classification Input Summary ]]
Training features: 208 x 60
Training outcome: 208 x 1
[2020-06-23 08:20:11 resLearn] Training Random Forest (ranger) on 10 stratified subsamples...
[[ elevate RANGER ]]
N repeats = 1
N resamples = 10
Resampler = strat.sub
Mean Balanced Accuracy of 10 test sets in each repeat = 0.81
```

```
[2020-06-23 08:20:13 elevate] Run completed in 0.07 minutes (Real: 4.19; User: 5.21; System: 0.25)
```

By default, **elevate** uses random forest (via the **ranger** package, which uses all available CPU threads) on 10 stratified subsamples to assess generalizability, with an 80% training / 20% testing split.

### 11.1.1 Plot confusion matrix

The output of **elevate** is an object that includes methods for plotting. `fit$plot()` plots the confusion matrix of all aggregated test sets; it is an alias for `fit$plotPredicted()`. The confusion matrix of the aggregated training sets can be plotted using `fit$plotFitted()`.

### 11.1.2 Plot ROC

Similarly to `fit$plot()`, `fit$plotROC()` is an alias for `fit$plotROCpredicted()`, and `fit$plotROCfitted()` is also available.

### 11.1.3 Plot variable importance

Finally, `fit$plotVarImp()` plots the variable importance of the predictors. Use the `plot.top` argument to limit the plot to that many top features.
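The plotting methods described in this section can be collected into one sketch (method names are taken from the text above; the `plot.top` value is illustrative):

```r
fit$plot()                      # confusion matrix of aggregated test sets
fit$plotROC()                   # ROC of aggregated test sets
fit$plotVarImp(plot.top = 10)   # variable importance, top 10 features
```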

## 11.2 Regression

### 11.2.1 Create synthetic data

We create an input matrix of random numbers drawn from a normal distribution using `rnormmat`, and a vector of random weights. We matrix-multiply the input matrix with the weights and add some noise to create our output. Finally, we replace some values with NA.
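The steps above can be sketched as follows; `rnormmat` is the rtemis helper used later in this chapter, the seed is an assumption, and the NA count (30) matches the `checkData` output below:

```r
library(rtemis)
x <- rnormmat(400, 20, seed = 2018)   # 400 cases x 20 continuous features
w <- rnorm(20)                        # random weights
y <- x %*% w + rnorm(400)             # weighted sum plus Gaussian noise
x[sample(length(x), 30)] <- NA        # replace 30 values with NA
```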

### 11.2.2 Scenario 1: checkData - preprocess - elevate

#### 11.2.2.1 Step 1: Check data with **checkData**

The first step of every analysis should be to get some information on our data and perform some basic checks.
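The check that produced the output below is presumably a single call:

```r
checkData(x)
```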

```
Dataset: x
[[ Summary ]]
400 cases with 20 features:
* 20 continuous features
* 0 integer features
* 0 categorical features
* 0 constant features
* 0 duplicated cases
* 15 features include 'NA' values; 30 'NA' values total
** Max percent missing in a feature is 1.50% (V14)
** Max percent missing in a case is 10% (case #163)
[[ Recommendations ]]
* Consider imputing missing values or use complete cases only
```

#### 11.2.2.2 Step 2: Preprocess data with **preprocess**
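The preprocessing call itself is not shown in the extracted text; a sketch assuming `preprocess` accepts an `impute` argument, by analogy with the `rtset.preprocess(impute = TRUE)` call used later in this chapter:

```r
x <- preprocess(x, impute = TRUE)
```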

```
[2020-06-23 08:20:15 preprocess] Imputing missing values using missRanger...
Missing value imputation by random forests
Variables to impute: V1, V2, V3, V4, V5, V8, V9, V11, V12, V13, V14, V17, V18, V19, V20
Variables used to impute: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20
iter 1: ...............
[2020-06-23 08:20:18 preprocess] Done
```

Check the data again:
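Presumably the same call as before:

```r
checkData(x)
```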

```
Dataset: x
[[ Summary ]]
400 cases with 20 features:
* 20 continuous features
* 0 integer features
* 0 categorical features
* 0 constant features
* 0 duplicated cases
* 0 features include 'NA' values
[[ Recommendations ]]
* Everything looks good
```

#### 11.2.2.3 Step 3: Train and test a model using 10 stratified subsamples
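The training call is missing from the extracted text; a sketch assuming `'mars'` is the model identifier accepted by `elevate`, by analogy with the `'ppr'` and `'glm'` calls shown later in this chapter:

```r
fit <- elevate(x, y, mod = 'mars')
```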

```
[2020-06-23 08:20:18 elevate] Hello, egenn
```

```
Warning in if (class(y) == "character") {: the condition has length > 1 and only
the first element will be used
```

```
[[ Regression Input Summary ]]
Training features: 400 x 20
Training outcome: 400 x 1
[2020-06-23 08:20:18 resLearn] Training Multivariate Adaptive Regression Splines on 10 stratified subsamples...
[[ Parameters ]]
pmethod: forward
degree: 2
nprune: NULL
ncross: 1
nfold: 4
penalty: 3
nk: 41
[[ elevate MARS ]]
N repeats = 1
N resamples = 10
Resampler = strat.sub
Mean MSE of 10 resamples in each repeat = 7.66
Mean MSE reduction in each repeat = 75.60%
```

```
[2020-06-23 08:20:20 elevate] Run completed in 0.04 minutes (Real: 2.47; User: 1.90; System: 0.13)
```

### 11.2.3 Scenario 2: elevate + preprocess

`elevate` allows you to automatically run `preprocess` on a dataset by specifying the `.preprocess` argument. In **rtemis**, arguments that add an extra step to the pipeline begin with a dot. `elevate`'s `.preprocess` accepts the same arguments as the `preprocess` function. For cases like this, **rtemis** provides helper functions with autocomplete support, so you do not have to look up the original function's usage (in this case, `preprocess`).

We create a wide feature set and combine `x` and `y` to show how **elevate** can work directly on a single data frame where the last column is the outcome. For this example, we shall use projection pursuit regression.

```
x <- rnormmat(400, 100, seed = 2018)
w <- rnorm(100)
y <- x %*% w + rnorm(400)
x[sample(length(x), 60)] <- NA
dat <- data.frame(x, y)
fit <- elevate(dat, mod = 'ppr', .preprocess = rtset.preprocess(impute = TRUE))
```

```
[2020-06-23 08:20:21 elevate] Hello, egenn
[[ Regression Input Summary ]]
Training features: 400 x 100
Training outcome: 400 x 1
[2020-06-23 08:20:21 resLearn] Training Projection Pursuit Regression on 10 stratified subsamples...
[[ elevate PPR ]]
N repeats = 1
N resamples = 10
Resampler = strat.sub
Mean MSE of 10 resamples in each repeat = 4.30
Mean MSE reduction in each repeat = 95.83%
```

```
[2020-06-23 08:22:10 elevate] Run completed in 1.80 minutes (Real: 108.23; User: 287.55; System: 2.71)
```

Notice how each message includes the date and time, followed by the name of the function being executed. For example, above, `preprocess.default` comes in to perform data imputation before model training. The name `preprocess.default` signifies that it operates on objects of class `data.frame`; there is also a similar `preprocess.data.table` that works on `data.table` objects. This is an example of how `R` automatically chooses the appropriate method depending on the input's class.
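This S3 method dispatch mechanism can be illustrated with a minimal, self-contained example (unrelated to rtemis):

```r
# Define a generic and two methods; R picks the method by the class of the input
describe <- function(x) UseMethod("describe")
describe.default <- function(x) "default method"
describe.data.frame <- function(x) "data.frame method"

describe(1:3)                  # "default method"
describe(data.frame(a = 1))    # "data.frame method"
```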

`Regression was performed using Projection Pursuit Regression. Data was preprocessed by imputing missing values using missRanger. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all resamples was 0.96.`

### 11.2.4 Scenario 3: elevate + decompose

`elevate` can also decompose a dataset ahead of modeling, via the `.decompose` argument.

```
x <- rnormmat(400, 200)
w <- rnorm(200)
y <- x %*% w + rnorm(400)
dat <- data.frame(x, y)
fit <- elevate(dat, 'glm', .decompose = rtset.decompose(decom = "PCA", k = 10))
```

```
[2020-06-23 08:22:10 elevate] Hello, egenn
[[ Regression Input Summary ]]
Training features: 400 x 200
Training outcome: 400 x 1
[2020-06-23 08:22:10 d.PCA] Hello, egenn
[2020-06-23 08:22:10 d.PCA] ||| Input has dimensions 400 rows by 200 columns,
[2020-06-23 08:22:10 d.PCA] interpreted as 400 cases with 200 features.
[2020-06-23 08:22:10 d.PCA] Performing Principal Component Analysis...
[2020-06-23 08:22:10 d.PCA] Run completed in 3.4e-03 minutes (Real: 0.20; User: 0.19; System: 0.01)
[[ Regression Input Summary ]]
Training features: 400 x 10
Training outcome: 400 x 1
[2020-06-23 08:22:10 resLearn] Training Generalized Linear Model on 10 stratified subsamples...
[[ elevate GLM ]]
N repeats = 1
N resamples = 10
Resampler = strat.sub
Mean MSE of 10 resamples in each repeat = 188.62
Mean MSE reduction in each repeat = 14.97%
```

```
[2020-06-23 08:22:11 elevate] Run completed in 0.01 minutes (Real: 0.60; User: 0.52; System: 0.06)
```

`Regression was performed using Generalized Linear Model. Input was projected to 10 dimensions using Principal Component Analysis. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all resamples was 0.15.`