11 elevate: Automatic tuning & testing

  .:rtemis 0.8.0: Welcome, egenn
  [x86_64-apple-darwin17.0 (64-bit): Defaulting to 4/4 available cores]
  Documentation & vignettes: https://rtemis.netlify.com

rtemis supports a large number of algorithms for supervised learning. Individual functions to access each algorithm have names beginning with s. These functions output a single trained model and may, optionally, perform internal resampling of the training set to tune hyperparameters before training a final model on the full training set. You can get the full list of supported algorithms by running modSelect().
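
For example (a sketch: modSelect() is named above, while s.GLM is an assumed example of the s. naming convention - check modSelect()'s output for the exact names):

modSelect()            # print the full list of supported algorithms
# mod <- s.GLM(x, y)   # hypothetical: train a single GLM on the full training set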

elevate is the main supervised learning function which performs nested resampling to tune hyperparameters (inner resampling) and assess generalizability (outer resampling) using any rtemis learner. All supervised learning functions (s. functions and elevate) can accept either a feature matrix / data frame, x, and an outcome vector, y, separately, or a combined dataset x alone, in which case the last column should be the outcome.

For classification, the outcome should be a factor where the first level is the ‘positive’ case.
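
For example (a sketch in base R; the level names are illustrative):

y <- factor(c("neg", "pos", "pos", "neg"))
levels(y)                    # "neg" "pos": "neg" would be treated as the positive case
y <- relevel(y, ref = "pos")
levels(y)                    # "pos" "neg": "pos" is now the positive case
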
This vignette will walk through the analysis of example datasets using elevate.

11.1 Classification

Let’s use the sonar dataset, available in the mlbench package.

data(Sonar, package = "mlbench")
fit <- elevate(Sonar)
[2020-06-23 08:20:09 elevate] Hello, egenn 

[[ Classification Input Summary ]]
   Training features: 208 x 60 
    Training outcome: 208 x 1 

[2020-06-23 08:20:11 resLearn] Training Random Forest (ranger) on 10 stratified subsamples... 

[[ elevate RANGER ]]
   N repeats = 1 
   N resamples = 10 
   Resampler = strat.sub 
   Mean Balanced Accuracy of 10 test sets in each repeat = 0.81

[2020-06-23 08:20:13 elevate] Run completed in 0.07 minutes (Real: 4.19; User: 5.21; System: 0.25) 

By default, elevate trains a random forest (using the ranger package, which uses all available CPU threads) on 10 stratified subsamples to assess generalizability, with an 80% training - 20% testing split in each resample.
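
Equivalently, the features and outcome can be passed separately (a sketch; in Sonar, the outcome Class is the 61st and last column):

fit <- elevate(Sonar[, 1:60], Sonar$Class)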

11.1.1 Plot confusion matrix

The output of elevate is an object that includes methods for plotting. $plot() plots the confusion matrix of the aggregated test sets:

fit$plot()

fit$plot() is really an alias for fit$plotPredicted(). The confusion matrix of the aggregated training sets can be plotted using fit$plotFitted().
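
For example:

fit$plotFitted()    # confusion matrix of the aggregated training sets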

11.1.2 Plot ROC

$plotROC()

fit$plotROC()

As with fit$plot(), fit$plotROC() is an alias for fit$plotROCpredicted(); fit$plotROCfitted() is also available.

11.1.3 Plot variable importance

Finally, fit$plotVarImp() plots the variable importance of the predictors. Use the plot.top argument to limit the plot to that many top features.

fit$plotVarImp(plot.top = 20)

11.1.4 Describe

Each elevate object includes a very nifty describe function:

fit$describe()
Classification was performed using Random Forest (ranger). Model generalizability was assessed using 10 stratified subsamples. The mean Balanced Accuracy across all resamples was 0.81.

11.2 Regression

11.2.1 Create synthetic data

We create an input matrix of random numbers drawn from a normal distribution using rnormmat, and a vector of random weights.
We matrix-multiply the input matrix with the weights and add some noise to create our output.
Finally, we replace some values with NA.

x <- rnormmat(400, 20)
w <- rnorm(20)
y <- x %*% w + rnorm(400)
x[sample(length(x), 30)] <- NA

11.2.2 Scenario 1: checkData - preprocess - elevate

11.2.2.1 Step 1: Check data with checkData

The first step of every analysis should be to get some information on our data and perform some basic checks.

checkData(x)
  Dataset: x 

  [[ Summary ]]
  400 cases with 20 features: 
  * 20 continuous features 
  * 0 integer features 
  * 0 categorical features
  * 0 constant features 
  * 0 duplicated cases 
  * 15 features include 'NA' values; 30 'NA' values total
    ** Max percent missing in a feature is 1.50% (V14)
    ** Max percent missing in a case is 10% (case #163)

  [[ Recommendations ]]
  * Consider imputing missing values or use complete cases only

11.2.2.2 Step 2: Preprocess data with preprocess

x <- preprocess(x, impute = TRUE)
[2020-06-23 08:20:15 preprocess] Imputing missing values using missRanger... 

Missing value imputation by random forests

  Variables to impute:      V1, V2, V3, V4, V5, V8, V9, V11, V12, V13, V14, V17, V18, V19, V20
  Variables used to impute: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20
iter 1: ...............
[2020-06-23 08:20:18 preprocess] Done 

Check the data again:

checkData(x)
  Dataset: x 

  [[ Summary ]]
  400 cases with 20 features: 
  * 20 continuous features 
  * 0 integer features 
  * 0 categorical features
  * 0 constant features 
  * 0 duplicated cases 
  * 0 features include 'NA' values

  [[ Recommendations ]]
  * Everything looks good

11.2.2.3 Step 3: Train and test a model using 10 stratified subsamples

fit <- elevate(x, y, mod = 'mars')
[2020-06-23 08:20:18 elevate] Hello, egenn 
Warning in if (class(y) == "character") {: the condition has length > 1 and only
the first element will be used

[[ Regression Input Summary ]]
   Training features: 400 x 20 
    Training outcome: 400 x 1 

[2020-06-23 08:20:18 resLearn] Training Multivariate Adaptive Regression Splines on 10 stratified subsamples... 

[[ Parameters ]]
   pmethod: forward 
    degree: 2 
    nprune: NULL 
    ncross: 1 
     nfold: 4 
   penalty: 3 
        nk: 41 

[[ elevate MARS ]]
   N repeats = 1 
   N resamples = 10 
   Resampler = strat.sub 
   Mean MSE of 10 resamples in each repeat = 7.66
   Mean MSE reduction in each repeat =  75.60%


[2020-06-23 08:20:20 elevate] Run completed in 0.04 minutes (Real: 2.47; User: 1.90; System: 0.13) 

11.2.2.4 Step 4: Plot true vs. predicted

fit$plot()

11.2.2.5 Step 5: Describe

fit$describe()
Regression was performed using Multivariate Adaptive Regression Splines. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all resamples was 0.76.
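
Note that the mean R-squared reported here (0.76) lines up with the mean MSE reduction reported above (75.60%); assuming the reduction is measured relative to the variance of the outcome, the two describe the same quantity.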

11.2.3 Scenario 2: elevate + preprocess

elevate allows you to automatically run preprocess on a dataset by specifying the .preprocess argument.
In rtemis, arguments that add an extra step to the pipeline begin with a dot.
elevate's .preprocess accepts the same arguments as the preprocess function.
For cases like this, rtemis provides helper functions with autocomplete support, so you don't have to look up the original function's usage (in this case, preprocess).
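
You can inspect what such a helper returns (a sketch; this assumes rtset.preprocess collects the preprocess arguments into a list, which elevate then forwards to preprocess):

str(rtset.preprocess(impute = TRUE))
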
We create a wide feature set and combine x and y to show how elevate can work directly on a single data frame whose last column is the outcome. For this example, we shall use projection pursuit regression.

x <- rnormmat(400, 100, seed = 2018)
w <- rnorm(100)
y <- x %*% w + rnorm(400)
x[sample(length(x), 60)] <- NA
dat <- data.frame(x, y)
fit <- elevate(dat, mod = 'ppr', .preprocess = rtset.preprocess(impute = TRUE))
[2020-06-23 08:20:21 elevate] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 400 x 100 
    Training outcome: 400 x 1 

[2020-06-23 08:20:21 resLearn] Training Projection Pursuit Regression on 10 stratified subsamples... 

[[ elevate PPR ]]
   N repeats = 1 
   N resamples = 10 
   Resampler = strat.sub 
   Mean MSE of 10 resamples in each repeat = 4.30
   Mean MSE reduction in each repeat =  95.83%


[2020-06-23 08:22:10 elevate] Run completed in 1.80 minutes (Real: 108.23; User: 287.55; System: 2.71) 

Notice how each message includes the date and time, followed by the name of the function being executed.
For example, above, note how preprocess comes in to perform data imputation before model training.
preprocess.default signifies the method working on an object of class data.frame. There is also a similar preprocess.data.table that works on data.table objects. This is an example of how R automatically chooses the appropriate function depending on input type (S3 method dispatch).
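
Here is a minimal sketch of S3 dispatch in plain R (toy code, not rtemis internals):

greet <- function(x) UseMethod("greet")
greet.default <- function(x) cat("greet.default called\n")
greet.data.frame <- function(x) cat("greet.data.frame called\n")
greet(1:3)                  # dispatches to greet.default
greet(data.frame(a = 1))    # dispatches to greet.data.frame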

fit$describe()
Regression was performed using Projection Pursuit Regression. Data was preprocessed by imputing missing values using missRanger. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all resamples was 0.96.

11.2.4 Scenario 3: elevate + decompose

elevate can also decompose a dataset ahead of modeling, using the .decompose argument.

x <- rnormmat(400, 200)
w <- rnorm(200)
y <- x %*% w + rnorm(400)
dat <- data.frame(x, y)
fit <- elevate(dat, mod = 'glm', .decompose = rtset.decompose(decom = "PCA", k = 10))
[2020-06-23 08:22:10 elevate] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 400 x 200 
    Training outcome: 400 x 1 
[2020-06-23 08:22:10 d.PCA] Hello, egenn 
[2020-06-23 08:22:10 d.PCA] ||| Input has dimensions 400 rows by 200 columns, 
[2020-06-23 08:22:10 d.PCA]     interpreted as 400 cases with 200 features. 
[2020-06-23 08:22:10 d.PCA] Performing Principal Component Analysis... 

[2020-06-23 08:22:10 d.PCA] Run completed in 3.4e-03 minutes (Real: 0.20; User: 0.19; System: 0.01) 

[[ Regression Input Summary ]]
   Training features: 400 x 10 
    Training outcome: 400 x 1 

[2020-06-23 08:22:10 resLearn] Training Generalized Linear Model on 10 stratified subsamples... 

[[ elevate GLM ]]
   N repeats = 1 
   N resamples = 10 
   Resampler = strat.sub 
   Mean MSE of 10 resamples in each repeat = 188.62
   Mean MSE reduction in each repeat =  14.97%


[2020-06-23 08:22:11 elevate] Run completed in 0.01 minutes (Real: 0.60; User: 0.52; System: 0.06) 
fit$describe()
Regression was performed using Generalized Linear Model. Input was projected to 10 dimensions using Principal Component Analysis. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all resamples was 0.15.