18 Regression with Splines

library(rtemis)
  .:rtemis 0.8.0: Welcome, egenn
  [x86_64-apple-darwin17.0 (64-bit): Defaulting to 4/4 available cores]
  Documentation & vignettes: https://rtemis.netlify.com
library(splines)
library(splines2)

18.1 Synthetic data

Let’s create some synthetic data:

set.seed(2018)
x <- rnorm(500)
y <- x ^ 3 + 4 + rnorm(500)
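The noise term is standard normal, so its variance of about 1 sets a floor on the MSE that any model can reach on this data. A minimal base R sketch (the names `noise` and `noise_mse` are illustrative and not part of rtemis):

```r
set.seed(2018)
x <- rnorm(500)
noise <- rnorm(500)            # standard normal noise, variance 1
y <- x^3 + 4 + noise

# MSE of the true function x^3 + 4 is simply the mean squared noise
noise_mse <- mean((y - (x^3 + 4))^2)
noise_mse
```

A training MSE close to this value means a model has recovered essentially all of the systematic signal.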

18.2 GLM

Let’s regress y on x:

mod.glm <- s.GLM(x, y)
[2020-06-23 08:44:42 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 1 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2020-06-23 08:44:44 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 9.27 (51.99%)
   RMSE = 3.05 (30.71%)
    MAE = 1.80 (13.98%)
      r = 0.72 (p = 2.3e-81)
    rho = 0.70 (p = 0.00)
   R sq = 0.52

[2020-06-23 08:44:44 s.GLM] Run completed in 0.04 minutes (Real: 2.61; User: 1.25; System: 0.13) 

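As expected, this is a bad fit: a straight line cannot capture the cubic relationship between x and y, and the model explains only about half of the variance (R sq = 0.52).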
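To make the training summary concrete, the same kind of metrics can be recomputed by hand from an ordinary least-squares fit. This is a base R sketch, not rtemis internals. It assumes the percentages in parentheses are reductions relative to always predicting the mean of y, which is consistent with the MSE reduction matching R sq in the output above.

```r
set.seed(2018)
x <- rnorm(500)
y <- x^3 + 4 + rnorm(500)

fit <- lm(y ~ x)                  # straight-line fit
res <- y - fitted(fit)

mse  <- mean(res^2)
rmse <- sqrt(mse)
mae  <- mean(abs(res))

# Baseline model: always predict mean(y)
mse0 <- mean((y - mean(y))^2)
mse_reduction <- 1 - mse / mse0   # equals R squared for OLS with an intercept

c(MSE = mse, RMSE = rmse, MAE = mae, MSE.reduction = mse_reduction)
```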

18.3 B-splines

Let’s build B-splines for x and their first derivatives and plot them against x:

x.bs <- bSpline(x, 3)
dx.bs <- deriv(x.bs)
mplot3.xy(x, x.bs, type = 'l', lwd = 3)

mplot3.xy(x, dx.bs, type = 'l', lwd = 3)
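If mplot3.xy is not at hand, base graphics can draw comparable curves. The sketch below assumes only that splines2 is installed. It evaluates the basis on sorted x, because matplot connects points in row order.

```r
library(splines2)

set.seed(2018)
x <- rnorm(500)

xs <- sort(x)               # line plots need x in increasing order
b  <- bSpline(xs, df = 3)   # three cubic B-spline basis columns
db <- deriv(b)              # first derivative of each basis column

matplot(xs, b, type = "l", lty = 1, lwd = 3,
        xlab = "x", ylab = "B-spline basis")
matplot(xs, db, type = "l", lty = 1, lwd = 3,
        xlab = "x", ylab = "First derivative")
```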

Now let’s regress y on the B-splines we built from x:

mod.glm.bs <- s.GLM(x.bs, y)
[2020-06-23 08:44:45 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 3 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2020-06-23 08:44:45 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 1.06 (94.49%)
   RMSE = 1.03 (76.53%)
    MAE = 0.82 (60.91%)
      r = 0.97 (p = 1.2e-315)
    rho = 0.70 (p = 0.00)
   R sq = 0.94

[2020-06-23 08:44:46 s.GLM] Run completed in 0.01 minutes (Real: 0.43; User: 0.12; System: 0.02) 

We get a much better fit by regressing y on the B-splines of x: MSE drops from 9.27 to 1.06 and R sq rises from 0.52 to 0.94.
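One way to see why the improvement is so large: with 3 degrees of freedom and the default cubic degree, bSpline() places no interior knots, so the basis together with an intercept spans exactly the cubic polynomials, which include the true function x^3 + 4. The sketch below uses lm() in place of s.GLM (an assumption made to keep it self-contained) and checks that the B-spline fit coincides with an ordinary cubic polynomial fit.

```r
library(splines2)

set.seed(2018)
x <- rnorm(500)
y <- x^3 + 4 + rnorm(500)

fit.bs   <- lm(y ~ bSpline(x, df = 3))   # cubic B-splines, no interior knots
fit.poly <- lm(y ~ poly(x, 3))           # orthogonal cubic polynomial

# Same column space, so the fitted values should agree
all.equal(unname(fitted(fit.bs)), unname(fitted(fit.poly)))

# Training MSE of the spline fit
mean(residuals(fit.bs)^2)
```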

18.4 C-splines

Let’s build C-splines for x and their first derivatives and plot them against x:

x.cs <- cSpline(x, 3)
dx.cs <- deriv(x.cs)
mplot3.xy(x, x.cs, type = 'l', lwd = 3)

mplot3.xy(x, dx.cs, type = 'l', lwd = 3)

Now let’s regress y on the C-splines we built from x:

mod.glm.cs <- s.GLM(x.cs, y)
[2020-06-23 08:44:47 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 3 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2020-06-23 08:44:47 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 6.38 (66.97%)
   RMSE = 2.53 (42.53%)
    MAE = 1.34 (35.63%)
      r = 0.82 (p = 7.2e-122)
    rho = 0.70 (p = 0.00)
   R sq = 0.67

[2020-06-23 08:44:47 s.GLM] Run completed in 4.2e-03 minutes (Real: 0.25; User: 0.11; System: 0.02) 
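C-splines are designed for shape-restricted regression: each basis function is nondecreasing and convex. The three columns therefore span a different function space from the cubic B-spline basis, and the weaker fit above suggests that this space does not approximate x^3 as well. The sketch checks the shape properties numerically with the deriv() method from splines2. Two things here are assumptions: `intercept = FALSE` is set explicitly to match the three columns reported in the input summary, and the tolerance `tol` is an arbitrary choice.

```r
library(splines2)

set.seed(2018)
x <- rnorm(500)

x.cs <- cSpline(x, df = 3, intercept = FALSE)
d1 <- deriv(x.cs)               # first derivatives of the basis columns
d2 <- deriv(x.cs, derivs = 2)   # second derivatives

tol <- 1e-8                     # allow for floating-point error
c(nondecreasing = all(d1 >= -tol), convex = all(d2 >= -tol))
```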

18.5 I-splines

Let’s build I-splines for x and their first derivatives and plot them against x:

x.is <- iSpline(x, 3)
dx.is <- deriv(x.is)
mplot3.xy(x, x.is, type = 'l', lwd = 3)

mplot3.xy(x, dx.is, type = 'l', lwd = 3)

Now let’s regress y on the I-splines we built from x:

mod.glm.is <- s.GLM(x.is, y)
[2020-06-23 08:44:48 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 3 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2020-06-23 08:44:48 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 3.74 (80.62%)
   RMSE = 1.93 (55.98%)
    MAE = 1.13 (45.94%)
      r = 0.90 (p = 1.4e-179)
    rho = 0.68 (p = 0.00)
   R sq = 0.81

[2020-06-23 08:44:49 s.GLM] Run completed in 0.01 minutes (Real: 0.42; User: 0.13; System: 0.02) 
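I-splines are integrals of M-splines, so each basis column rises monotonically from 0 towards 1 as x increases. That makes them a natural choice for monotone regression when the coefficients are constrained to be nonnegative. A minimal check, assuming splines2 is installed, with `intercept = FALSE` set explicitly to match the three columns in the input summary:

```r
library(splines2)

set.seed(2018)
x <- rnorm(500)

xs   <- sort(x)
x.is <- iSpline(xs, df = 3, intercept = FALSE)
B    <- unclass(x.is)           # plain numeric matrix for the checks below

tol <- 1e-8                     # allow for floating-point error
monotone <- all(diff(B) >= -tol)            # column-wise successive differences
bounded  <- all(B >= -tol & B <= 1 + tol)   # values stay within [0, 1]
c(monotone = monotone, bounded = bounded)
```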

18.6 M-splines

Let’s build M-splines for x and their first derivatives and plot them against x:

x.ms <- mSpline(x, 3)
dx.ms <- deriv(x.ms)
mplot3.xy(x, x.ms, type = 'l', lwd = 3)

mplot3.xy(x, dx.ms, type = 'l', lwd = 3)

Now let’s regress y on the M-splines we built from x:

mod.glm.ms <- s.GLM(x.ms, y)
[2020-06-23 08:44:50 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 3 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2020-06-23 08:44:50 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 1.06 (94.49%)
   RMSE = 1.03 (76.53%)
    MAE = 0.82 (60.91%)
      r = 0.97 (p = 1.2e-315)
    rho = 0.70 (p = 0.00)
   R sq = 0.94

[2020-06-23 08:44:50 s.GLM] Run completed in 0.01 minutes (Real: 0.42; User: 0.12; System: 0.02) 
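The M-spline summary repeats the B-spline numbers (MSE = 1.06, R sq = 0.94). That is expected, because M-spline basis functions are rescaled B-spline basis functions, so both bases span the same column space and a linear model gives the same fitted values. The sketch below verifies this with lm() standing in for s.GLM (an assumption made for self-containment).

```r
library(splines2)

set.seed(2018)
x <- rnorm(500)
y <- x^3 + 4 + rnorm(500)

fit.bs <- lm(y ~ bSpline(x, df = 3))
fit.ms <- lm(y ~ mSpline(x, df = 3))

# Coefficients differ by the rescaling, but the fitted values should not
all.equal(unname(fitted(fit.bs)), unname(fitted(fit.ms)))
```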

18.7 Natural cubic splines

Let’s build natural cubic splines for x and plot them against x:

x.ns <- ns(x, 3)
mplot3.xy(x, x.ns, type = 'l', lwd = 3)

Now let’s regress y on the natural cubic splines we built from x:

mod.glm.ns <- s.GLM(x.ns, y)
[2020-06-23 08:44:51 s.GLM] Hello, egenn 

[[ Regression Input Summary ]]
   Training features: 500 x 3 
    Training outcome: 500 x 1 
    Testing features: Not available
     Testing outcome: Not available

[2020-06-23 08:44:51 s.GLM] Training GLM... 

[[ GLM Regression Training Summary ]]
    MSE = 2.15 (88.86%)
   RMSE = 1.47 (66.63%)
    MAE = 1.07 (48.79%)
      r = 0.94 (p = 1.6e-239)
    rho = 0.58 (p = 0.00)
   R sq = 0.89

[2020-06-23 08:44:52 s.GLM] Run completed in 4.1e-03 minutes (Real: 0.25; User: 0.11; System: 0.02)
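Natural cubic splines are cubic between the boundary knots and constrained to be linear beyond them. With df = 3, ns() places two interior knots at quantiles of x. The linear tails may help explain why this fit trails the B-spline fit on a cubic target, though that reading is an interpretation rather than something the output states. The sketch inspects the knots and checks the linear extrapolation; `newx` and `second.diff` are illustrative names.

```r
library(splines)

set.seed(2018)
x <- rnorm(500)

x.ns <- ns(x, df = 3)
attr(x.ns, "knots")            # interior knots, taken from quantiles of x
attr(x.ns, "Boundary.knots")   # defaults to the range of x

# Evaluate the basis at equally spaced points beyond the upper boundary knot
newx  <- max(x) + c(1, 2, 3)
b.out <- unclass(predict(x.ns, newx))

# For a linear function, second differences on an equally spaced grid vanish
second.diff <- b.out[1, ] - 2 * b.out[2, ] + b.out[3, ]
max(abs(second.diff))
```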