21 Demo: Simpson’s Paradox

  .:rtemis 0.8.0: Welcome, egenn
  [x86_64-apple-darwin17.0 (64-bit): Defaulting to 4/4 available cores]
  Documentation & vignettes: https://rtemis.netlify.com

We are given a dataset of just two variables and are asked to explore it. Now, suppose we are quite new to this. We might start by checking whether the two variables are correlated:

cor(x, y)
[1] -0.4469948

We see that they are negatively correlated. We always hear people talking about p-values, so after a quick Google search, we return to get a p-value for our correlation.

cor.test(x, y)

    Pearson's product-moment correlation

data:  x and y
t = -7.0313, df = 198, p-value = 3.242e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.5515318 -0.3286346
sample estimates:
       cor 
-0.4469948 

The correlation is highly significant!

What have we done wrong so far?
We should always begin by plotting the data, if possible.
So let's have a look, starting with each variable by itself.
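The plotting calls for these univariate plots are not shown here. They were probably produced with rtemis' mplot3.x, which we meet again below; the 'index' type name is an assumption, inferred from the 'density' type used later in this demo:

```
# Univariate index plots of each variable
# (the 'index' type argument is assumed, not from the original demo)
mplot3.x(x, 'index')
mplot3.x(y, 'index')
```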



We don't see much in the univariate plots. Or maybe we do: the first one looks like there are two separate groups of points. Let's look at the density plot of x:

mplot3.x(x, 'density')

Clearly, we want to plot them against each other.

mplot3.xy(x, y)

OK, so what was that highly significant negative correlation about?
Let's plot the data and draw the linear fit:

mplot3.xy(x, y, fit = 'lm')

Well, sure, but that looks wrong.
Maybe we can cluster the dataset and plot again, grouping by cluster:

df <- data.frame(x, y)
xy.kmeans <- u.KMEANS(df, k = 2)
[2020-06-23 09:00:42 u.KMEANS] Hello, egenn 
[2020-06-23 09:00:42 u.KMEANS] Performing K-means Clustering with k = 2... 

[2020-06-23 09:00:43 u.KMEANS] Run completed in 0.01 minutes (Real: 0.41; User: 0.27; System: 0.02) 
mplot3.xy(x, y, group = xy.kmeans$clusters.train)

We could have done all of this with a single mplot3.xy command:

mplot3.xy(x, y, cluster = "kmeans")

Ok, now let’s draw separate fit lines. If you specify a grouping variable and request a fit, mplot3.xy will draw separate lines for each group:

mplot3.xy(x, y, fit = 'lm', group = xy.kmeans$clusters.train)
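This is the crux of Simpson's Paradox: the sign of the association can flip once we condition on group. We can confirm it numerically by computing the correlation within each cluster, continuing the session above (the exact values depend on the data, which is not shown, but if the clusters capture the two groups, we expect the opposite sign from the overall correlation):

```
# Correlation within each k-means cluster
by(data.frame(x, y), xy.kmeans$clusters.train,
   function(d) cor(d$x, d$y))
```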

That’s pretty nice.
Now we suddenly get all fancy and want to color our points by their distance from the cluster centroid, playing around with the colors in mplot3.xy:

cdist <- apply(xy.kmeans$clust@cldist, 1, min)
mplot3.xy(x, y,
          marker.col = colorGrad(200, 'yelredblu')[rank(cdist)],
          point.cex = 2, marker.alpha = .5,
          theme = 'black',
          main = "Distance from cluster centroid",
          main.col = "gray70",
          box.col = 'gray50',
          tick.col = 'gray50',
          labs.col = 'gray50')

Simpson's Paradox is often encountered in the biomedical and social sciences.
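The demo's x and y are not provided, but a minimal base-R simulation reproducing the same pattern, positive correlation within each of two groups and negative correlation overall, might look like this (all parameter values here are illustrative assumptions, not the demo's actual data):

```
set.seed(2020)
n <- 100
# Two groups with positive within-group slope (+0.5),
# but group centers arranged along a negative diagonal
x1 <- rnorm(n);            y1 <- 0.5 * x1 + rnorm(n, sd = 0.5)
x2 <- rnorm(n, mean = 4);  y2 <- 0.5 * (x2 - 4) - 4 + rnorm(n, sd = 0.5)
x <- c(x1, x2); y <- c(y1, y2)

cor(x, y)                    # overall correlation: negative
c(cor(x1, y1), cor(x2, y2))  # within-group correlations: both positive
```

The between-group trend (centers at roughly (0, 0) and (4, -4)) dominates the pooled covariance, so the overall correlation is negative even though each group's correlation is positive.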