XKCD on Data Science

I’ve been collecting all the XKCD comics related to Data Science and/or Statistics. Here they are; if you think I’m missing any, please let me know in the comments. Use them at will in your data visualizations, but remember to attribute. They are sorted in reverse chronological order.

Effect Size
K-Means Clustering
Methodology Trial
Euler Diagrams
Data Point
Change in Slope
Proxy Variable
Health Data
Garbage Math
Selection Bias
Spacecraft Debris Odds Ratio
Control Group
Confounding Variables
Bayes’ Theorem
Slope Hypothesis Testing
Flawed Data
Error Types
Modified Bayes’ Theorem
Curve-Fitting
Machine Learning
Linear Regression
P-Values
t Distribution
Increased Risk
Seashell
Log Scale
Cell Phones
Significant
Conditional Risk
Correlation
Boyfriend

Quick note about bootstrapping

Cross-validation, the practice of holding out a subset of the data to measure the performance of a model trained on the rest, never sounded right to me.

It just doesn’t feel optimal to withhold an arbitrary fraction of the data when you train your model. Oh, and then you’re also supposed to keep another fraction for validating the model. So: one set for training, one set for testing (to find the best model structure), and one set for validating the model, i.e. measuring its performance. That’s throwing away quite a lot of data that could otherwise be used for training.
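To make the bookkeeping concrete, here is a minimal sketch of such a three-way split using scikit-learn’s train_test_split (the 60/20/20 proportions and the toy data are my own, purely illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 observations.
X, y = np.arange(1000).reshape(-1, 1), np.arange(1000)

# First set aside 20% of the data for validation (final performance check)...
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# ...then set aside 25% of the remainder (20% of the total) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_test), len(X_val))  # 600 200 200: only 60% left for training
```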

That’s why I was excited to learn that bootstrapping provides an alternative. Bootstrapping is an elegant way to make the most of the available data, typically when you want to estimate a statistic or a confidence interval around it.
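One common variant is the percentile bootstrap for a confidence interval. A minimal sketch, with toy data and an arbitrary number of resamples:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # toy sample

# Resample the data with replacement and recompute the statistic each time.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

# Percentile bootstrap: the 2.5th and 97.5th percentiles bound a 95% interval.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: [{lo:.2f}, {hi:.2f}]")
```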

In “Applied Predictive Modeling”, the authors discuss resampling techniques, including bootstrapping and cross-validation (p. 72). They explain that bootstrap validation consists of building N models on bootstrapped data and estimating their performance on the out-of-bag samples, i.e. the samples not used to build each model.
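As I read that description, the procedure would look something like this (the toy regression data, N = 100 models, and R² as the performance index are my own illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # toy predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

n, scores = len(X), []
for _ in range(100):                               # N bootstrap models
    idx = rng.integers(0, n, size=n)               # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)          # out-of-bag: rows not drawn
    model = LinearRegression().fit(X[idx], y[idx])
    scores.append(r2_score(y[oob], model.predict(X[oob])))

print(f"mean out-of-bag R^2: {np.mean(scores):.3f}")
```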

I think that may be an error. I don’t have Efron’s seminal book on the bootstrap anymore, but I’m pretty sure the accuracy was evaluated against the entire data set, not just the out-of-bag samples.

In “Regression Modeling Strategies”, Frank Harrell describes model validation with the bootstrap thus (emphasis mine):

With the “simple bootstrap” [178, p. 247], one repeatedly fits the model in a bootstrap sample and evaluates the performance of the model *on the original sample*. The estimate of the likely performance of the final model on future data is estimated by the average of all of the indexes computed on *the original sample*.

Frank Harrell, Regression Modeling Strategies
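For contrast, here is what I understand Harrell’s “simple bootstrap” to be in code: the only change from the out-of-bag sketch above is that each bootstrap model is scored on the full original sample, and the scores are averaged (same toy data and arbitrary N as before):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # same toy data as above
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

n, scores = len(X), []
for _ in range(100):
    idx = rng.integers(0, n, size=n)               # bootstrap sample
    model = LinearRegression().fit(X[idx], y[idx])
    # "Simple bootstrap": evaluate on the ORIGINAL sample, not just out-of-bag.
    scores.append(r2_score(y, model.predict(X)))

print(f"average R^2 on the original sample: {np.mean(scores):.3f}")
```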