Getting into data science

A while back I had the pleasure of addressing a team of user experience researchers at YouTube, and I was asked for a few resources that could help someone already comfortable with science, math, and programming get into data science. Here’s the list I gave. These have worked for me in the past, with the caveat that I’m very partial to books.

Absolute must-reads

An Introduction to Statistical Learning 
Python Data Science Handbook

Both are freely available, outstanding books that cover a LOT of ground. The former uses R and goes somewhat deeper into theory, while the latter uses Python and is perhaps more practical, covering IPython, NumPy, and the scikit-learn ecosystem.

Great too

Learning Statistics with R

One of the clearest expositions of fundamental statistical concepts I’ve read. It’s also well written and avoids dry, lifeless prose; the author does a great job of discussing the pros and cons of each technique, and frequently gives templates for presenting the results. One of the most memorable passages was his/her (read the text to understand…) rant against the use of p-values AFTER looking at the data. Free book.

R for Data Science

Hadley Wickham’s companion book to the tidyverse. Essential reading if you’re into R and use the tidyverse. More oriented towards data manipulation and programming than actual statistical modeling. Free book.

For the brave

The Elements of Statistical Learning

The “grown-up” version of ISLR (mentioned above). Covers a lot of theoretical ground, including a great discussion of the bias-variance tradeoff so beloved of interviewers. That book taught me to stop blindly normalizing covariates before running clustering algorithms.

Regression Modeling Strategies

Harrell is to statistics what Wickham is to data manipulation: the opinionated author of some amazing R packages that do a better job than the ones provided in base R. It’s a very dry text though, and probably better read in conjunction with some explanatory blog posts. Furthermore, it can be difficult to find resources online because these packages are not as widely adopted as the tidyverse.

Summer reading

Data Science from Scratch

Joel Grus is amazing. In this book he shows how to code (and test!) many constructs used in data science, culminating in a pseudo-relational database.

Oh you think you know statistics?

Statistical Evidence
Causal Inference in Statistics: A Primer

I’m including these two books because I think reading them will make you a better statistician. The former is a short but mind-blowing read that will make you rethink every analysis you’ve ever done. The latter is the must-read text if you’re going to do any kind of causal inference.

Non-book resources

Machine Learning

Deep Learning

AI nanodegree

These are some online courses I’ve taken and can wholeheartedly recommend, especially the first one, which covers most of the concepts used in DS / ML. The Deep Learning specialization is more oriented towards neural networks, while Udacity’s AI nanodegree probably has nothing to do with DS but is a great intro to topics like building game-playing AI or path-finding algorithms.

Am I missing something? Feel free to add your own recommendations in the comments below.

The law of total probability applied to a conditional probability

Dear future self,

I’ve just lost (again) about half an hour of my life trying to find a vaguely remembered formula that generalizes the law of total probability to the case of conditional probabilities. Here it is. You’re welcome.

So what is the probability of dying from a lightning strike if you’re an American who knows this statistic?

The law of total probability says that if you can partition the sample space into disjoint events (say $B$ and $\overline{B}$), then (with the obvious generalization to more than two events):

$$\Pr(A) = \Pr(A \mid B) \Pr(B) + \Pr(A \mid \overline{B}) \Pr(\overline{B})$$

But what if you’re dealing with $\Pr(A \mid C)$ instead of just $\Pr(A)$? What’s the formula for the law of total probability in that case? What you’re searching for can be found by googling for “total law probability conditional”:

$$\Pr(A \mid C) = \Pr(A \mid B, C) \Pr(B \mid C) + \Pr(A \mid \overline{B}, C) \Pr(\overline{B} \mid C) $$
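
As a quick sanity check, here is a minimal sketch that verifies the identity numerically on a small joint distribution over three binary events. The weights are arbitrary illustrative numbers (not real data), and the helper `pr` is a hypothetical name introduced just for this example. Exact fractions are used so that the two sides can be compared with `==`.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over three binary events (A, B, C).
# The weights are arbitrary illustrative numbers, not real data.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = list(product([False, True], repeat=3))  # tuples (a, b, c)
total = sum(weights)
joint = {o: Fraction(w, total) for o, w in zip(outcomes, weights)}


def pr(event, given=lambda a, b, c: True):
    """Return Pr(event | given) by summing entries of the joint table."""
    denom = sum(p for o, p in joint.items() if given(*o))
    numer = sum(p for o, p in joint.items() if given(*o) and event(*o))
    return numer / denom


# Left-hand side: Pr(A | C)
lhs = pr(lambda a, b, c: a, given=lambda a, b, c: c)

# Right-hand side: Pr(A | B, C) Pr(B | C) + Pr(A | not B, C) Pr(not B | C)
rhs = (
    pr(lambda a, b, c: a, given=lambda a, b, c: b and c)
    * pr(lambda a, b, c: b, given=lambda a, b, c: c)
    + pr(lambda a, b, c: a, given=lambda a, b, c: (not b) and c)
    * pr(lambda a, b, c: not b, given=lambda a, b, c: c)
)

assert lhs == rhs
print(lhs, rhs)
```

Any strictly positive weights should work here; with zero weights one of the conditioning events could have probability zero, and the corresponding conditional probability would be undefined.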

There’s a great derivation here: https://math.stackexchange.com/questions/2377816/applying-law-of-total-probability-to-conditional-probability.
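
For the impatient future self, the identity can also be sketched in a few lines from the definition of conditional probability, assuming $\Pr(B, C) > 0$ and $\Pr(\overline{B}, C) > 0$ so that every conditional probability below is defined:

$$
\begin{aligned}
\Pr(A \mid C) &= \frac{\Pr(A, C)}{\Pr(C)} \\
&= \frac{\Pr(A, B, C) + \Pr(A, \overline{B}, C)}{\Pr(C)} \\
&= \frac{\Pr(A \mid B, C) \Pr(B, C)}{\Pr(C)} + \frac{\Pr(A \mid \overline{B}, C) \Pr(\overline{B}, C)}{\Pr(C)} \\
&= \Pr(A \mid B, C) \Pr(B \mid C) + \Pr(A \mid \overline{B}, C) \Pr(\overline{B} \mid C)
\end{aligned}
$$

The second line splits the event $\{A, C\}$ according to whether $B$ occurs, and the last line uses $\Pr(B, C) / \Pr(C) = \Pr(B \mid C)$.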