A while back I had the pleasure of addressing a team of user experience researchers at YouTube, and I was asked for a few resources that could help someone pretty good at science, math, and programming who wanted to get into data science. Here’s the list I gave. These have worked for me in the past, with the caveat that I’m very partial to books.
Absolute must-reads
Both are freely available, outstanding books that cover a LOT of ground. The former uses R and goes somewhat deeper into theory, while the latter uses Python and is perhaps more practical, covering IPython, NumPy, and the scikit-learn ecosystem.
Great too
One of the clearest expositions of fundamental statistical concepts I’ve read. It’s also well written and avoids dry, lifeless prose; the author does a great job of discussing the pros and cons of each technique, and frequently gives templates for presenting the results. One of the most memorable passages was his/her (read the text to understand…) rant against the use of p-values AFTER looking at the data. Free book.
Hadley Wickham’s companion book to the tidyverse. Essential reading if you’re into R and use the tidyverse. More oriented towards data manipulation and programming than actual statistical modeling. Free book.
For the brave
The “grown-up” version of ISLR (mentioned above). Covers a lot of theoretical ground, including a great discussion of the bias-variance tradeoff so beloved of interviewers. That book taught me to stop blindly normalizing covariates before running clustering algorithms.
Harrell is to statistics what Wickham is to data manipulation: the opinionated author of some amazing R packages that do a better job than the ones provided in base R. It’s a very dry text though, and probably better read in conjunction with some explanatory blog posts. Furthermore, it can be difficult to find resources online because these packages are not as widely adopted as the tidyverse.
Summer reading
Joel Grus is amazing. In this book he shows how to code (and test!) many constructs used in Data Science, culminating with a pseudo-relational database.
Oh you think you know statistics?
I’m including these two books because I think reading them will make you a better statistician. The former is a short but mind-blowing read that will make you rethink every analysis you’ve ever done. The latter is the must-read text if you’re going to do any kind of causal inference.
Non-book resources
These are some online courses I’ve taken and can wholeheartedly recommend, especially the first one, which covers most of the concepts used in DS / ML. The Deep Learning specialization is more oriented towards neural networks, while Udacity’s AI nanodegree probably has nothing to do with DS but is a great intro to topics like building game-playing AI or path-finding algorithms.
Am I missing something? Feel free to add your own recommendations in the comments below.
4 thoughts on “Getting into data science”
Great list! I would also add:
– Practical Statistics for Data Scientists (as an introduction to a lot of different statistical methodologies)
– Data Science for Business (it does a good job of explaining how to think about business problems with data; definitely recommended for people who work with DS)
As fate would have it, I recently asked my manager for recommended resources on how to better serve the business as a data scientist. Your recommendation of Data Science for Business is much appreciated; I borrowed it yesterday.
This list is excellent! I’ll have to tell my students about this!
Glad to hear you liked it! Let me know if there’s any particular kind of content you’d like me to write more about.