1. Python Data Science Handbook
By: Jake VanderPlas
Recent data shows that Python is still the leading language for data science and machine learning.
The Python Data Science Handbook is the perfect reference for boosting your Python skills.
As a data scientist you’ll be asked to work on many different tasks, but most of your time will be spent manipulating and cleaning data.
This is a perfect reference to keep close by for those frequent data manipulation tasks using Pandas.
Here are some of the other important data science topics this book covers:
- IPython Shell
- Numpy for computations
- Data manipulation with Pandas
- Data visualizations with Matplotlib
- Machine learning with Scikit-Learn
Action Step: Use the data manipulation section with Pandas to clean a messy data set.
Here’s a great place for you to find messy data to work with.
2. Think Python
By: Allen B. Downey
If you’re just starting out programming with Python, this book is for you.
If you’re a more advanced Python user… this book is also for you.
Think Python reviews everything from the basics of data structures and functions, to more advanced topics such as classes and inheritance.
Every few chapters this book ties together key concepts with case studies. This is a great way to reinforce learning new concepts.
Here’s a list of just a few of the topics covered in this book:
- Functions
- Iteration
- Data structures
- Files
- Classes
- Methods
- Inheritance
Action Step: Work through the case study in Chapter 13 on data structure selection.
Flip back and forth to the previous chapters as needed, but don’t read them end to end.
This case study is a great example of how to complete a word frequency analysis.
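If you want a feel for what a word frequency analysis looks like before you open the chapter, here is a minimal sketch. It is not the book's code: the `word_frequencies` helper and the sample sentence are made up for illustration, and it leans on `collections.Counter` from the standard library rather than building the histogram by hand.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Count how often each word appears, ignoring case and punctuation."""
    # Replace punctuation with spaces so "end." and "end" count as one word.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words)


# Hypothetical sample text; swap in the contents of any plain-text book.
sample = "The quick brown fox. The lazy dog! The end."
freqs = word_frequencies(sample)

# Print the three most common words with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Once this works on a sentence, try it on a full book and see which words dominate.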
3. R for Data Science
By: Garrett Grolemund and Hadley Wickham
If you want to make yourself marketable to employers and stay current with your data science skills, you should have a good handle on R.
R is neck and neck with Python as one of the top programming languages for data science.
A recent poll of the data science community indicated that 52.1% of respondents use R, only slightly fewer than the 52.6% who use Python.
If you want to sharpen your R skills, R for Data Science is the perfect book.
It covers the basics for new R users, such as data cleaning, but also gets into more advanced topics as well.
Data scientists can spend up to 80% of their time cleaning data, so this is a reference you will definitely want to keep close by.
This book is a great general R reference from Hadley Wickham and Garrett Grolemund, two of the top developers in the R community.
Here are some of the topics covered:
- Exploration
- Wrangling
- Programming
- Modeling
- Communication
Action Step: Use this chapter to perform an exploratory analysis.
You can explore this housing dataset and document your findings using an R Markdown notebook.
Make sure you put your project on your GitHub page and link to it from the projects section on your LinkedIn profile.
4. Advanced R
By: Hadley Wickham
If you really want to set yourself apart as an R user and impress employers, Advanced R is a great resource.
It covers everything from the foundations, including data structures, object-oriented programming, and debugging, to functional programming and performant code.
With the Rcpp package, R users can write performance-critical functions in C++ and call them from R, taking advantage of the speed of C++.
One R user was able to achieve a performance speed up of over 100X using Rcpp.
If you have advanced knowledge of R and can think about production-level code, you’ll immediately make yourself more attractive to potential employers.
Action Step: Work through the Rcpp case study on R vectorization vs C++ vectorization in the Rcpp section.
Modify the function and try some new ones.
Take your findings and write them up in an explanatory post for a portfolio project.
5. Introduction to Statistical Learning
By: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Introduction to Statistical Learning is one of the best introductory textbooks for machine learning.
It provides easy-to-understand explanations of concepts and coding examples with R.
It also covers the basics of linear models extensively.
It’s important to know these basics because these are some of the most common models asked about in data science interviews.
Linear models are also popular in business settings where model interpretability is important.
The effect that TV vs online ad spending has on sales is a perfect application of linear models for interpretability.
Some additional topics covered include:
- K-fold cross-validation
- Regularization
- Feature selection
- Polynomial regression
- Tree-based methods
- Support vector machines
- Unsupervised learning
Action Step: Use chapter 4 on Classification to implement a logistic regression model.
Use this credit card dataset to predict defaults.
This is a typical application for data scientists who work in risk management.
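To see what a logistic regression is doing under the hood, here is a rough sketch in plain Python. The book's own examples are in R, so treat this as an illustration only: the `fit_logistic` helper, the learning rate, and the tiny balance/default dataset are all hypothetical, and the fit uses simple batch gradient descent rather than whatever solver your library of choice uses.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to b0 and b1.
        g0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)) / n
        g1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys)) / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1


# Hypothetical toy data: card balance (in thousands) and whether the customer defaulted.
balances = [0.2, 0.5, 0.8, 1.1, 1.4, 1.7, 2.0, 2.3]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(balances, defaulted)

# Predicted default probability for a low and a high balance.
p_low = sigmoid(b0 + b1 * 0.2)
p_high = sigmoid(b0 + b1 * 2.3)
print(round(p_low, 3), round(p_high, 3))
```

On the real dataset you will have many features, so use a library implementation and compare its coefficients against your intuition from this one-feature version.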
6. The Elements of Statistical Learning
By: Trevor Hastie, Robert Tibshirani, Jerome Friedman
If you want to accelerate your machine learning career, you need a strong grasp of both the fundamentals and advanced topics.
The Elements of Statistical Learning is the perfect resource for bringing your machine learning skills to the next level.
This is one of the most comprehensive books on machine learning.
This book reviews everything from linear methods to neural nets, boosting, and random forests.
It’s a bit more mathy than other books, which is great for gaining a deeper understanding of the topics.
Don’t try to absorb the entire book at once though. Instead, take it in small chunks.
Pick a topic in a chapter, and build a small project (don’t spend more than 8 – 10 hours).
Action Step: Read Section 3.4.3 and understand the difference between Ridge Regression and the Lasso.
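Use this housing dataset to predict housing prices. Start with the Scikit-Learn implementation of linear regression using all of the features, then fit Ridge Regression and the Lasso. Ridge shrinks the coefficients but keeps every feature, while the Lasso can set some coefficients to exactly zero, so use the Lasso to pick out the most important features.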
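The difference between the two penalties is easiest to see in the special case the book tabulates, where the predictors are orthonormal. Under that assumption each least squares coefficient is adjusted on its own: Ridge divides it by 1 + lambda, and the Lasso soft-thresholds it by lambda. The sketch below applies both rules to a made-up set of coefficients; the function names and numbers are illustrative, and real data will not have orthonormal predictors.

```python
def ridge_shrink(beta, lam):
    """Ridge estimate for one coefficient when predictors are orthonormal."""
    return beta / (1.0 + lam)


def lasso_shrink(beta, lam):
    """Lasso estimate (soft-thresholding) when predictors are orthonormal."""
    magnitude = max(abs(beta) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if beta > 0 else -magnitude


# Hypothetical least squares coefficients and penalty strength.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print("ridge:", [round(b, 3) for b in ridge])
print("lasso:", lasso)
```

Every Ridge coefficient stays nonzero, while the Lasso drops the two small ones entirely. That is the behavior to look for when you run both models on the housing data.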
7. Understanding Machine Learning: From Theory to Algorithms
By: Shai Shalev-Shwartz and Shai Ben-David
If you want a deeper understanding of machine learning algorithms, this is a great book.
It’s split into the following sections of increasing complexity:
- Foundations
- From theory to algorithms
- Additional learning models
- Advanced theory
A great way to gain a deep, lasting understanding of machine learning topics is to implement them from scratch.
This is the perfect reference for implementing algorithms yourself.
If you haven’t used a machine learning model before, I don’t recommend implementing it from scratch right away.
Start by using scikit-learn or one of R’s libraries, and then after you’ve got a handle on it, try writing it yourself from scratch. This book provides extensive theory on the algorithms to help you.
Action Step: Read through chapter 18.2 on the decision tree algorithm, then follow along with this decision tree tutorial to write your own from scratch.
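The heart of a from-scratch decision tree is choosing where to split. Here is a minimal sketch of that one step, using information gain as the gain measure. The helper names and the four-row dataset are hypothetical, and a full tree would apply `best_split` recursively to each side until a stopping rule kicks in.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature, threshold):
    """Entropy reduction from splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    if not left or not right:
        return 0.0  # A split that leaves one side empty tells us nothing.
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted


def best_split(rows, labels):
    """Try every feature and observed value; return (gain, feature, threshold)."""
    best = (0.0, None, None)
    for feature in range(len(rows[0])):
        for threshold in sorted({x[feature] for x in rows}):
            gain = information_gain(rows, labels, feature, threshold)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best


# Hypothetical data: two numeric features per row and a binary label.
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]

gain, feature, threshold = best_split(rows, labels)
print(gain, feature, threshold)
```

Try swapping entropy for the Gini index and check whether the chosen split changes.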
8. Mining of Massive Datasets
By: Jure Leskovec, Anand Rajaraman, Jeff Ullman
This is a great book developed from various Stanford courses on large-scale data mining and network analysis.
The focus is on mining very large datasets.
This is important for implementing production-level models at scale.
Large companies like Google receive hundreds of millions (or more) search queries per day, so they are especially interested in mining very large datasets.
Some topics covered in this book include:
- MapReduce
- Mining data streams
- Link analysis
- Recommendation systems
- Mining social-network graphs
- Dimensionality reduction
- Large-scale machine learning
Action Step: Read through chapter 5 on Link Analysis.
There’s a great example of how Google uses the PageRank algorithm to assign a real number to a page to determine how “important” it is.
Complete exercise 5.1.1 to determine the PageRank of each page in the simplified internet model in Figure 5.7.
Use Python and Numpy to complete this exercise. Don’t forget to write it up as a portfolio project.
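Before you tackle the exercise, it helps to see the basic iteration on a tiny graph. The sketch below is plain Python rather than NumPy, and the three-page `links` graph is made up (it is not the graph from the book's figure). It assumes every page has at least one outgoing link and applies no taxation, so each page simply splits its current rank evenly among the pages it links to.

```python
def pagerank(links, iterations=100):
    """Power iteration for PageRank with no taxation and no dead ends."""
    pages = sorted(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, outlinks in links.items():
            # Each page passes an equal share of its rank to every page it links to.
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

For this graph the ranks settle at 0.4 for A and C and 0.2 for B, which you can confirm by hand from the link structure. Rewriting the loop as a matrix-vector product is a good first step toward the NumPy version.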
9. Deep Learning
By: Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Deep learning is one of the hottest fields in machine learning.
Companies like Google, Facebook, and Amazon need highly skilled professionals with expertise in deep learning.
What is it that makes deep learning so powerful?
It automates one of the most difficult parts of machine learning, feature discovery.
Rather than spending hours of time manually engineering new features in creative ways, deep learning automates the process.
If you’re new to deep learning, this book is a must.
Even advanced deep learning practitioners will benefit from it.
This book is presented in an easy-to-read slide format with lots of bullets and pictures.
Here are some of the topics covered:
- Intro and explanation of the importance of deep learning
- Algorithms – backpropagation, convnets, recurrent neural nets
- Unsupervised deep learning
- Attention mechanisms
Action Step: Read through the section on algorithms and then use Python’s Theano library to classify MNIST digits using a multilayer perceptron.
10. Think Stats
By: Allen B. Downey
As a data scientist, it’s important that you have a solid grasp on probability and statistics.
Machine learning models are rooted in the fundamentals of probability theory.
You’ll frequently be asked basic probability and stats questions during interviews, so it doesn’t hurt to refresh yourself from time to time.
This book is geared towards programmers, so it takes a more applied approach than conventional textbooks that focus on the math and theory.
Sections are short and easy to read, so you’ll be able to quickly work through examples.
Some of the topics covered include:
- Descriptive statistics
- Cumulative distribution functions
- Continuous distributions
- Probability
- Operations on distributions
- Hypothesis testing
- Estimation
- Correlation
Action Step: Read through chapter 7 on hypothesis testing. This chapter provides a good comparison between classical hypothesis testing and Bayesian hypothesis testing.
Work through exercise 7.3 to determine the posterior probability that the distribution of birth weights is different for first babies and others.
You’ll be working with data from the National Survey of Family Growth (NSFG).
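For the classical side of that comparison, a permutation test on a difference in means is a good warm-up. This is only a sketch: the `permutation_p_value` helper is hypothetical, the numbers are invented rather than taken from the NSFG, and the seed is fixed just to make the run repeatable.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, iterations=1000, seed=17):
    """Estimate how often shuffled groups differ in mean at least as much as observed."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(iterations):
        # Shuffle the pooled data and split it back into two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            count += 1
    return count / iterations


# Hypothetical measurements for two groups.
group_a = list(range(1, 11))
group_b = list(range(11, 21))

p_different = permutation_p_value(group_a, group_b)
p_same = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print(p_different, p_same)
```

A small p-value says the observed gap would be rare if the group labels were meaningless. Keep that interpretation in mind when you compare it with the posterior probability the exercise asks for.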
11. Bayesian Methods for Hackers
By: Cam Davidson-Pilon
This is a Bayesian statistics textbook that takes an “understanding first, mathematics second” point of view.
Bayesian inference is an important topic in machine learning that takes a different approach than classic inferential statistics.
The Bayesian approach allows us to make inferences about things based on what we already know.
We can never be certain about an outcome, but with some prior knowledge, we can establish some confidence about an outcome.
In a real-world setting, Bayesian statistics is applied to classification problems such as email filtering (“spam” or “not spam”) and article classification (“technology”, “sports”, or “politics”).
This is an easy-to-read book, with frequent examples in Python code. The book has a conversational tone, which keeps things interesting.
Some topics include:
- Bayesian methods
- Modeling Bayesian problems using Python
- Markov Chain Monte Carlo
- The law of large numbers
- Loss functions
- Choosing appropriate prior distributions
Action Step: Read through the example in Chapter 2 on Bayesian A/B testing. This is a great example of a real-world application.
A/B testing is especially popular in online marketing (“does version A of a website get more sales than version B of the website?”).
Code this yourself in Python, and play around with the number of trials, N, to see how the posterior distribution changes.
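As a starting point, here is one way to sketch a Bayesian A/B test with nothing but the standard library. It is not the book's code: instead of running a sampler, it assumes a uniform prior on each conversion rate, which makes the posterior a Beta distribution you can draw from directly with `random.betavariate`. The function name and the conversion counts are made up.

```python
import random


def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_A > rate_B) from Beta posteriors under uniform priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior for a uniform prior.
        p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if p_a > p_b:
            wins += 1
    return wins / draws


# Hypothetical experiment: A converts 10% of visitors, B converts 8%,
# observed with an increasing number of trials N per version.
results = {}
for n in (50, 500, 5000):
    results[n] = prob_a_beats_b(n // 10, n, n * 8 // 100, n)
    print(n, results[n])
```

With the same observed rates, the probability that A is better climbs toward 1 as N grows, because both posteriors get narrower. That is the effect to watch for when you vary N in your own version.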
12. Think Bayes – Bayesian Statistics Made Simple
By: Allen B. Downey
Another great resource from Allen Downey and Green Tea Press.
This book takes a logical approach to solving problems.
The author uses numerous examples to show you the types of decisions you’ll need to make when modeling real-world problems.
Here are some of the topics included in this book:
- Bayes’s Theorem
- Computational statistics
- Decision analysis
- Observer bias
- Hypothesis testing
- Dealing with dimensions