Learning materials for Career Transition to Data Science

A career transition to data science is no easy feat. Here is a list of learning materials (books, documentation, Coursera courses, etc.) that I use to build my data science skills, with a minimalist approach (i.e., I avoid reading an entire book if a few online courses will do, although sometimes reading a book is inevitable). I will keep updating this page.

Python

Basic Python syntax: I took a class at my school. Later on, I used The Quick Python Book by Naomi Ceder as a reference.

Introduction to Data Science in Python: My first exposure to data science with Python. I had to work through the course twice, which is why I say data science can be quite disorienting even for people who have programmed before. Overall, I think it’s good. It shows what data science in Python looks like, not just how to program in the Python language.

Pandas user guide, NumPy user guide: As a beginner, I did “google-oriented” and “stack overflow-oriented” programming. However, at some point, reading through the documentation becomes worthwhile. Both guides can be skimmed in a few days. Recommended if you have already programmed for a while.

Machine Learning

As a first step, if you are familiar with Python and pandas, I recommend completing a basic machine learning project (data cleaning, train-test splitting, model building, and performance assessment). I did that in the Text Mining class I took at my school. You can find tons of such examples on the web. Reading anything at this point only delays your hands-on experience.
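To make those four steps concrete, here is a minimal sketch of what such a first project might look like. The dataset (scikit-learn's built-in iris data), the artificially injected missing values, and the choice of logistic regression are all my own assumptions for illustration; this is not the project from the Text Mining class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset as a pandas DataFrame
# (four feature columns plus a "target" column).
df = load_iris(as_frame=True).frame

# Assumption for illustration: blank out a few values so there is
# something to clean. Real data usually arrives messy on its own.
df.loc[[3, 50, 120], "sepal length (cm)"] = np.nan

# 1. Data cleaning: drop rows that contain missing values.
clean = df.dropna()

# 2. Train-test splitting: hold out 25% of the rows for evaluation.
X = clean.drop(columns="target")
y = clean["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 3. Model building: fit a simple baseline classifier on the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Performance assessment: score the model on data it has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Rows after cleaning: {len(clean)}")
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in your own dataset and a different model keeps the same four-step skeleton, which is the point of doing one small project end to end before reading more.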

Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson, and Michael Munn. After some informal exposure to ML, I highly recommend this book. It makes sure you understand the methodology and gives you a mental framework of what good machine learning practice looks like. Basically, many materials tell you how to do ML; this book tells you why.

R

I used Hadley Wickham and Garrett Grolemund’s R for Data Science as my beginner material.

Statistics

Think Stats: A free online book that teaches statistics using Python.

Environment/package management

For the reproducibility and longevity of your project! This is something I found super useful but that no one teaches. For Python, I studied it using the Carpentries lesson Introduction to Conda for (Data) Scientists. It takes a day or two to go through, but it is definitely worth it.

For R, a recent development is renv, an environment management package for R. Also, read this page to get a big-picture idea of reproducibility and collaboration in R.

Natural Language Processing

(to do)

Version Control

I recommend the series of lessons developed by CodeRefinery.

Time Series Analysis

(to do)