How hard is it? What does it take? Where do I start?
In this blog I’ll summarize the 3 hardest challenges I faced doing my first Data Science project in this Kaggle Notebook:
- You know nothing
- Data preparation is critical and time-consuming
- Interpret your results
Opinions are my own.
Before getting into any details, there is one essential step that people tend to gloss over in their explanations, or that is simply buried in their small code snippets. To use any of the advanced libraries, you have to import them into your workspace. It is best to collect all imports at the top of your notebook.
My example below:
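A minimal sketch of such an import cell, assuming the usual stack for a project like this (pandas, NumPy, Matplotlib, scikit-learn and SciPy). The exact list in the notebook may differ:

```python
# Data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt

# Modelling and preprocessing (scikit-learn)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Statistics helpers such as the Box-Cox transformation
from scipy import stats
```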
You know nothing
For my first Data Science project, I created a short blog on starting an Airbnb in Amsterdam. I only used basic data analysis methods and regression models.
Regression models are probably the most basic parts of data science. A typical linear regression function will look like this:
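A minimal sketch of what such a function can look like, assuming scikit-learn and a DataFrame that holds only numeric feature columns. The function name `fit_linear_regression` and its parameters are illustrative, not taken from the notebook:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_linear_regression(df, feature_cols, target_col, test_size=0.3, random_state=42):
    """Fit a linear regression and return the model plus basic scores."""
    X = df[feature_cols]
    y = df[target_col]

    # Hold back part of the data so the model is scored on rows it has not seen.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    model = LinearRegression()
    model.fit(X_train, y_train)

    # Scoring both splits makes over-fitting visible as a gap between the two R² values.
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return model, train_r2, test_r2, test_mse
```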
This will return several outputs that you will then use to evaluate your model performance.
There are more than 40 techniques used by data scientists. This means I only used 2.5% of all the models out there.
I’ll be generous. Given my statistics course at university 8 years ago, going into this project I knew about 10% of that regression model already. That means I knew 0.25% of the entire body of knowledge that I know is out there. Then add a very large number of things I don’t know I don’t know.
My knowledge universe in Data Science looks something like this:
Image by Author
As if that weren’t bad enough, you will find articles like these, describing exactly all of your shortcomings.
This project took me about 4 weeks, and let’s say that’s a pretty average rate of learning new data science models. At that rate it will take me about 4 / 0.25% = 1,600 weeks to learn all the models I have heard of so far, and probably another 5 times that to learn (probably not even close to) everything in the data science field.
Between roughly 30 and 150 years of learning ahead.
Me after my learning:
Data preparation is critical and time-consuming
I’ve worked with data for over 5 years at Google.
Too bad all my previous experience is in SQL and Data Scientists are big fans of Pandas. They’re just such animal lovers.
The challenge here is two-fold: 1) knowing what to do, 2) knowing how to do it.
Even with the help described below, the data preparation part takes about 80% of your time or more.
Knowing what to do
The ways to manipulate your data to be ready for ingestion in your models are endless. It goes into the deep underbelly of Statistics and you will need to understand this thoroughly if you want to be a great Data Scientist.
Be prepared to run through these steps many times. I’ll give a couple of examples that have worked for me for each step.
Clean data quality issues
Sample size permitting, you should probably get rid of any NaN values in your data. They cannot be ingested by the regression model. To find the ratio of NaN values per column, use this:
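A minimal sketch with pandas, using a small made-up DataFrame in place of the real listings data:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the listings DataFrame (values are made up).
df = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "beds": [1.0, 2.0, np.nan, np.nan],
})

# isnull() marks missing cells as True; the mean of a boolean column
# is the share of True values, i.e. the ratio of NaN values per column.
nan_ratio = df.isnull().mean().sort_values(ascending=False)
print(nan_ratio)
```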
To drop all rows with NaN values use this:
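A minimal sketch with pandas, again on made-up data. `dropna` also accepts a `subset` argument if only some columns should be checked:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the listings DataFrame (values are made up).
df = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "beds": [1.0, 2.0, np.nan, np.nan],
})

# Drop every row that has a NaN in any column, then rebuild a clean 0..n-1 index.
df_clean = df.dropna(axis=0, how="any").reset_index(drop=True)
```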
Another data quality problem I ran into was having True and False data as strings instead of a Boolean data type. I solved it using this distutils.util.strtobool function:
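Note that distutils was deprecated in Python 3.10 and removed in Python 3.12, so the sketch below uses a small stand-in helper instead. The helper name `str_to_bool` and the column name `instant_bookable` are illustrative assumptions. The helper accepts the same strings that strtobool did:

```python
import pandas as pd

# The strings distutils.util.strtobool treated as true and false.
TRUE_VALUES = {"y", "yes", "t", "true", "on", "1"}
FALSE_VALUES = {"n", "no", "f", "false", "off", "0"}


def str_to_bool(value):
    """Convert a truthy/falsy string to a real bool, raising on anything else."""
    text = str(value).lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"invalid truth value {value!r}")


# Toy column with booleans stored as strings (illustrative column name).
df = pd.DataFrame({"instant_bookable": ["t", "f", "True", "false"]})
df["instant_bookable"] = df["instant_bookable"].apply(lambda x: str_to_bool(x))
```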
Please don’t assume I actually knew how to use Lambda functions before starting this project.
It took me a lot of reading to understand them a little bit. I really liked this article on “What are lambda functions in python and why you should start using them right now”.
Finally, my price data was a string with $ signs and commas. I couldn’t find an elegant way to code the solution, so I botched my way into this:
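One possible way to do this, not necessarily the one used in the notebook, is to strip the characters with a regular expression and then cast the column to float. The sample values are made up:

```python
import pandas as pd

# Prices stored as strings with a dollar sign and thousands separators.
df = pd.DataFrame({"price": ["$1,250.00", "$85.00", "$110.00"]})

# Remove every "$" and "," and convert what is left to a number.
df["price"] = (
    df["price"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)
```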
First check whether there are any outliers at all; a boxplot can be very helpful:
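A minimal sketch with Matplotlib on made-up prices. It also lists the rows beyond 1.5 times the interquartile range, which is the default rule Matplotlib uses to draw points outside the whiskers:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy prices with one obvious outlier (values are made up).
df = pd.DataFrame({"price": [80, 95, 100, 110, 120, 125, 130, 150, 900]})

# In a notebook the figure renders inline below the cell.
fig, ax = plt.subplots(figsize=(3, 5))
ax.boxplot(df["price"].to_numpy())
ax.set_ylabel("price")

# The same 1.5 * IQR rule, applied numerically to list the suspicious rows.
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(outliers)
```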
Get fancy and use a modified z value (answer from StackOverflow) to cut your outliers:
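A minimal sketch of the modified z-score, assuming the common formulation based on the median and the median absolute deviation (MAD) with a cut-off of 3.5. It may differ in detail from the StackOverflow answer, and the function name is illustrative:

```python
import numpy as np
import pandas as pd


def modified_z_score(values):
    """Return 0.6745 * (x - median) / MAD for every value."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # Median absolute deviation; if it is 0 the score is undefined (division by zero).
    mad = np.median(np.abs(values - median))
    return 0.6745 * (values - median) / mad


# Toy prices with one obvious outlier (values are made up).
df = pd.DataFrame({"price": [80, 95, 100, 110, 120, 125, 130, 150, 900]})

threshold = 3.5  # commonly used cut-off for the modified z-score
mask = np.abs(modified_z_score(df["price"])) <= threshold
df_no_outliers = df[mask]
```

Because the score is built on medians rather than the mean and standard deviation, a single extreme value barely shifts the reference point it is measured against.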
Or enforce hardcoded conditions to your liking, like so:
df = df[df['price'] < 300]
Normalization & combine variables
Not to be confused with a normal distribution.
Normalization is the process of scaling individual columns to the same order of magnitude, e.g. 0 to 1.
You therefore only need to consider this step if you want to combine certain variables, or if you know that differences in scale otherwise affect your model.
There is a preprocessing library available as part of Sklearn. It can be instrumental in delivering on some of these data preparation aspects.
I found it hard to normalize certain columns and then neatly put them back into my DataFrame. Therefore I’ll share my own full example of combining beds/bedrooms/accommodates variables below:
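A minimal sketch of one way to do this, assuming scikit-learn's `MinMaxScaler` and a simple average of the scaled columns. The combined column name `size_score` and the sample values are illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy listings (values are made up).
df = pd.DataFrame({
    "beds": [1, 2, 4, 3],
    "bedrooms": [1, 1, 3, 2],
    "accommodates": [2, 4, 8, 6],
    "price": [80, 110, 250, 160],
})

size_cols = ["beds", "bedrooms", "accommodates"]

# Scale each column to the range 0..1. fit_transform returns a NumPy array,
# so wrap it in a DataFrame that reuses the original column names and index.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[size_cols]),
    columns=size_cols,
    index=df.index,
)

# Combine the scaled columns into one variable and drop the originals.
df["size_score"] = scaled.mean(axis=1)
df = df.drop(columns=size_cols)
```

Reusing `df.index` when wrapping the scaled array is what keeps the rows aligned after earlier filtering steps have left gaps in the index.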
Create normal distribution for your variables
It’s useful to look at all the data input variables individually and decide how to transform them to better fit a normal distribution.
I’ve used this for loop:
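A minimal sketch of such a loop, assuming a histogram per numeric column plus the skewness as a rough measure of how far a column is from symmetric. The data is randomly generated for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up, right-skewed data standing in for the real input variables.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.lognormal(mean=4.5, sigma=0.6, size=500),
    "reviews": rng.exponential(scale=20.0, size=500),
})

numeric_cols = df.select_dtypes(include="number").columns
skewness = {}

for col in numeric_cols:
    # Skewness near 0 means roughly symmetric; large positive values mean a long right tail.
    skewness[col] = df[col].skew()

    # One histogram per variable; in a notebook these render inline.
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.hist(df[col], bins=30)
    ax.set_title(f"{col} (skew = {skewness[col]:.2f})")
```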
Once you know which variables you want to transform, consider using a Box-Cox transformation like this:
Note how it is important to also return the list of lambdas that were used by the Box-Cox transformation. You’ll need those to invert the transformation when you are interpreting the results.
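A minimal sketch using `scipy.stats.boxcox`, which requires strictly positive values and returns both the transformed data and the fitted lambda. The function name `boxcox_transform` and the generated data are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import inv_boxcox


def boxcox_transform(df, cols):
    """Box-Cox transform the given columns; return the new DataFrame and the lambdas used."""
    out = df.copy()
    lambdas = []
    for col in cols:
        # With no lambda given, boxcox fits the lambda that best normalizes the column.
        transformed, fitted_lambda = stats.boxcox(out[col].to_numpy(dtype=float))
        out[col] = transformed
        lambdas.append(fitted_lambda)
    return out, lambdas


# Made-up, right-skewed and strictly positive data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.lognormal(mean=4.5, sigma=0.6, size=500)})

df_bc, lambdas = boxcox_transform(df, ["price"])

# The stored lambda is what makes it possible to get back to the original scale.
restored = inv_boxcox(df_bc["price"].to_numpy(), lambdas[0])
```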
The StandardScaler assumes your data is already normally distributed (see box cox transformation example above) within each feature and will scale them such that the distribution is now centered around 0, with a standard