A few years ago (2013), I got my Ph.D. by writing a fairly boring dissertation about a fairly interesting topic: how the quality of local schools affects the prices of homes. My go-to data analysis tool then, as now, was regression analysis. That’s a fairly broad family of tools (there’s linear regression, logistic regression, spatial regression, etc.), but on the whole they all work similarly: you’re trying to understand the relationship between some variables by modeling them in some mathematical way. For example, in my school quality and house prices example, we might see a fairly simple model that looks like this:
\(Price = \beta_0 + \beta_1 Sqft + \beta_2 Bedrooms + \beta_3 Baths + \dots \)
In this case, we’re saying that the square footage of a home, the number of bedrooms, the number of bathrooms, and some other variables I didn’t show probably affect the price of the home. Much of empirical economics is about correctly estimating those betas, because they tell us how much the variable on the left-hand side of the equation changes when one on the right changes. That is, economists are usually trying to measure a “causal” relationship: variable X changes, and this causes variable Y to change by some amount Z. Making sure our estimate of Z is more or less correct (unbiased, in stats terminology) is usually the end goal.
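To make that concrete, here’s a small sketch of estimating those betas with ordinary least squares in Python. All of the numbers (the “true” betas, the noise level, the sample size) are made up for illustration; the point is just that with enough data, the estimates land close to the values that generated it:

```python
import numpy as np

# Hypothetical example: simulate home sales with known "true" betas,
# then recover them with ordinary least squares.
rng = np.random.default_rng(0)
n = 500

sqft = rng.uniform(800, 3500, n)       # square footage
bedrooms = rng.integers(1, 6, n)       # 1-5 bedrooms
baths = rng.integers(1, 4, n)          # 1-3 baths

# "True" model: Price = 50,000 + 100*Sqft + 10,000*Bedrooms + 5,000*Baths + noise
noise = rng.normal(0, 20_000, n)
price = 50_000 + 100 * sqft + 10_000 * bedrooms + 5_000 * baths + noise

# Design matrix with a column of ones for the intercept (beta_0)
X = np.column_stack([np.ones(n), sqft, bedrooms, baths])

# Least-squares estimates of beta_0 through beta_3
betas, *_ = np.linalg.lstsq(X, price, rcond=None)
print(betas)
```

The printed estimates should sit near 50,000, 100, 10,000, and 5,000, with some sampling error from the noise term.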
However, not everyone wants to measure that causal relationship. Sometimes correct predictions are far more important than understanding the true nature of the model. I would argue that’s the case in most business settings.
A few years ago, I began hearing (as I think most people did) about “machine learning”. It was revolutionizing data analysis and prediction. About the same time the profession of “data science” was getting hot. If you ask 100 people what data science is, you’re going to get 100 different answers. However, the best definition I’ve heard so far has been “someone who knows more about programming than the normal statistician (or economist) and more about statistics than the normal programmer”. That’s because a lot of “data science” is done by writing (sometimes complex) programs/scripts that run an analysis on a data set.
Usually, in the cases of machine learning and data science, good predictions are far more important than understanding the underlying mechanism that leads to those predictions. Don’t get me wrong, a good model should also be fairly interpretable, but at the end of the day if I can get a very accurate prediction of the price of the home without really understanding why the model predicted that price, I’m still pretty happy.
So, this project is going to be data science from an economist’s perspective. I’m going to be working my way through some data science texts and posting about the topics, along with code to demonstrate the material. I have a fairly eclectic programming background but as far as I can tell Python has become the language of choice for most of the work in data science, so that’s what I’ll be using most of the time.
I’m first going to tackle An Introduction to Statistical Learning (ISL) by James, Witten, Hastie and Tibshirani. All of the code in ISL is written in R, another great language for statistical programming. However, as I said above, I’ll be using Python and so my code and output might be somewhat different.
Without further ado, let’s get going.