7 Minute Read

Published Nov 2022

I wrote this post to deepen my understanding of correlation. I felt I understood it in the logical sense, but my intuition wasn’t quite there. As I’ve been getting deeper into machine learning, I’ve felt it’s important that my fundamentals are well grounded. This is the result of my exploration, in a presentable form.



If it snows, how likely is there to be ice? You might say it’s very likely. If so, you just commented on the correlation between snow and ice. Pearson’s correlation coefficient, r, turns that judgment into a number between -1 and 1, based on data such as records of when it snowed and the temperature. Essentially, given an independent and a dependent variable (e.g. snowfall and ice), how closely are they correlated?
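To make the idea concrete, here is a minimal sketch of Pearson's r written from scratch in Python. The function name `pearson_r` and the snowfall and ice numbers are my own illustrative assumptions, not real measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how the two variables move together around their means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator pieces: how much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return co_deviation / math.sqrt(spread_x * spread_y)

# Made-up illustrative data: snowfall (cm) and icy patches counted on a road.
snowfall_cm = [0, 2, 5, 8, 12, 15]
icy_patches = [0, 1, 3, 4, 7, 8]

r = pearson_r(snowfall_cm, icy_patches)
print(f"r = {r:.3f}")
```

With these invented numbers r comes out close to 1, which matches the intuition that more snow tends to go with more ice. A value near -1 would mean one goes up as the other goes down, and a value near 0 would mean no linear relationship.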

Dangers of correlation

It is important to note that the correlation coefficient is useless on its own. Without further quantitative analysis, such as outlier detection, and qualitative analysis, such as common sense, you’ll end up believing that people consuming cheese and people dying from their bedsheets are meaningfully related.

r=0.947. More interesting correlations on https://www.tylervigen.com/spurious-correlations
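One way such a number can arise is when two unrelated quantities both happen to drift upward over the same years. The sketch below uses two invented series (`series_a` and `series_b` are hypothetical, not the cheese or bedsheet data) and a small helper `pearson_r` defined here for illustration. Comparing year-over-year changes instead of raw values is just one possible sanity check, not a complete test for spuriousness.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences (illustrative helper)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# Two invented yearly series with nothing in common except an upward trend.
series_a = [10.0, 10.7, 10.9, 11.6, 12.1, 12.4, 13.2, 13.4, 14.1, 14.5]
series_b = [200, 228, 265, 289, 316, 355, 381, 404, 446, 470]

r_raw = pearson_r(series_a, series_b)

# Year-over-year changes remove the shared trend from both series.
changes_a = [b - a for a, b in zip(series_a, series_a[1:])]
changes_b = [b - a for a, b in zip(series_b, series_b[1:])]
r_changes = pearson_r(changes_a, changes_b)

print(f"raw values: r = {r_raw:.3f}")
print(f"yearly changes: r = {r_changes:.3f}")
```

On the raw values r is very high, but once the shared trend is removed the correlation is weak. The high r was describing "both went up over time", not any link between the two quantities.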


Outliers

The correlation coefficient can be wildly skewed by outliers in your data, so when calculating r, data scientists typically remove them. In practice, however, there is no such thing as an outlier: when describing real-world phenomena, outliers are part of the equation. In machine learning, collecting, cleaning, and understanding the data is arguably more important than the machine learning problem itself, and you can typically learn a lot by asking questions about outliers.

In a machine learning problem you should never simply remove outliers. Instead, consider them a different regression problem and describe them with another function. But why is r sensitive to outliers? To understand, we must break down the equation for r.
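Before breaking the equation down, it can help to see the sensitivity in numbers. This is a minimal sketch with made-up points: eight that sit almost exactly on a line, plus one hypothetical stray point. The helper `pearson_r` and all values are assumptions for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences (illustrative helper)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# Eight invented points lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
r_clean = pearson_r(xs, ys)

# Add a single point far from the pattern (a hypothetical outlier).
xs_out = xs + [9]
ys_out = ys + [-10.0]
r_with_outlier = pearson_r(xs_out, ys_out)

print(f"without outlier: r = {r_clean:.3f}")
print(f"with one outlier: r = {r_with_outlier:.3f}")
```

A single stray point is enough to drag r from nearly 1 down to nearly 0, even though the other eight points have not changed at all.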

The equation

Let’s start with a set of points that we can plug into the equation.