Correlation vs. Causality
The classic statement is that correlation, on its own, is not indicative of causation. This post was inspired by Kaggle.com’s correlation vs causation lesson.
Correlations
When evaluating data, especially enormous data sets, you are very likely to notice correlations. The way to look at correlations is:
Given two variables, if they tend to change together, either in the same direction or in opposite directions, this is a correlation.
Of the two (or more) variables changing, neither may directly cause the other to change. The number of pirates on the planet does not cause the temperature of the planet to decrease, even though the data from the past 100 years shows the two correlating inversely over time. And vice versa: hotter weather does not, on its own, mean fewer pirates.
(edit: it’s not impossible that, to some extent, planetary temperature drives socio-economic phenomena that in turn affect the number of people willing to steal via sea vessels to survive. So more or fewer pirates because the planet is hotter could actually be causal at certain levels.)
So, to know the difference between correlation and causation, let’s be sure what each of them is.
Identifying Correlations
Of the ways to do this, the one I found most dependable is Pearson's correlation coefficient, or simply the correlation coefficient. It measures how closely two variables follow a straight-line (linear) relationship.
The coefficient of correlation is a number ranging from -1 to 1. A score of -1 indicates a perfect negative correlation, a score of 1 indicates a perfect positive correlation, and a score near 0 indicates little or no linear relationship. Either extreme would mean there are no instances where the two variables fail to move together in the same manner.
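To make the definition concrete, here is a minimal sketch of the coefficient computed by hand in plain Python. The function name `pearson` and the sample lists are made up for illustration; in practice a library routine would normally be used instead.

```python
import math


def pearson(xs, ys):
    """Pearson's r: summed products of deviations, scaled by both spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how the two variables deviate from their means together.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: how much each variable varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return cov / math.sqrt(var_x * var_y)


# Made-up sample data for illustration only.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves in lockstep with hours
falling = [10, 8, 6, 4, 2]   # moves exactly opposite to hours

r_up = pearson(hours, rising)      # at (or within rounding of) 1
r_down = pearson(hours, falling)   # at (or within rounding of) -1
```

Swapping the two arguments gives the same value, since the formula treats both variables identically.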
If I’m thinking about the things I want to know about a correlation, the first three things are:
The strength of the relationship
The direction of the relationship (positive or negative)
How the relationship compares with others (using the same ‘measurement’ across datasets)
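The three points above can be read straight off a coefficient, as in the sketch below. The `describe` helper, the 0.7 and 0.3 cut-offs, and the two example coefficients are all assumptions chosen for illustration; there is no single agreed threshold for "strong" or "weak".

```python
def describe(r, strong=0.7, moderate=0.3):
    """Return (strength, direction) labels for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)  # strength ignores the sign
    if size >= strong:
        strength = "strong"
    elif size >= moderate:
        strength = "moderate"
    else:
        strength = "weak"
    # Direction comes from the sign alone.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


# Hypothetical coefficients from two different datasets. Because r has no
# units, the two can be ranked against each other directly.
results = {"dataset_a": -0.85, "dataset_b": 0.40}
ranked = sorted(results, key=lambda name: abs(results[name]), reverse=True)
summary = {name: describe(r) for name, r in results.items()}
```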
The correlation coefficient makes it possible to find these things. It helps to have established beforehand which variable is independent and which is dependent, although the coefficient itself is symmetric and gives the same value either way. I suppose you could just trial and error it with two columns from a DataFrame, and stumble upon insights. And why not!? Accidents sometimes make for the greatest discoveries.
Either way, being able to find those points listed above will establish a correlation to a sufficient degree to take further action.
Causality
Now we get to the point where it’s time to understand if a correlation is something more. Is one of the variables we are comparing the cause of the other? How can we know this? You would be right in guessing that even finding extremely strong correlations using the above method is not enough. There needs to be more.
That said, correlation itself is definitely a big part of determining causality, which is likely one of the reasons the two terms are frequently misunderstood.
Determining Causality is Hard
Finding correlations is almost as easy as reading the measurements. A lot of the time correlations pop right out, and they are difficult to ignore. But three criteria must be met in order to mostly or completely prove causality:
Correlation
Temporal Precedence
Non-Spuriousness
Our Old Friend Correlation
For something to be an effect of a cause, it must happen repeatably. If you flip a light switch, you expect the light to turn on. Understanding that the switch itself is what interrupts current to the bulb is understanding the causal relationship between the light switch and the bulb being on or off.
An important note about correlation as a ratio: a 1:4 ratio (say, the light turning on only one time in four that the switch is flipped) doesn't necessarily rule out causation. It might just mean that there are other factors involved (e.g., faulty wiring intermittently interrupting the circuit, a bulb that's reaching the end of its life). Causation can exist without a 1:1 ratio if there are other causal factors involved.
Temporal Precedence
Outside of quantum phenomena, we expect all things in the macro world to have the cause happening before the effect. If the triggering event does not happen prior to the effect, it cannot be said to be the cause. It sounds like common sense, but it’s part of the requirements.
Non-Spuriousness
How certain are you that the tests for causality encompass all the variables that can trigger a specific correlation? The relationship between the cause and the effect cannot be due to a third variable. In other words, there's no plausible variable being left out that could be causing the effect. This elimination process is a requirement, else you cannot say with certainty that an effect is the result of some specific cause.
Conclusion
While correlations may reveal interesting relationships between variables, they do not on their own demonstrate causality. Determining true causation requires meeting additional, more stringent criteria beyond just a correlation - namely temporal precedence of the cause happening before effect, and non-spuriousness ruling out other variables as alternative causes.
Correlation and causation are not interchangeable terms, and each has distinct meanings - correlation being the mutual relationship between two variables, and causation being the mechanism by which one variable directly influences another.
As an ML Engineer, I’ll need to resist inferring causation solely from observed correlations, no matter how strong the relationship appears. Proving causation, with adequate testing and procedures, is an entire set of separate steps. That said, discovering correlations is an important first step toward uncovering potential causal avenues worth investigating further through rigorous statistical analysis.