What is Linear Regression?
It’s a way to establish a relationship between one or more input variables and an outcome, describing behavior that is predictable.
I know that’s very vague, but I’m keeping it that way on purpose for now. That established relationship can be expressed via a graph, where the dimensions of the graph (x, y, etc.) describe how the variables are being compared.
I like to think our brains do some form of this naturally, since we know from memory “how much something usually costs” and are surprised when we estimate the value of something and it turns out much higher or much lower than we expected. Some of our intuitions are probably a natural variant of regression, such as guessing where fruits and vegetables are likely to be found, where water is likely to be found, and maybe even where danger is likely to appear.
In this post, I’ll explore Simple Linear Regression and its relationship to Machine Learning.
What is Regression?
From Dictionary.com:
A return to a former or less developed state
A measure of the relation between the mean value of one variable (e.g. output) and the corresponding values of other variables
I’ve worked in software for a bit over a decade, and my use of the word regression usually refers to a feature that was working before a code change and was discovered to be buggy afterward. Software regressions are probably among the most common bugs in a given app.
It’s interesting linguistically, but I’m interpreting the mathematical term regression as ‘retreating back to a norm’: a correction of anomalous behavior. This poses several questions for me regarding societal norms, but that’s for another post, and for a time in my life when I have a better understanding of both of those topics.
In statistics, regression can be understood as identifying relationships between independent and dependent variables. In ML, the term describes the modeling technique used to estimate that relationship, but the focus is on predicting outcomes rather than on determining causality between the variables.
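To make that distinction concrete, here’s a minimal sketch of regression used the ML way. This is my own toy example, assuming scikit-learn and made-up data; the point is that after fitting, we mostly care about what the model predicts for new inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one independent variable x, one dependent variable y
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x

model = LinearRegression().fit(x, y)

# The ML framing: we care about the prediction for an unseen input,
# not about arguing that x causes y.
print(model.predict(np.array([[6.0]])))  # ≈ 12
```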
What Makes Linear Regression ‘Linear’?
It has everything to do with the parameters. The relationship defined by the regression doesn’t necessarily have to be linear; what matters is that the model is linear in its parameters. A quick math example:
y = β₀ + β₁x₁ + β₂x₂
Above, each term is a parameter (β) multiplied by a variable (x), and the terms are summed. That sum is a linear function of the parameters, regardless of whether the relationship between the variables actually ends up plotting as a straight line. We could even set x₂ = x₁² and it would still be linear regression, because the βs still enter the equation linearly.
This confused me, because I was expecting straight lines every time. I’m not currently aware of which case is more common, as I’m still actively learning this stuff.
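Here’s a toy sketch of that idea (my own example, assuming NumPy): the model includes an x² feature, so the fitted curve bends, yet it’s still linear regression because the βs enter linearly and ordinary least squares can solve for them.

```python
import numpy as np

# Toy data following a curve: y ≈ 2 + 0.5x + 0.3x², plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 0.5, x.size)

# Design matrix: a column of ones (for β₀), x (for β₁), and x² (for β₂).
# The model is linear in the βs even though x² makes the plot a curve.
X = np.column_stack([np.ones_like(x), x, x**2])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)

print(betas)  # estimates of β₀, β₁, β₂ recovered by least squares
```

I’m building the design matrix by hand here to make the ‘linear in parameters’ point visible; something like np.polyfit would do the same job in one call.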
An Example
I used randomized data above, so the graph doesn’t do the idea the justice it could. On the other hand, this might not be atypical in certain housing markets. The regression line appears in red, angled upward, with price increasing as square footage increases. The relationship is that more square feet means higher cost, and we can predict that behavior pretty accurately: the randomized data trends upward at nearly the same angle as the regression line.
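If you want to recreate a graph along those lines, here’s a rough sketch with synthetic data (assuming NumPy and Matplotlib; these numbers are invented for illustration, not my original data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic housing data: price trends upward with square footage
rng = np.random.default_rng(42)
sqft = rng.uniform(800, 3500, 100)
price = 100_000 + 150 * sqft + rng.normal(0, 40_000, sqft.size)

# Simple linear regression: fit a degree-1 polynomial (slope + intercept)
slope, intercept = np.polyfit(sqft, price, 1)

plt.scatter(sqft, price, alpha=0.5)
plt.plot(sqft, slope * sqft + intercept, color="red")  # regression line
plt.xlabel("Square Feet")
plt.ylabel("Price ($)")
plt.title("Price vs. Square Footage")
plt.show()
```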
Takeaway:
The graph above suggests that we can predict that home prices will be higher when the square footage is higher. Square footage isn’t necessarily directly responsible for the increase, though, and other variables may correlate more strongly.
Conclusion
This is a dramatic over-simplification of what Linear Regression is, but it’s at least enough detail to understand what can be accomplished using it, and a little introduction to what it is. I’ll of course be using it more over the coming months while I learn more about how to get work done with ML systems.