Backpropagation in ML Systems
Backpropagation is an unavoidable term when studying Deep Learning. When I run into terms like this that I don’t understand, I can only encounter so many before I feel lost and need to backtrack. This post is that backtracking, to better understand what is happening further along in the process.
I want to get more comfortable with what Backpropagation is and what it accomplishes. So ahead is a shallow-ish dive into Backpropagation: What and Why.
Backpropagation is an Algorithm
And if I were to take a stab at specifically what Backpropagation does, I would summarize it in two items (a rough sketch of the update step behind both follows just below):
Update the relationships (weights) between nodes in a network so that the output lands closer to what is expected
Reduce the cost (the measured error) across the training data, and thus hopefully also improve the performance of the model on real-world data.
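Here is a minimal sketch of that update step, assuming plain gradient descent; the names weight, grad, and learning_rate are purely illustrative and not from any particular library.

```python
# A single weight update step, assuming plain gradient descent.
learning_rate = 0.01   # how big a tuning step to take
weight = 0.5           # one relationship ("edge") between two nodes
grad = 0.2             # how much the loss changes as this weight changes
weight = weight - learning_rate * grad   # nudge the weight to reduce the loss
```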
I don’t know that I can correctly describe HOW this works without using a bunch of math. But I’ll give it a shot:
During training, data flows “forward” through a network and produces various outputs. Once these outputs are generated, they are collected and a backward traversal through the network is initiated. As the backward traversal moves along, the outputs from that specific iteration are used to tune the weights assigned to the relationships between specific nodes in the network.
So it turns out that the term Backpropagation is about as literal as you can get. It describes the ‘crawl’ of a process backward across the network of nodes, and its impact on the relationships between the nodes as it moves along.
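To see what that backward crawl can look like, here is a minimal sketch of one forward pass and one backward pass through a toy network, assuming a single hidden layer, a sigmoid activation, and a mean squared error loss; all of the names and numbers (x, y_true, W1, W2, lr) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(3,))        # one input example with 3 features
y_true = np.array([1.0])         # the expected output for that example

W1 = rng.normal(size=(4, 3))     # weights: input layer -> hidden layer
W2 = rng.normal(size=(1, 4))     # weights: hidden layer -> output layer
lr = 0.1                         # learning rate (how big each tuning step is)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# ---- forward pass: data flows "forward" through the network ----
h = sigmoid(W1 @ x)              # hidden layer activations
y_pred = W2 @ h                  # network output
loss = np.mean((y_pred - y_true) ** 2)

# ---- backward pass: the error propagates back, layer by layer ----
d_y = 2 * (y_pred - y_true) / y_pred.size   # dLoss/dOutput
d_W2 = np.outer(d_y, h)                     # gradient for output-layer weights
d_h = W2.T @ d_y                            # error pushed back to the hidden layer
d_W1 = np.outer(d_h * h * (1 - h), x)       # gradient for hidden-layer weights

# ---- update: tune each weight slightly against its gradient ----
W2 -= lr * d_W2
W1 -= lr * d_W1
```

Real libraries compute these gradients automatically, but the shape of the process is the same: forward to get an output, backward to assign blame, then a small nudge to each weight.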
How does the system measure “error”?
There are multiple ways of measuring this error, but a common one is a loss function (also called a cost function). Loss functions are used in all sorts of ways, across a variety of algorithms. For example, I have a post about Linear Regression, and linear regression also uses a loss function when creating the regression line.
This explanation is simplified and slightly wrong, but go with me for a second because the general idea is still accurate.
In the process of training, there is a notion of what the output is expected to be (for my purposes here, think of it as a rolling average of the expected output). The loss function is used to find the difference between the latest output and that expectation. Using that difference, the process iterates back through the network and tunes relationships to slightly skew the output so that it will be closer to the expectation on the next iteration.
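As a concrete example, here is a minimal sketch of one common loss function, mean squared error, with made-up numbers; in practice the comparison is usually made against the known target values from the training data.

```python
# Mean squared error over a tiny, made-up batch of three examples.
def mse_loss(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets     = [1.0, 0.0, 1.0]   # the expected outputs
predictions = [0.8, 0.3, 0.6]   # the model's outputs for those examples

print(mse_loss(predictions, targets))   # ~0.097; smaller means less error
```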
Contrary to my intuition, this seems to largely improve prediction accuracy. I would have thought that the added diversity in larger data sets would skew these measurements and make the average just become “more frequently more wrong” with each adjustment, but I suppose the models are capable of establishing ‘considerations’ for these variables without understanding them. They seem to brute-force-classify the properties that produce outliers and weight the output to handle them.
It’s an insane thing. But I suppose our brains likely function in a similar way.
It does have a biological basis
It seems like Backpropagation might not just be some algorithm invented by some math nerd; it may be a more universal phenomenon, describing behavior that is inherent in the mind and potentially in other elements of natural selection as well.
Backpropagation has been discovered independently multiple times over the decades. From the 1940s to the 1980s it was noticed and written about by various folks, only finding real adoption in the computer science world once computing power had grown enough to actually carry out the work needed for an effective implementation.
I don’t know enough to go further on this topic, but I find it extremely interesting that a natural phenomenon likely happening in our own minds on multiple levels is also a driving factor behind the ability to synthetically predict outcomes, in many cases with greater precision than the human mind.
Conclusion
Backpropagation is the act of tuning relationships (updating weights) between nodes in the layers of a model designed to make predictions or produce outputs for a specific type of data. This process enables a kind of learning where the quality of the output can improve with a more diverse data set and a larger number of features, as opposed to narrower slices that are only good at identifying specific correlations or anomalies.