No Propagation: Training Neural Networks without Backpropagation
Part 1: Just the Mathematics
Alright. It’s been a few weeks since my last article. I’m sorry for taking a break. I’d like to say it was because I was busy preparing for my master’s, getting my registration and enrollment at the university settled. That would be the ideal reason to give, right?
But that is not the case. The long pause between my articles happened because I lost a bit of my drive and motivation. No, don’t worry. It’s not because I’m stressed, depressed, lonely, sad or anything like that. It happened because I achieved something I could only dream of when I started this journey. I’ve been on national television.
Yeah, two weeks ago I was given the opportunity to talk about my book (AI untuk Pemula), which I wrote with my colleagues. That sliver of fame suddenly made me think that I had achieved it all. I needed a real break to convince myself that the journey is just beginning. So, here I am today, after a few weeks’ break, writing my next article on AI. And this time, I will cover a new paper that was released last month, called No Propagation.
As Usual, the Abstract
The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter. Given the stacked structure of neural networks, where each layer builds on the representation of the layer below, this approach leads to hierarchical representations. More abstract features live on the top layers of the model, while features on lower layers are expected to be less abstract. In contrast to this, we introduce a new learning method named NoProp, which does not rely on either forward or backwards propagation. Instead, NoProp takes inspiration from diffusion and flow matching methods, where each layer independently learns to denoise a noisy target. We believe this work takes a first step towards introducing a new family of gradient-free learning methods, that does not learn hierarchical representations — at least not in the usual sense. NoProp needs to fix the representation at each layer beforehand to a noised version of the target, learning a local denoising process that can then be exploited at inference. We demonstrate the effectiveness of our method on MNIST, CIFAR-10,and CIFAR-100 image classification benchmarks. Our results show that NoProp is a viable learning algorithm which achieves superior accuracy, is easier to use and computationally more efficient compared to other existing back-propagation-free methods. By departing from the traditional gradient based learning paradigm, NoProp alters how credit assignment is done within the network, enabling more efficient distributed learning as well as potentially impacting other characteristics of the learning process.
Yeah, this paper actually introduces a pretty revolutionary idea. For a very long time, since its popularization by Geoffrey Hinton, backpropagation has been the accepted standard method for training a neural network. Yeah, there have been some variations on it, for example Backpropagation Through Time (BPTT), which is designed for neural networks with hidden states (sequential neural networks). These variations differ in the details, but the basic tenets are the same.
The loss function produces an error signal at the output, and that signal is propagated backwards through the layers that you stacked on top of each other. The learnable parameters of each layer are adjusted based on their contribution to the overall error.
Sounds easy to understand, right?
But based on this paper, the backpropagation method actually introduces several issues during training of the neural network. Among them are:
High computational and memory overhead, because the intermediate values needed for the gradient have to be stored and carried over during training. This is an issue in itself, as I unintentionally covered in my implementation of the STDE paper. https://medium.com/@maercaestro/stde-stochastic-taylor-derivative-estimator-the-winning-neurips-2024-paper-from-singapore-79a7ccc3dbfc
The second issue mentioned in the paper is that backpropagation is biologically implausible. This is a known issue, something that Geoffrey Hinton himself has tried to resolve. He knows that we as humans don’t learn through backpropagation. So he has tried to design learning algorithms that mimic how neurons learn, but so far he does not seem to have fully succeeded.
Finally, backpropagation happens in sequence, meaning that the error signal and the gradients are carried from the output layer back to the input. That sequential nature can cause issues and lead to forgetting. It also prevents parallel computation. So researchers have been actively trying to resolve this one.
So, that’s why this paper introduces No Propagation (NoProp), a new method of training neural networks without backpropagation. It takes inspiration from diffusion methods, where each layer learns to denoise a noisy target.
How exactly do they do it? That’s what we will cover in the next section.
Methodology
Alright, now we need to understand the methodology. There are a lot of mathematical equations here to unpack and understand. I won’t go into the details of every equation. What we will do is understand the high-level concept of what they’re actually doing to achieve this no-propagation methodology. So, let’s peel this off one by one.
Understanding first what happens during backpropagation
Alright, by now we may have a good grasp of what happens during backpropagation. We start backpropagation from the loss function. We measure the difference between our predicted output and the true output using whatever loss function we chose. Once we have the error value, it is propagated back through all the layers, carried by the error gradient. During this backpropagation, all the learnable parameters (the weights and biases) are updated.
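To make those steps concrete, here is a minimal NumPy sketch of one backpropagation training loop on a tiny two-layer network. The toy data, the layer sizes and the learning rate are all assumptions picked for illustration; the point is only to show the error signal travelling from the output back to the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): 64 samples, 3 features, 1 target.
X = rng.normal(size=(64, 3))
y = np.sin(X.sum(axis=1, keepdims=True))

# Learnable parameters of a two-layer network.
params = {
    "W1": rng.normal(scale=0.5, size=(3, 8)), "b1": np.zeros(8),
    "W2": rng.normal(scale=0.5, size=(8, 1)), "b2": np.zeros(1),
}

def train_step(p, lr=0.05):
    # Forward pass: the activation h has to be kept for the backward pass.
    h = np.tanh(X @ p["W1"] + p["b1"])
    y_hat = h @ p["W2"] + p["b2"]
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: start from the error at the output...
    d_yhat = 2.0 * (y_hat - y) / len(X)
    grads = {"W2": h.T @ d_yhat, "b2": d_yhat.sum(axis=0)}
    # ...and push it back through layer 2 and the tanh nonlinearity.
    d_pre = (d_yhat @ p["W2"].T) * (1.0 - h ** 2)
    grads["W1"] = X.T @ d_pre
    grads["b1"] = d_pre.sum(axis=0)

    # Every weight and bias is updated from the same global error signal.
    for name in p:
        p[name] -= lr * grads[name]
    return loss

losses = [train_step(params) for _ in range(200)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Notice that the first layer cannot be updated until the error has passed through the second one. That dependency is exactly what NoProp tries to remove.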
Alright, each parameter update will be different depending on the input and the output of our data. For example, say we’re training our neural network on language. Maybe the top-right part of the network gets updated more if it handles the structure/grammar of the language. And maybe the bottom part of the network gets updated more if it handles the context of the language.
This is what the authors of the NoProp method call a latent trajectory. It means that the latent information during training follows different trajectories, and these determine which part of the network gets updated. Do you see where this is going?
This means it seems unnecessary to update the entire network for one latent trajectory. What we need to do is stochastically pick the parts of the network that need to be updated based on the latent trajectory. And this is where NoProp comes in with its solution.
So, how might we start?
First, we don’t use the standard loss function as we normally do in backpropagation. Our training objective will be different. We will use the ELBO (Evidence Lower Bound) as our training objective. It is defined as below:
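In standard notation, for observed data x, a latent variable z and a helper distribution q(z|x) (all three are explained further down), the ELBO is usually written as:

```latex
\log p(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\big[\log p(x, z) - \log q(z \mid x)\big] \;=\; \mathrm{ELBO}(x)
```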
The ELBO is actually the loss function used for variational autoencoders. In my last autoencoder deep explainer, we saw that normal autoencoders have an issue: their latent space is not structured. Meaning, the regions of the latent space that belong to different categories overlap (they are entangled with each other). For the same input and output, there may be a separate part of the network structure that handles the specific context of the training.
However, when we change to a different input and output, a different part of the latent space (or network structure) will handle that specific context. The solution comes from variational autoencoders (VAEs), which use the ELBO as their loss function and training objective.
You may read more about my AE deep explainer here. https://medium.com/@maercaestro/understanding-autoencoder-part-2-navigating-the-tesseract-128b0ee39311
So, what does the ELBO do in essence? Say we want to model the probability of seeing some data x, where x may be an image, sound or text.
But your model has some latent variables, something that you cannot observe, for example the amount of ‘cuteness’ in that image. We can call that z. With z unobserved, your model is incomplete. You can’t model z without x, and you can’t model x without z. In mathematics, the way to handle this is to integrate over all possible values of z to get the probability of x.
But this integral is intractable, because z is too high dimensional (it has tooo many factors to account for) and the integral has no closed form.
So, to solve this, we use a helper distribution q(z|x). It is an approximate distribution over our latent variables that helps us approximate p(x). We can first write it as below:
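In standard notation, with p(x, z) the joint distribution of the data and the latent variable, the starting point is the log of the marginal likelihood:

```latex
\log p(x) = \log \int p(x, z)\, dz
```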
Then we add the distribution q(z|x) into our integral.
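Written out, this is the usual trick of multiplying and dividing by q(z|x) inside the integral, which turns it into an expectation under q:

```latex
\log p(x)
= \log \int q(z \mid x)\, \frac{p(x, z)}{q(z \mid x)}\, dz
= \log \mathbb{E}_{q(z \mid x)}\!\left[\frac{p(x, z)}{q(z \mid x)}\right]
```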
By using Jensen’s inequality (which is something I still need to understand in detail), the equation gives us a lower bound.
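For a concave function like the logarithm, Jensen’s inequality says that the log of an expectation is at least the expectation of the log. Applied to the expectation under q(z|x), that gives:

```latex
\log \mathbb{E}_{q(z \mid x)}\!\left[\frac{p(x, z)}{q(z \mid x)}\right]
\;\ge\;
\mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]
```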
That right-hand side is what we call the Evidence Lower Bound.
To use it in our training, we can write it as below:
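Splitting the joint as p(x, z) = p(x|z) p(z), the bound takes the form commonly used for training VAEs: a reconstruction term minus a KL term that keeps q(z|x) close to the prior p(z).

```latex
\mathrm{ELBO}(x)
= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
- D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)
```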
And we try to maximize it to make our model better.
By doing this, we can model x while accounting for all the latent variables and hidden states inside it. This is how we get the latent trajectory that we want when we train our model using NoProp.
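To check that the ELBO really is a lower bound, here is a small sketch on a toy model where everything has a closed form. The model is my own assumption for illustration (it is not from the paper): z ~ N(0, 1) and x | z ~ N(z, 1), so the true evidence is x ~ N(0, 2) and the true posterior is N(x/2, 1/2).

```python
import math

def log_evidence(x):
    """Exact log p(x) for the toy model, where x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s):
    """Closed-form ELBO for the helper distribution q(z | x) = N(m, s^2)."""
    # E_q[log p(x | z)]: expected Gaussian log-likelihood of x when z ~ q.
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    # KL(q(z | x) || p(z)) between N(m, s^2) and the prior N(0, 1).
    kl = 0.5 * (s ** 2 + m ** 2 - 1.0 - math.log(s ** 2))
    return recon - kl

x = 1.0
print("log p(x)              :", log_evidence(x))
print("ELBO, rough q         :", elbo(x, m=0.0, s=1.0))
print("ELBO, exact posterior :", elbo(x, m=x / 2, s=math.sqrt(0.5)))
```

A rough q gives a value below log p(x), and the gap closes when q matches the true posterior. That is why maximizing the ELBO makes the model better.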
And then what?
Alright, now we know our loss function, our training objective. What we want next is to use it for our training. Basically, in NoProp, instead of just ensuring that we predict the ground truth as closely as possible, we also want the model to learn a meaningful sequence of hidden/latent variables, which represents the internal hidden states (or trajectory) that the model uses to arrive at the prediction.
So, to account for all the latent variables, we take the outputs of all the layers inside our network and make them our z. All the hidden states will now be identified as z.
To better illustrate it,
x → z_0 → z_1 → ... → z_T → output
All the z will be sampled using the equation below:
This z sampling is done for both the forward direction (z_t) and the reverse direction (z_(t-1)).
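As a rough sketch of what such sampling can look like, the snippet below draws a noisy version of a label embedding from a Gaussian, in the style of diffusion models. The one-hot embedding u_y, the function name and the noise schedule alpha_bars are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def noisy_label_embedding(u_y, alpha_bar, rng):
    """Sample z ~ N(sqrt(alpha_bar) * u_y, (1 - alpha_bar) * I)."""
    eps = rng.normal(size=u_y.shape)
    return np.sqrt(alpha_bar) * u_y + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
u_y = np.eye(3)[1]                      # one-hot embedding of class 1 (assumed)
alpha_bars = np.linspace(0.0, 1.0, 5)   # hypothetical schedule: z_0 noisiest, z_T cleanest

for t, a in enumerate(alpha_bars):
    print(t, np.round(noisy_label_embedding(u_y, a, rng), 2))
```

With alpha_bar = 0 the sample is pure noise, and with alpha_bar = 1 it is exactly the label embedding. Everything in between is a partly noised target that a layer can learn to denoise.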
Once we have sampled all the layers, up to the final layer, we use the result to predict our y.
From there, we use the ELBO to compute the loss of our prediction. And here comes the novel part.
The Novel Part
The novel part here is that we’re actually using a modified ELBO as our loss function. Still remember our ELBO equation?
In the NoProp paper, the ELBO is written as below:
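Going by the three terms listed in the summary flow further down (cross-entropy, KL divergence on z_0, and an L2 loss per block), the objective has roughly the following shape. This is a sketch, not a copy of the paper’s equation: the per-block weights w_t, the block output \hat{u}_t and the label embedding u_y are notation assumed here, and the exact weights depend on the noise schedule, so take them from the paper. It is written as a loss to minimize, which is the negative of the ELBO.

```latex
\mathcal{L}_{\text{NoProp}} =
\underbrace{\mathbb{E}_{q(z_T \mid y)}\big[-\log p(y \mid z_T)\big]}_{\text{cross-entropy at the output}}
+ \underbrace{D_{\mathrm{KL}}\big(q(z_0 \mid y)\,\|\,p(z_0)\big)}_{\text{KL divergence on } z_0}
+ \underbrace{\sum_{t=1}^{T} w_t\, \mathbb{E}_{q(z_{t-1} \mid y)}\big\|\hat{u}_t(z_{t-1}, x) - u_y\big\|^2}_{\text{L2 loss per block}}
```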
In Summary
Alright, tooo much mathematics. My head is spinning already. But what we can do here is simplify all of this into a simple flow, as below:
Label y
│
▼
Sample z_T ~ q(z_T | y)
↓
Sample z_{T-1}, ..., z_0 ~ q(· | z_{t+1}) ← reverse path
↓
Forward z_0 → z_1 → ... → z_T using p(z_t | z_{t-1}, x) ← generative model
↓
Predict ŷ from z_T
↓
Compute NoProp loss (ELBO):
- Cross-entropy (from output)
- KL divergence (z_0)
- L2 loss per block
↓
Update weights locally
So, that’s basically the simple flow of our No Propagation method. And that’s the gist.
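To give a feel for the "update weights locally" step, here is a toy NumPy sketch. It is not the paper’s implementation. It assumes linear blocks, one-hot label embeddings and a made-up noise schedule, and it trains only the per-block L2 term (the cross-entropy and KL terms are left out). Each block gets a noised label embedding plus the input x and learns to predict the clean embedding, using only its own gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, T = 3, 256, 4                      # classes, samples, blocks (toy sizes)

labels = rng.integers(0, C, size=N)
U = np.eye(C)[labels]                    # label embeddings u_y (one-hot, assumed)
X = U + 0.1 * rng.normal(size=(N, C))    # toy inputs that reveal the label

alpha_bar = np.linspace(0.1, 0.9, T)     # hypothetical noise level seen by each block
W = [np.zeros((2 * C, C)) for _ in range(T)]   # one linear block per step

def block_loss_and_grad(W_t, a):
    """Local L2 denoising loss of one block and its gradient for that block only."""
    z = np.sqrt(a) * U + np.sqrt(1.0 - a) * rng.normal(size=U.shape)  # noised target
    feats = np.hstack([z, X])            # the block sees the noisy target and the input
    err = feats @ W_t - U                # predicted embedding minus clean embedding
    return np.mean(np.sum(err ** 2, axis=1)), 2.0 * feats.T @ err / N

initial = [block_loss_and_grad(W[t], alpha_bar[t])[0] for t in range(T)]
for step in range(300):
    for t in range(T):                   # blocks are independent: no error passes between them
        _, grad = block_loss_and_grad(W[t], alpha_bar[t])
        W[t] -= 0.05 * grad              # update this block's weights locally
final = [block_loss_and_grad(W[t], alpha_bar[t])[0] for t in range(T)]

def predict(x):
    """Simplified inference: start from noise and let each block refine z in turn."""
    z = rng.normal(size=(len(x), C))
    for t in range(T):
        z = np.hstack([z, x]) @ W[t]
    return z.argmax(axis=1)

accuracy = np.mean(predict(X) == labels)
print("initial block losses:", np.round(initial, 3))
print("final block losses  :", np.round(final, 3))
print("train accuracy      :", accuracy)
```

Because no block needs a gradient from any other block, the inner loop over t could run in parallel. The inference step here simply chains the blocks, which is a simplification of the sampling described above.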
In the next part, we will try to implement the No Propagation method fully on a neural network. I’m still training it, and the results don’t look good so far (there’s no official implementation yet), but we will see how it goes.
















