Understanding Feed-Forward Mechanisms and Back Propagation


Finally, we are getting to the heart of deep neural networks and can discuss feed-forward neural networks in more detail. I will try to simplify, as much as possible, the mathematical steps needed to build a network of this type which, as I mentioned in the previous article, is characterized by the fact that the output of one layer is used as the input of the next layer. Feed-forward neural networks are important because, in general, they can be used to solve problems that require classifying the input data. They are also the basis for much more complex neural network models, including, for example, CNNs (convolutional neural networks), which are often used for the classification, tracking, and comparison of images. CNNs implement different levels of specialization and are able to extract “features”, which act as the signature that distinguishes one content from another, but we will talk about this in other articles.

Just to provide one more definition: multilayer feed-forward neural networks are also called multilayer perceptrons (MLP) and are among the most studied and widely used neural network models.

Let’s try to organize the next steps, hopefully not too boring.

In this article, I will explain feed-forward networks while keeping the mathematics as simple as possible and, to that end, I will use sigmoid neurons exclusively. This choice will pay off later, when we look at the error correction mechanism of the back propagation method, which is essential for understanding in detail how the training phase of a supervised network works.

Using a supervised neural network means dividing the problem into two phases:

  • Training Phase
  • Prediction Phase

In the first phase, we focus on creating a model/classifier by training the network, while the second phase lets us use the trained model to make predictions or classifications.

At this point, let’s try to understand how to build a neural network for this purpose.

Let’s start by recalling a few simple concepts already introduced at length in previous articles.

We define the internal activation (A) of the neuron as the weighted sum of all input signals.

In the specific case of a neuron with 4 inputs and one output, for example:


A = x1*w1 + x2*w2 + x3*w3 + x4*w4 + b


where b is a constant called the bias, which is used to better adapt the network and can be imagined as an additional neuron that accepts no input values and always emits a constant signal. Since we are using sigmoid neurons, the output Y of the neuron in this case is the sigmoid of A:

Y = 1 / (1 + e^(-A))
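As a minimal sketch, the four-input neuron can be computed in a few lines of Python, assuming the output Y is the sigmoid of A (the article uses sigmoid neurons throughout). The names `sigmoid` and `neuron_output`, and all the numeric values, are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(a):
    """Logistic function: maps any real activation to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(inputs, weights, bias):
    """Compute A as the weighted sum of the inputs plus the bias, then Y = sigmoid(A)."""
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(activation)

# Hypothetical inputs, weights and bias (not taken from the article)
x = [1.0, 0.5, -1.0, 2.0]
w = [0.2, -0.4, 0.1, 0.3]
b = 0.1

y = neuron_output(x, w, b)
print(f"Y = {y:.4f}")
```

Whatever values are chosen, Y always stays strictly between 0 and 1, which is what makes the sigmoid convenient for classification.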


Let’s see what all this means with some practical examples. To simplify further, we reduce the complexity of the problem by taking as our example a neuron with one input and one output.

In this case, the internal activation (A) is:

A = x*w + b

while Y is

Y = 1 / (1 + e^(-(x*w + b)))

At this point, it becomes clear that the behavior of the neural network changes depending on the values of w and b chosen during the training phase (finding them is our goal). To understand the concept, let’s run a few small experiments: we pick values of w and b at random and apply them to the neuron model. This lets us see experimentally how the neuron’s output function, and therefore the network’s behavior, varies. Several tools can be used for this; here I relied on Youmath.
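The same kind of experiment can be sketched in Python instead of a plotting tool. The (w, b) pairs below are hypothetical, not the ones used in the tests of this article; the function name `neuron` is also just an illustrative choice.

```python
import math

def neuron(x, w, b):
    """One-input sigmoid neuron: Y = 1 / (1 + e^(-(x*w + b)))."""
    return 1.0 / (1.0 + math.exp(-(x * w + b)))

# Hypothetical parameter pairs: a baseline, a larger weight, a non-zero bias
params = [(1.0, 0.0), (5.0, 0.0), (1.0, 3.0)]
xs = [-4, -2, 0, 2, 4]

for w, b in params:
    row = [round(neuron(x, w, b), 3) for x in xs]
    print(f"w={w}, b={b}: {row}")
```

Reading the printed rows, a larger w makes the transition from 0 to 1 steeper, while b shifts the curve along the x axis: the output crosses 0.5 where x*w + b = 0.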

Test #1




Test #2




Test #3




Looking at the three graphs, it is evident how much the behavior of the neuron changes as the parameters w and b vary. Searching for their optimal values is the purpose of the training phase; in other words, finding the right values of w and b means finding the neural network that performs best in the prediction phase.

Let’s make things a little more complicated and run three more tests with a more complex neuron which, this time, has two inputs and a single output.


A = x1*w1 + x2*w2 + b

In this case, our Y will be:

Y = 1 / (1 + e^(-(x1*w1 + x2*w2 + b)))

Again, let’s run some tests with randomly chosen values of w and b:

Test #4





Test #5




Test #6




Again, the figures make it easy to see how the behavior of the neuron changes significantly as the weights vary.
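For readers without a 3D plotting tool, the two-input neuron can be explored with a small Python sketch that prints Y on a grid of input points. The weights and bias are hypothetical, and `neuron2` is just an illustrative name.

```python
import math

def neuron2(x1, x2, w1, w2, b):
    """Two-input sigmoid neuron: Y = sigmoid(x1*w1 + x2*w2 + b)."""
    a = x1 * w1 + x2 * w2 + b
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical weights and bias (not the ones used in the article's tests)
w1, w2, b = 2.0, 2.0, 0.0
points = [-2.0, -1.0, 0.0, 1.0, 2.0]

# One printed row per x2 value, one column per x1 value
for x2 in points:
    row = [f"{neuron2(x1, x2, w1, w2, b):.2f}" for x1 in points]
    print(f"x2={x2:+.0f}: " + "  ".join(row))
```

With these values, every point where x1 + x2 = 0 gives exactly Y = 0.5: the surface is a smooth step whose "edge" is a straight line in the input plane, and changing the weights rotates or shifts that line.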

Of course, using a single neuron is not always possible, nor is it always the best solution. A single neuron can be used only to solve linearly separable problems. In that case, the learning algorithm finds a solution in a finite number of steps; otherwise, it keeps cycling through weights without ever reaching a solution that fits all the data.

What does this mean?

It means that you can use a single neuron only when the expected values can be separated by a plane (a straight line when there are two inputs) or, in any case, by a boundary that is not a closed curve.
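To make the idea of linear separability concrete, here is a hedged sketch that trains a single neuron on the logical AND (linearly separable) and on XOR (not linearly separable). As an assumption for this example only, it uses the classic perceptron rule with a step activation rather than the sigmoid used in the rest of the article; the function name, the starting weights, and the epoch limit are all hypothetical.

```python
def train_perceptron(samples, max_epochs=100, lr=1.0):
    """Classic perceptron rule with a step activation (an assumption of this sketch).

    Returns (converged, weights, bias). 'converged' is True only if one full
    pass over the samples produced no classification errors.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), target in samples:
            output = 1 if x1 * w[0] + x2 * w[1] + b > 0 else 0
            delta = target - output
            if delta != 0:
                # Move the separating line toward the misclassified point
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:
            return True, w, b
    return False, w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged, and_w, and_b = train_perceptron(AND)
xor_converged, _, _ = train_perceptron(XOR)
print("AND converged:", and_converged)
print("XOR converged:", xor_converged)
```

AND is learned after a handful of passes, while XOR never reaches an error-free pass no matter how large `max_epochs` is, because no single straight line separates its two classes.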

And what if this is not possible?

In this case, the deep neural network comes to our aid, i.e., a multi-layer neural network, with which it is possible to model more complex curves.

Let’s give an example using one of the simplest deep neural networks, i.e., a neural network with two inputs, one output, and a hidden layer of 3 neurons.

To make this concrete, let’s write the mathematical model of the neural network. When implementing it, we must remember that the input layer does not compute a weighted sum: the output of the input layer neurons is exactly the input data vector.

Thus, doing some calculations:

h1 = F(x1*w11 + x2*w12 + b1)

h2 = F(x1*w21 + x2*w22 + b2)

h3 = F(x1*w31 + x2*w32 + b3)

Y = F(h1*w1 + h2*w2 + h3*w3 + b)

where F is always the sigmoid function, so our Y will be:

Y = 1 / (1 + e^(-(h1*w1 + h2*w2 + h3*w3 + b)))
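The four equations of this small network translate almost line by line into code. The following is a minimal sketch of the forward pass, assuming F is the sigmoid; the function name `forward`, the way the weights are grouped into lists, and all numeric values are hypothetical choices made for this example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x1, x2, hidden_weights, hidden_biases, out_weights, out_bias):
    """Forward pass of a 2-input, 3-hidden-neuron, 1-output network.

    hidden_weights[j] holds (wj1, wj2) for hidden neuron j, so that
    hj = F(x1*wj1 + x2*wj2 + bj), as in the equations of the article.
    """
    h = [
        sigmoid(x1 * wj1 + x2 * wj2 + bj)
        for (wj1, wj2), bj in zip(hidden_weights, hidden_biases)
    ]
    y = sigmoid(sum(hj * wj for hj, wj in zip(h, out_weights)) + out_bias)
    return h, y

# Hypothetical parameters, chosen only for illustration
hidden_weights = [(0.5, -0.3), (-0.8, 0.2), (0.1, 0.9)]
hidden_biases = [0.0, 0.1, -0.2]
out_weights = [1.0, -1.5, 0.7]
out_bias = 0.05

h, y = forward(1.0, 2.0, hidden_weights, hidden_biases, out_weights, out_bias)
print("hidden outputs:", [round(v, 4) for v in h])
print("Y =", round(y, 4))
```

Note that the inputs x1 and x2 are passed straight to the hidden layer: the input layer itself performs no computation.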

Below is one of the possible representations of Y as the weights vary:


All this shows how complex a neural network can become as the number of neurons and layers increases. Those who want to have fun can try to write the mathematical model of the network shown in the following figure, which is still very small compared to those normally used in practice:

At this point, only one last step remains: how do we choose the weights, and according to which principle? Again, a mathematical tool comes to our aid: the back propagation of the error. Thanks to it, the network is able to learn, automatically modifying its weights by comparing, during the training phase, the obtained result with the expected (real) one.

Once again, let’s use our simplified model of a neuron with a single input and a single output:

A = x*w + b

In this case, the back propagation algorithm works according to the following approach:

In practice, we start from random values of w and b and run a test using input data whose output value is known. The error is calculated as the difference between the known output value and the obtained one, and delta y is the product of the error and the derivative of Y with respect to A which, since Y is a sigmoid, is precisely Y*(1-Y) (as you may remember, I anticipated that the sigmoid would simplify things). At this point, simply calculate the new w and b by applying the formula above.
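The procedure just described can be sketched in Python for the one-input neuron. The update used here is the usual delta rule, w = w + epsilon * delta_y * x and b = b + epsilon * delta_y; this is an assumption of the sketch and may differ in notation from the article's own formula. The training sample, the starting values, and the name `train_step` are hypothetical.

```python
import math

def train_step(x, target, w, b, epsilon):
    """One correction step for a one-input sigmoid neuron.

    Assumed update (usual delta rule, not necessarily the article's exact formula):
        error   = target - Y
        delta_y = error * Y * (1 - Y)
        w <- w + epsilon * delta_y * x
        b <- b + epsilon * delta_y
    """
    y = 1.0 / (1.0 + math.exp(-(x * w + b)))
    error = target - y
    delta_y = error * y * (1.0 - y)
    return w + epsilon * delta_y * x, b + epsilon * delta_y, error

# Hypothetical sample (input 1.0, known output 0.9) and starting parameters
x, target = 1.0, 0.9
w, b = 0.5, 0.0
epsilon = 0.5

history = []
for _ in range(1000):
    w, b, error = train_step(x, target, w, b, epsilon)
    history.append(error)

print(f"first error: {history[0]:.4f}")
print(f"last error:  {history[-1]:.6f}")
print(f"learned w={w:.4f}, b={b:.4f}")
```

Each step nudges w and b in the direction that reduces the error, so the recorded error shrinks toward zero as the loop repeats.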

The value of epsilon, called the learning factor (or learning rate), is up to us. It is wise not to make it too small, because training then becomes very slow and the network may never reach a good solution in a reasonable time; making it too large, on the other hand, speeds up the training phase but also increases the risk of errors and strong oscillations. Epsilon is normally chosen between 0 and 1.
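The effect of epsilon can be observed with a small experiment. The sketch below trains the same one-input sigmoid neuron with three learning factors, again assuming the usual delta rule update; the value 50 is deliberately outside the 0 to 1 range to provoke an overshoot. All names and numbers are hypothetical.

```python
import math

def run_training(epsilon, steps, x=1.0, target=0.9, w=0.5, b=0.0):
    """Train a one-input sigmoid neuron on one sample.

    Returns the list of errors (target - Y) measured before each update.
    The delta rule update used here is an assumption of this sketch.
    """
    errors = []
    for _ in range(steps):
        y = 1.0 / (1.0 + math.exp(-(x * w + b)))
        error = target - y
        errors.append(error)
        delta_y = error * y * (1.0 - y)
        w += epsilon * delta_y * x
        b += epsilon * delta_y
    return errors

for epsilon in (0.001, 0.5, 50.0):
    errors = run_training(epsilon, steps=200)
    print(
        f"epsilon={epsilon}: first={errors[0]:+.4f}, "
        f"second={errors[1]:+.4f}, last={errors[-1]:+.4f}"
    )
```

With the smallest epsilon the error barely moves in 200 steps; with 0.5 it decreases steadily; with 50 the very first update jumps past the target, so the error changes sign, and the neuron then sits in the flat region of the sigmoid where corrections become tiny.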

In conclusion, for the moment let’s accept the procedure without asking too many questions. In the next articles we will go deeper into back propagation, trying to understand it through some practical experiments and describing the algorithm that can be used.








