How a CNN (Convolutional Neural Network) Works


Convolutional Neural Network

A convolutional neural network (abbreviated CNN or ConvNet) is a type of feed-forward neural network inspired by the organization of the visual cortex.

As we will see later, a CNN consists of multiple stages and, as in the visual cortex, each stage specializes in a different task. Without going into detail, the human brain simplifies visual information so that we can recognize objects, and our neural network will do the same. For instance, one of the intermediate stages in the brain specializes in extracting the shapes or characteristics of the image being viewed, and we will find the same thing in CNNs.

In general, a convolutional neural network operates like any other feed-forward network. It consists of an input block, one or more hidden layers that perform calculations using activation functions (for example, ReLU), and an output block that performs the actual classification. The difference from classic feed-forward networks lies in the presence of convolutional layers.

So, what role do these convolutional layers play in the process?

Convolutional layers do a crucial job: they extract features from images using filters.

Unlike a traditional feed-forward network, which processes the “general information of the image”, a CNN classifies the image based on specific characteristics. Depending on the type of filter used, different features can be identified in the reference image, such as the contours of shapes or vertical, horizontal, and diagonal lines.

Let’s simplify the concept.

Compared to a simple feed-forward network, a CNN works on more specific information, which makes it more efficient. The operation of a CNN can be represented as:


Furthermore, considering that the ReLU function is an integral part of the Conv layer, we can reduce the CNN to the following schema:

Let’s understand the role of each block.

Typically, each convolutional layer is followed by a Max-Pooling layer, which gradually reduces the matrix size while increasing the level of “abstraction”. The network moves from elementary filters, such as vertical and horizontal lines, to more sophisticated filters capable of recognizing features like headlights and windshields, until the last level, where it can distinguish a car from a truck.

The input layer consists of a sequence of neurons that receive the image’s information, in the form of a data vector representing the image pixels. For example, for a 32 x 32 color image, the input vector will have a length of 32 x 32 x 3 = 3,072: for every pixel in the 32 x 32 image, there are three values, one for each color channel in RGB format (Red, Green, and Blue).
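The arithmetic above can be checked with a minimal NumPy sketch. The image here is random data standing in for a real picture (an assumption made purely for illustration), and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical 32 x 32 RGB image filled with random pixel values in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Flatten into the input vector: one value per pixel per color channel.
input_vector = image.reshape(-1)

print(image.shape)         # (32, 32, 3)
print(input_vector.shape)  # (3072,)
```

Note that many deep learning frameworks keep the (height, width, channels) layout when feeding convolutional layers and flatten only before the fully connected part.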

The Convolutional Layer (Conv) is the main layer of the network. Its goal is to detect patterns in the image, such as curves, angles, circles, or squares. A layer can have multiple filters, and the more filters it has, the more distinct features it can detect. But how does the convolutional layer work? In practice, a digital filter (a small mask) slides over different positions of the input image; for each position, an output value is generated by computing the dot product between the mask and the covered portion of the input (both treated as vectors).

In the example in the figure, the filter is represented by a 3×3 matrix, so the sliding window covers only a 3×3 portion of the input image at a time, and that portion is multiplied by the convolution filter.

The following is the result with an example filter:

In conclusion, the image will be scanned piece by piece, resulting in a smaller matrix of values that represents the “characterized image”.

Windowing process
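The sliding-window process described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming no padding and a stride of 1; the 5 x 5 image and the vertical-edge filter are made-up examples, not the ones shown in the article's figures, and `convolve2d_valid` is a hypothetical helper name.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1), dtype=float)
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            # Dot product between the mask and the covered patch, both flattened.
            out[r, c] = np.dot(patch.ravel(), kernel.ravel())
    return out

# Hypothetical 5 x 5 grayscale image: dark on the left, bright on the right.
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# Illustrative vertical-edge filter (an assumption, not the filter from the figure).
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = convolve2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3): smaller than the 5 x 5 input
print(feature_map)        # rows of [3. 3. 0.]: strong response where the edge is
```

With a 3×3 filter and no padding, a 5 x 5 input shrinks to 3 x 3, which is the "smaller matrix" mentioned above. Strictly speaking, this operation does not flip the filter, so it is a cross-correlation; that is the operation deep learning libraries usually call convolution.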

The ReLU Layer (Rectified Linear Unit) is placed after the convolutional layers. It sets the negative values obtained in the previous layers to zero and passes positive values through unchanged.
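As a minimal sketch, ReLU can be written as an element-wise maximum with zero. The input values below are invented for illustration.

```python
import numpy as np

def relu(x):
    """Replace every negative value with zero; leave the rest unchanged."""
    return np.maximum(x, 0)

# Hypothetical output of a convolutional layer.
feature_map = np.array([
    [3.0, -1.0, 0.0],
    [-2.5, 4.0, -0.5],
])

activated = relu(feature_map)
print(activated)
# [[3. 0. 0.]
#  [0. 4. 0.]]
```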

The Pooling Layer identifies whether the studied feature is present in the previous layer and coarsens the image while retaining the feature detected by the convolutional layer. In other words, the pooling layer aggregates information, generating smaller feature maps.
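A minimal sketch of max pooling, assuming non-overlapping 2 x 2 windows (a common choice, though the article does not specify a window size). The helper name `max_pool2d` and the sample values are hypothetical.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop any rows/columns that do not fill a complete block.
    trimmed = feature_map[:h_out * size, :w_out * size]
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

# Hypothetical 4 x 4 feature map.
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)
# [[4. 2.]
#  [2. 7.]]
```

The 4 x 4 map becomes 2 x 2: each output value records only whether (and how strongly) the feature appeared somewhere in the corresponding block.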

The FC Layer (Fully Connected) is the layer that actually classifies the images.
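A minimal sketch of what this final step can look like, under several assumptions: the pooled features are random stand-ins, the weights are random and untrained (so the prediction is meaningless), and a softmax is used to turn scores into probabilities, which is a common choice that the article itself does not mention. The class names reuse the car/truck example from earlier.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fully_connected(features, weights, bias):
    """Each output score is a weighted sum of all input features."""
    return weights @ features + bias

rng = np.random.default_rng(0)

pooled = rng.random((2, 2))    # hypothetical pooled feature map
features = pooled.reshape(-1)  # flatten before the fully connected layer

classes = ["car", "truck"]
# Untrained, random parameters: for illustration only.
weights = rng.normal(size=(len(classes), features.size))
bias = np.zeros(len(classes))

probs = softmax(fully_connected(features, weights, bias))
predicted = classes[int(np.argmax(probs))]
print(probs, predicted)
```

In a trained network, the weights and bias would be learned from labeled images rather than drawn at random.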
