Convolutional Neural Networks in a Nutshell

May 17, 2023

Traditional machine learning systems learn from large amounts of data and apply what they learn from those samples to find suitable solutions to different problems. For example, they can use clustering to find groupings in data based on similarities.

A convolutional neural network is based on a special architecture designed to perform efficient pattern recognition. The architecture of convolutional neural networks was proposed by Yann LeCun in the late 1980s.

Below, the experts at Grapherex have compiled the main characteristics of convolutional neural networks and described how they work.

What Are Convolutional Neural Networks?

Convolutional neural networks (CNNs) are machine learning models used in computer vision to recognise images and videos, segment them, and detect target objects.

CNNs are deep learning models with multiple hidden layers that allow them to learn complex representations for various tasks. Like other neural networks, CNNs are loosely inspired by the structure and function of the human brain.

Let’s look at what they do exactly. Image and video recognition focuses on classifying or recognising objects, scenes, or actions in photos or videos. Object detection aims to pinpoint specific items in an image or video. Image segmentation, in turn, involves dividing an image into individual segments or regions that are relevant for subsequent analysis.

How Do Convolutional Neural Networks Work?

CNNs consist of layers of nodes: an input layer, several hidden layers, and an output layer. Information flows from input to output, being transformed at each layer from its original form into a meaningful result.

As a CNN goes through layers of growing complexity, the network can detect larger portions of the image. The initial layers concentrate on basic features like colours and edges. As it moves through the successive layers, it gradually discerns more significant components and shapes of the object. Ultimately, it achieves recognition of the intended item.

Let’s say you upload an image of an animal. The neural network checks it layer by layer against the features it learned during training and outputs the result. For example, it decides that you have a picture of a cat. Let’s learn more about each of the layers.

The Input Layer

This is the first layer in the CNN. The neural network takes input data — an image or video — and sends it to the next layer. The first layer doesn’t process anything. The input data (any data you give the CNN to process) will be filtered by convolutional layers, and the results will be sent to further processing layers.
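To make "input data" concrete, here is a minimal sketch in plain Python of how a colour image might be represented before it reaches the first convolutional layer. The nested-list layout (height × width × channels) and the helper names `shape` and `channel` are illustrative assumptions, not part of any particular library.

```python
# A tiny 2x2 RGB image: height x width x 3 channels, values from 0 to 255.
image = [
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
]


def shape(image):
    """Return (height, width, channels) of a nested-list image."""
    return (len(image), len(image[0]), len(image[0][0]))


def channel(image, index):
    """Extract one colour channel (0 = R, 1 = G, 2 = B) as a 2-D grid."""
    return [[pixel[index] for pixel in row] for row in image]


image_shape = shape(image)  # (2, 2, 3)
red = channel(image, 0)     # [[255, 0], [0, 255]]
```

The input layer only hands these numbers on; the filtering starts in the next layer.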

The Convolutional Layer

This layer extracts key features, such as edges, corners, and shapes. To do this, it applies a set of filters or kernels. The convolutional layer is the main building block of the CNN, where most of the computation takes place.

It relies on a feature detector (also called a filter or kernel) that moves across the receptive fields of the image, checking whether the desired feature is present. This process is called convolution. The filter is a small matrix of weights, usually 3×3, and its size determines the size of the receptive field. At each position, the layer calculates a dot product between the filter and the input pixels it covers and writes the result into the output array.

The filter is then shifted by a fixed number of pixels (the stride), and the process repeats until the filter has covered the entire image. The end result of this series of dot products is called a feature map, activation map, or convolved feature. The first convolutional layer can be followed by more convolutional layers, which makes the CNN structure hierarchical. This is how the network recognises lower-level patterns (like pixels of a particular colour) and higher-level patterns (like a part of a car).
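The sliding dot product described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than a production implementation: it assumes a single-channel image, no padding, and a hand-picked filter (`vertical_edge`), whereas a real CNN learns its filter values during training. As in most deep learning libraries, the filter is applied without flipping it.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the filter and the patch it currently covers.
            total = sum(
                image[top + di][left + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# A 5x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# A hand-picked 3x3 filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
# [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The feature map is large where the filter sits on the edge and zero over the uniformly bright area, which is what "checking whether the desired feature is present" means in practice.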

The ReLU Layer

A convolution is a linear operation, so a stack of convolutions alone could only model linear relationships. That’s why a convolution operation is usually followed by a rectified linear unit (ReLU) transformation, which introduces non-linearity into the model.

ReLU is common in CNNs because it is cheap to compute and tends to improve network performance. It outputs the input value directly if it is positive and outputs zero otherwise.
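As a minimal sketch, ReLU applied to a feature map is a one-liner. The function name `relu` and the sample values are chosen here for illustration.

```python
def relu(feature_map):
    """Keep positive values as they are and replace negative values with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


activated = relu([[3, -2, 0], [-1, 5, -4]])
# [[3, 0, 0], [0, 5, 0]]
```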

The Pooling Layer or Downsampling

The feature maps produced by the convolutional layer are passed through a pooling layer, a step also known as downsampling. This reduces their dimensionality and, with it, the number of values that later layers have to process.

There are two common ways to do this: max pooling and average pooling. Max pooling, which is used more frequently, takes the maximum value in each section of the feature map as the output. Average pooling takes the average value of each section.

Even though we lose a lot of information, the layer still has many advantages. It helps reduce complexity, increase efficiency, and limit the risk of overfitting.
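Both pooling variants can be sketched with one small function. The example below assumes non-overlapping 2×2 sections and a single-channel feature map; the function name `pool2d` and its `mode` parameter are illustrative choices.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Downsample by summarising each non-overlapping size x size section."""
    if mode == "max":
        summarise = max
    elif mode == "average":
        summarise = lambda values: sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max' or 'average'")

    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + di][left + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(summarise(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 8, 4],
    [2, 0, 4, 4],
]

max_pooled = pool2d(feature_map, mode="max")      # [[6, 2], [2, 8]]
avg_pooled = pool2d(feature_map, mode="average")  # [[3.5, 1.0], [1.0, 5.0]]
```

A 4×4 map shrinks to 2×2, so the next layer sees a quarter of the values while the strongest responses survive.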

The Fully-Connected Layer

In partially connected layers, such as the convolutional layers, each output value depends only on a small region of the input, so the pixel values of the input image are not directly linked to the output layer. In a fully-connected layer, however, each node is directly linked to every node in the previous layer.

This layer performs the task of classification based on the features extracted by the previous layers and their various filters. The final FC layer typically uses a softmax activation function, which produces a probability between 0 and 1 for each class. This output is used for classification or prediction tasks.
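Here is a minimal sketch of a fully-connected layer followed by softmax. The weights, biases, and input features are made-up numbers for illustration; in a trained network they would be learned from data.

```python
import math


def dense(inputs, weights, biases):
    """Each output node is a weighted sum over every input node, plus a bias."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + bias
        for node_weights, bias in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


features = [1.0, 2.0, 0.5]     # flattened output of the earlier layers
weights = [                    # one row of weights per output node (class)
    [0.2, 0.8, -0.5],
    [0.5, -0.9, 0.3],
    [-0.3, 0.1, 0.9],
]
biases = [0.0, 0.1, -0.1]

scores = dense(features, weights, biases)
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
```

The class with the highest probability is taken as the prediction; here that is the first class, because its raw score is the largest.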

How CNN Categorises Pictures: A Step-by-Step Process

To step away from the technical explanation of how the network works, let’s look at a real-life example. Here’s how an image passes through each layer and what happens there.

The input layer receives an image of a dog. The image is typically in 3-channel (RGB) format.
The convolutional layer uses filters (feature detectors) to pick out edges, corners, and shapes.
The ReLU layer applies a non-linear activation to the convolutional layer’s output.
The pooling layer takes the maximum value in each section to reduce the dimensionality of the feature maps created by the convolutional layer.
The convolution, ReLU, and pooling steps are repeated to extract increasingly complex features from the input.
The fully connected layer receives the flattened output of the last pooling layer and applies a set of weights to output the result. Here, the CNN identifies whether the image is a dog or not.
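The steps above can be chained into one small end-to-end sketch. It is deliberately simplified: it uses a single-channel square image instead of RGB, runs only one round of convolution, ReLU, and pooling, and uses hand-picked (untrained) filter values and weights, so the "dog" label is purely illustrative. All function names are hypothetical.

```python
import math


def convolve2d(image, kernel):
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(size)
        ]
        for i in range(size)
    ]


def relu(fmap):
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    return [
        [
            max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def flatten(fmap):
    return [v for row in fmap for v in row]


def dense(inputs, weights, biases):
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    return [e / sum(exps) for e in exps]


def classify(image, kernel, weights, biases, labels):
    x = convolve2d(image, kernel)  # extract features
    x = relu(x)                    # add non-linearity
    x = max_pool(x)                # downsample
    x = flatten(x)                 # 2-D feature map -> 1-D vector
    return dict(zip(labels, softmax(dense(x, weights, biases))))


image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]    # 6x6, edge in the middle
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]     # vertical-edge filter
weights = [[0.5] * 4, [-0.5] * 4]                 # made-up, not trained
biases = [0.0, 0.0]

result = classify(image, kernel, weights, biases, ["dog", "not dog"])
```

With these made-up values the edge response dominates, so `result["dog"]` comes out close to 1.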

Each CNN is trained on a set of tagged or labelled images. It learns to minimise the error between predicted and actual labels. Once trained, the CNN can accurately classify new, unseen images.
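To illustrate what "minimising the error" means, the sketch below trains only a single softmax layer with plain gradient descent on a cross-entropy loss. This is a stand-in under stated assumptions: a real CNN uses backpropagation to update its convolutional filters as well, and the two-example dataset, learning rate, and function names are invented for illustration.

```python
import math


def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, features):
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)


def train_step(weights, features, true_index, learning_rate=0.1):
    """Update `weights` in place for one labelled example; return its loss."""
    probabilities = predict(weights, features)
    loss = -math.log(probabilities[true_index])  # cross-entropy error
    for k, row in enumerate(weights):
        target = 1.0 if k == true_index else 0.0
        error = probabilities[k] - target        # gradient w.r.t. score k
        for i, x in enumerate(features):
            row[i] -= learning_rate * error * x
    return loss


# Two labelled examples: (feature vector, index of the correct class).
dataset = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
weights = [[0.0, 0.0], [0.0, 0.0]]  # start with no knowledge

losses = []
for epoch in range(20):
    epoch_loss = sum(train_step(weights, f, label) for f, label in dataset)
    losses.append(epoch_loss / len(dataset))
```

The average loss recorded in `losses` shrinks from epoch to epoch as the weights move towards values that favour the correct labels.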

Types of CNNs

Several network architectures are commonly discussed alongside convolutional neural networks: traditional CNNs, recurrent networks, fully convolutional networks, and spatial transformer networks. Below, we’ll briefly discuss each of them.

Traditional CNNs

Traditional CNNs, also known as “vanilla” CNNs, consist of convolutional and pooling layers followed by fully connected layers. They are effective for image recognition tasks and have been widely used in computer vision. A classic example is the LeNet-5 architecture for handwritten digit recognition.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) process sequential data and handle inputs of varying lengths by keeping track of previous inputs (the context). Strictly speaking, they are a separate family of networks rather than a type of CNN, although the two can be combined. RNNs are usually used in natural language processing (NLP) tasks like text generation and language translation.
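The idea of "keeping track of previous inputs" can be sketched with a one-unit recurrent cell. This is a toy illustration with made-up scalar weights (`w_input`, `w_hidden`), not a trained model: the hidden state carries the context from one step to the next, and the same loop works for sequences of any length.

```python
import math


def rnn_final_state(sequence, w_input=0.5, w_hidden=0.8, bias=0.0):
    """Run a single-unit recurrent cell over `sequence`; return the last state."""
    hidden = 0.0  # the context starts out empty
    for x in sequence:
        # The new state mixes the current input with the previous state.
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
    return hidden


short_state = rnn_final_state([1.0, 0.0])
long_state = rnn_final_state([1.0, 0.0, 0.0, 0.0])
reordered_state = rnn_final_state([0.0, 1.0])
```

Because the state depends on everything seen so far, `[1.0, 0.0]` and `[0.0, 1.0]` end in different states even though they contain the same values.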

Fully Convolutional Networks

Fully Convolutional Networks (FCNs) are used for computer vision tasks such as image segmentation and object detection. They are built solely from convolutional layers, with no fully connected layers, which makes them computationally efficient and able to accept inputs of varying size. Like other CNNs, FCNs are trained using backpropagation, typically to assign a class to each pixel.
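The sketch below illustrates why a network made only of convolutions adapts to different input sizes: the same filter produces one output value per pixel, whatever the image dimensions. It is a toy stand-in for an FCN, assuming a single zero-padded 3×3 averaging filter and a fixed threshold instead of learned layers; `conv_same` and `segment` are hypothetical names.

```python
def conv_same(image, kernel):
    """Convolve with zero padding so the output has the same size as the input."""
    height, width = len(image), len(image[0])
    n = len(kernel)
    offset = n // 2

    def pixel(r, c):
        inside = 0 <= r < height and 0 <= c < width
        return image[r][c] if inside else 0  # zero padding outside the image

    return [
        [
            sum(pixel(i + di - offset, j + dj - offset) * kernel[di][dj]
                for di in range(n) for dj in range(n))
            for j in range(width)
        ]
        for i in range(height)
    ]


def segment(image, kernel, threshold=0.5):
    """Label each pixel 1 or 0 based on its convolution score."""
    scores = conv_same(image, kernel)
    return [[1 if s > threshold else 0 for s in row] for row in scores]


average_kernel = [[1 / 9] * 3 for _ in range(3)]

small = [[1] * 4 for _ in range(3)]  # 3x4 image
large = [[1] * 6 for _ in range(5)]  # 5x6 image

small_mask = segment(small, average_kernel)  # 3 rows of 4 labels
large_mask = segment(large, average_kernel)  # 5 rows of 6 labels
```

No layer here fixes the input size, which is what a fully connected layer would do.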

Spatial Transformer Networks

Spatial Transformer Networks (STNs) are used in computer vision tasks to improve the ability of a network to recognise patterns or objects regardless of their position, orientation, or scale. They apply learned spatial transformations to input images to enhance the network’s performance for specific tasks, including alignment, correction of perspective distortion, or spatial changes (flipping, rotating, or translating the image).
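The transformation step itself can be sketched as an affine warp. In an STN the transformation parameters are predicted by a small sub-network; in this simplified example they are hard-coded, and nearest-neighbour sampling is assumed for brevity. The function name `affine_warp` and the matrix layout are illustrative choices.

```python
def affine_warp(image, matrix):
    """Resample `image` by mapping each output pixel back to a source pixel.

    `matrix` is 2x3: [[a, b, t_row], [c, d, t_col]], so that
    source_row = a * row + b * col + t_row and
    source_col = c * row + d * col + t_col.
    """
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            src_row = round(matrix[0][0] * row + matrix[0][1] * col + matrix[0][2])
            src_col = round(matrix[1][0] * row + matrix[1][1] * col + matrix[1][2])
            # Pixels that map outside the source image stay 0.
            if 0 <= src_row < height and 0 <= src_col < width:
                output[row][col] = image[src_row][src_col]
    return output


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

flip_horizontal = [[1, 0, 0], [0, -1, 2]]  # source_col = 2 - col
shift_right = [[1, 0, 0], [0, 1, -1]]      # source_col = col - 1

flipped = affine_warp(image, flip_horizontal)
# [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
shifted = affine_warp(image, shift_right)
# [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
```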

Neural Networks and Computer Vision

As we mentioned, convolutional neural networks are used in computer vision tasks, where they apply convolutional layers to extract features from input data. Computer vision is a field of AI that focuses on enabling machines to interpret visual data and give recommendations based on the meaningful information they derive from images.

Computer vision is applied in social media marketing, healthcare (radiology technology), and e-commerce. Automobile producers are also applying it, most visibly in self-driving vehicles.