The Mathematics of Deep Learning

Published in DataSeries · 5 min read · Jan 20, 2020

Geoffrey Hinton, the "Godfather of Deep Learning," is a professor at the University of Toronto and a researcher at Google Brain. In 2018, he won the Turing Award for his work on artificial neural networks. He is also the Chief Scientific Advisor and a co-founder of the Vector Institute, one of the leading A.I. research institutes in the world. His esteemed status in the A.I. community stems from his research into self-organizing neural networks through a revolutionary procedure known as back-propagation.

Overview

Back-propagation is a machine learning procedure for networks of neuron-like units. It works by modifying the synapses of the network to develop an internal structure relevant to the task: the algorithm adjusts the weights of the network's connections so as to minimize the difference between the actual output and the desired output.

Internal or hidden units, which belong to neither the input nor the output, come to represent important features of the task. The patterns that enable learning emerge from the interactions of these hidden units. Back-propagation thus enables the creation of useful new features!

Representation of a simple deep neural network. The red lines indicate the back-propagation.

How it Works

Self-organizing neural networks come in an indefinite variety of forms across many learning tasks. They work by modifying the internal structure of the network. This is easily accomplished when the input units are directly connected to the outputs: to find the learning rule for the network, you iteratively adjust the strengths of the connections so as to gradually reduce the difference between the actual and desired output vectors.

Learning becomes much more difficult when hidden layers are added. The algorithm must somehow decide under what circumstances the hidden units should be active in order to achieve the desired input-output behavior.

In its simplest form, the algorithm applies to bottom-to-top networks, with hidden layers that can be bypassed if need be. Connections within the same layer, or connections going from top to bottom, are not allowed. The input is presented by setting the input units to particular states. The states of the hidden layers are then computed as a function of the connections from the previous, lower layer.

The total input x_j to unit j is a linear function of the outputs y_i of the units that connect to it, weighted by the connections w_ji:

x_j = Σ_i y_i w_ji

A bias is given to each unit by adding an extra input that is always on. Each unit has a real-valued state y_j, which is a non-linear function of its total input:

y_j = 1 / (1 + e^(−x_j))

Units within each layer take on their states in parallel, while the layers themselves are processed sequentially, working upward from the input.
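This layer-by-layer forward pass can be sketched in Python with NumPy. The layer sizes and random weights below are invented for illustration; the 1986 paper does not include code:

```python
import numpy as np

def sigmoid(x):
    # Non-linear squashing function: y = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def forward_layer(y_prev, W, b):
    # Total input x_j is a linear function of the lower layer's outputs
    # y_i, weighted by the connections w_ji, plus a bias term.
    x = W @ y_prev + b
    # The unit's real-valued state is a non-linear function of x.
    return sigmoid(x)

# Hypothetical 3-4-2 network: states are computed layer by layer, bottom up.
rng = np.random.default_rng(0)
y0 = np.array([0.5, 0.1, 0.9])                 # input unit states
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # input -> hidden weights
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # hidden -> output weights
y1 = forward_layer(y0, W1, b1)                 # hidden layer states
y2 = forward_layer(y1, W2, b2)                 # output layer states
```

Because the sigmoid is bounded between 0 and 1, every unit's state stays in that range no matter how large the total input grows.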

Any input-output function will suffice so long as it has a bounded derivative. Using a linear function to combine the inputs to a unit before applying the non-linearity greatly simplifies the learning algorithm.

As previously mentioned, we require weights that ensure, for each input, that the output produced is the same as, or close to, the desired output. This makes the algorithm a form of supervised learning. The error E is defined by comparing the actual and desired outputs over every case:

E = ½ Σ_c Σ_j (y_j,c − d_j,c)²

where c ranges over the input-output cases, j ranges over the output units, y_j,c is the actual state of output unit j, and d_j,c is its desired state.
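As a concrete check of the error formula, a short sketch in Python with NumPy (the outputs and targets below are invented values):

```python
import numpy as np

def total_error(y, d):
    # E = 1/2 * sum over cases c and output units j of (y_jc - d_jc)^2
    return 0.5 * np.sum((y - d) ** 2)

# Two hypothetical cases, two output units each.
y = np.array([[0.8, 0.2],
              [0.1, 0.9]])   # actual output states
d = np.array([[1.0, 0.0],
              [0.0, 1.0]])   # desired output states
E = total_error(y, d)
```

For these numbers the squared differences are 0.04, 0.04, 0.01, and 0.01, so E works out to 0.05.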

Gradient descent requires computing the partial derivative of E with respect to each weight, which is simply the sum of the partial derivatives over all the input-output cases. Back-propagation works by passing these derivatives from the top layer back down to the bottom. You first compute the derivative of the error with respect to each output:

∂E/∂y_j = y_j − d_j

Then apply the chain rule to obtain the derivative of E with respect to the total input:

∂E/∂x_j = ∂E/∂y_j · dy_j/dx_j

Then differentiate the non-linear function, which gives dy_j/dx_j = y_j(1 − y_j), and substitute:

∂E/∂x_j = ∂E/∂y_j · y_j(1 − y_j)

This tells us how a change in the total input x to an output unit affects the error. From there it is easy to determine how the error is affected by each weight:

∂E/∂w_ji = ∂E/∂x_j · y_i
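Taken together, these derivative steps for a single sigmoid output layer can be sketched in Python with NumPy. The states and targets below are invented for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical states for one training case.
y_prev = np.array([0.5, 0.1, 0.9])   # outputs y_i of the lower layer
y = np.array([0.7, 0.3])             # output-unit states y_j
d = np.array([1.0, 0.0])             # desired states d_j

dE_dy = y - d                        # dE/dy_j = y_j - d_j
dy_dx = y * (1.0 - y)                # sigmoid derivative: dy/dx = y(1 - y)
dE_dx = dE_dy * dy_dx                # chain rule: dE/dx_j = dE/dy_j * y_j(1 - y_j)
dE_dW = np.outer(dE_dx, y_prev)      # dE/dw_ji = dE/dx_j * y_i
```

Each row of dE_dW holds the gradient for one output unit's incoming weights, which is exactly what gradient descent subtracts (scaled by a learning rate) to reduce E.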

Since we can compute ∂E/∂x for every unit in a layer, we can continue this process downward through the layers. The computation maps naturally onto parallel hardware, in the spirit of parallel distributed processing.
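Carrying the derivatives down one layer uses the same chain rule: the error signal for a lower unit is the weighted sum of ∂E/∂x over the units above it. A minimal sketch with my own function and variable names, not the paper's:

```python
import numpy as np

def backward_layer(dE_dx, W, y_prev):
    # Gradient for this layer's weights: dE/dw_ji = dE/dx_j * y_i.
    dE_dW = np.outer(dE_dx, y_prev)
    # Error signal passed down: dE/dy_i = sum_j dE/dx_j * w_ji, ...
    dE_dy_prev = W.T @ dE_dx
    # ... then multiply by the lower layer's own y(1 - y) to get its dE/dx.
    dE_dx_prev = dE_dy_prev * y_prev * (1.0 - y_prev)
    return dE_dW, dE_dx_prev
```

Applying this function repeatedly, layer by layer from the top down, yields the gradient for every weight in the network in one backward sweep.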

This type of network is referred to as a feed-forward network, as activity propagates in a single direction through the network.

There are a few drawbacks to this type of network. For one, the error surface may contain local minima, so gradient descent does not guarantee finding the global minimum of the difference between the actual and desired output, which is what we are trying to reach. In practice this happens quite rarely, and these networks usually perform adequately. Furthermore, it is important to use only as many layers as are required to find the patterns, so as not to decrease efficiency and to keep the weight-space as small as possible.

This type of artificial network does not accurately model the learning our brains use, but given its success rate it may yet prove biologically plausible.

Conclusion

With this early research into back-propagation, Dr. Hinton developed a procedure that can be fully implemented and used to change the synaptic weights in the internal structure of an artificial network. This preliminary research led the way to Hinton's team achieving far better success rates on the ImageNet challenge in 2012. Using this technique, researchers throughout the world are seeing AI supremacy in many tasks, from computer vision to chess.

Sources:

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Nature, 1986.

Geoffrey Hinton, Wikipedia: https://en.wikipedia.org/wiki/Geoffrey_Hinton
