Explanation in the Simplest Way
The Perceptron is one of the simplest ANN architectures. It mainly consists of an Input Layer and an Output Layer. Perceptrons are linear models, and hence they are incapable of learning complex patterns. However, these limitations can be overcome by stacking multiple Perceptrons; the resulting ANN is called a Multi-Layer Perceptron (MLP).
The Neural Network we are going to discuss consists of an Input Layer, a Hidden Layer and a single Output Layer.
Mathematics Associated with ANN
First let us understand the variables: x1, x2, x3 are the inputs; w1, w2, w3, w4 are the weights associated with the branches (w1, w2, w3 feed the hidden layer and w4 feeds the output layer); z is the output of the hidden layer after application of the Activation Function (we will come to it later in the post); y’ is the predicted output.
Let's understand the mathematics associated with it:
z = f(x1*w1 + x2*w2 + x3*w3)
The inputs are multiplied by their respective weights, the products are summed, and an Activation Function f(x) is applied on top of the sum. There are various Activation Functions out there, some of them being ReLU, Leaky ReLU, PReLU, ELU, SELU, Sigmoid, tanh and Softmax (which I will discuss in some other post), but in most cases ReLU works well, so we will go with it. Thus the output “z” after application of the Activation Function will be
z = max(0, x1*w1 + x2*w2 + x3*w3) = x1*w1 + x2*w2 + x3*w3
Note: we have assumed that our inputs and initial weights are positive, so the weighted sum is positive and ReLU passes it through unchanged. So, at last we have
z = x1*w1 + x2*w2 + x3*w3
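To make the hidden-layer step concrete, here is a minimal Python sketch of it. The helper names (relu, hidden_output) and the numeric values are made up for illustration; they are not from any library or from the network above.

```python
def relu(v):
    """ReLU activation: keeps positive values, replaces negatives with 0."""
    return max(0.0, v)


def hidden_output(x, w):
    """Weighted sum of the inputs followed by ReLU (the 'z' of the post)."""
    weighted_sum = sum(xi * wi for xi, wi in zip(x, w))
    return relu(weighted_sum)


# Illustrative values only (assumed, all positive as in the note above)
x = [1.0, 2.0, 3.0]        # x1, x2, x3
w = [0.5, 0.25, 0.125]     # w1, w2, w3

z = hidden_output(x, w)
print(z)  # 1.375, the same as the plain weighted sum because it is positive

# With negative weights the weighted sum is negative and ReLU clips it to 0
z_clipped = hidden_output(x, [-0.5, -0.25, -0.125])
print(z_clipped)  # 0.0
```

The second call shows why the positivity assumption matters: without it, z is not simply the weighted sum.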
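Now the output of the hidden layer, i.e. “z”, will be multiplied by the final weight w4 of the output layer, and an Activation Function g(x) is applied to the result to give the predicted output: y’ = g(z*w4)
Note: Since we are working in the context of Binary Classification, we will take g(x) to be the Sigmoid Function.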
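A small sketch of this output step, assuming the prediction is the Sigmoid of z times w4. The function names and the values of z and w4 are illustrative assumptions.

```python
import math


def sigmoid(v):
    """Sigmoid activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def predict(z, w4):
    """Output layer: scale the hidden output by w4, then apply the Sigmoid."""
    return sigmoid(z * w4)


# Illustrative values only (assumed)
z = 1.375    # output of the hidden layer
w4 = 0.8     # weight of the output layer

y_pred = predict(z, w4)
print(y_pred)  # a value strictly between 0 and 1
```

Because the Sigmoid output always lies between 0 and 1, it can be read as a score for the positive class.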
The final predicted output “y’” will be a scalar value. Let us consider “y” to be the actual value, or the ground truth. We will then compute the Cost Function or Loss Function (let's take Mean Squared Error (MSE) just for simplicity; in reality it could be another Loss Function such as Binary Cross Entropy, which is more common for classification).
L = (1/n) * Σ (y’ - y)²
where “n” is the number of data points in the training sample and the sum runs over all of those data points
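As a quick check of the Loss Function, here is a minimal sketch of MSE averaged over the data points. The function name mse and the sample predictions and labels are assumptions chosen for illustration.

```python
def mse(y_pred, y_true):
    """Mean Squared Error: average of squared differences over n data points."""
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n


# Illustrative values only (assumed): three predictions and their ground truth
y_pred = [0.75, 0.25, 0.5]
y_true = [1.0, 0.0, 1.0]

loss = mse(y_pred, y_true)
print(loss)  # (0.0625 + 0.0625 + 0.25) / 3 = 0.125
```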
Once the Loss Function is calculated, we will then try to minimise it during Backpropagation by updating the previous weights using an Optimiser (again, let's say Gradient Descent just for simplicity; in reality we use more advanced Optimisers such as Adam, Nadam, etc.). The weight update formula is:
W(i+1) = W(i) - η * dL/dW(i)
where η (eta) is the learning rate, usually a very small value; W(i) is the value of a weight after the i-th update; and “dL/dW(i)” is the partial derivative of the Loss Function with respect to that weight, evaluated at its current value.
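The update rule can be sketched for our tiny network as follows. This is only an illustration under stated assumptions: a single data point (so n = 1), ReLU in the hidden layer, Sigmoid at the output, squared error as the loss, and made-up starting values. The gradients are written out by hand using the chain rule.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gradient_step(x, w, w4, y, eta):
    """One Gradient Descent update on a single data point (n = 1)."""
    # Forward pass
    s = sum(xi * wi for xi, wi in zip(x, w))   # weighted sum of inputs
    z = max(0.0, s)                            # ReLU
    y_pred = sigmoid(z * w4)                   # Sigmoid output
    loss = (y_pred - y) ** 2

    # Backward pass: chain rule for L = (y_pred - y)^2
    dL_dypred = 2.0 * (y_pred - y)
    dsigmoid = y_pred * (1.0 - y_pred)         # derivative of the Sigmoid
    delta = dL_dypred * dsigmoid               # dL / d(z * w4)
    dL_dw4 = delta * z
    relu_grad = 1.0 if s > 0 else 0.0          # derivative of ReLU
    dL_dw = [delta * w4 * relu_grad * xi for xi in x]

    # Update rule: W(i+1) = W(i) - eta * dL/dW(i)
    new_w = [wi - eta * gi for wi, gi in zip(w, dL_dw)]
    new_w4 = w4 - eta * dL_dw4
    return new_w, new_w4, loss


# Illustrative values only (assumed)
x = [1.0, 2.0, 3.0]
w, w4 = [0.5, 0.25, 0.125], 0.8
y = 1.0       # ground truth for a positive data point
eta = 0.1     # learning rate

losses = []
for _ in range(50):
    w, w4, loss = gradient_step(x, w, w4, y, eta)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss shrinks as the weights are updated
```

In practice a framework computes these derivatives automatically; writing them out once simply shows what the update rule is doing.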
After all the weights have been updated and we have achieved a desirable loss, we then write a simple if statement which checks whether the predicted output y’ is greater than a threshold value; if so, the output is +1 (a positive data point), otherwise -1 (a negative data point). The threshold value can be found by using the ROC (Receiver Operating Characteristic) curve.
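That final if statement might look like the sketch below. The function name classify and the default threshold of 0.5 are assumptions; as noted above, a better threshold would be chosen from the ROC curve.

```python
def classify(y_pred, threshold=0.5):
    """Map a predicted score to +1 (positive) or -1 (negative)."""
    if y_pred > threshold:
        return 1
    return -1


# Illustrative scores only (assumed)
print(classify(0.75))                 # 1
print(classify(0.25))                 # -1
print(classify(0.75, threshold=0.9))  # -1, a stricter threshold flips the label
```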
Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron