We can better understand it from the following example:
Suppose we have a machine that converts kilometres to miles.
However, we don't have the formula for the conversion. We only know that the relationship between the two values is linear, which means that if we double the kilometres, the miles also double.
The formula is presented this way:
Miles = Kilometres * C
Here, C is a constant, and we don’t know the exact value of the constant.
As a clue, we have some known true values. The truth table is given below:
We are now going to use some random value of C and determine the result.
We start with C = 0.5 and a distance of 100 kilometres, which gives 50 miles. According to the truth table, the value should be 62.137, so we calculate the error as follows:
error = truth - calculated
= 62.137 - 50
= 12.137
In the same manner, we can see the result in the image below:
Now, we have an error of 12.137. As previously discussed, the relationship between miles and kilometres is linear. So, if we increase the value of the constant C, we should get a smaller error.
This time, we just change the value of C from 0.5 to 0.6 and reach the error value of 2.137, as shown in the image below:
Our error has now improved from 12.137 to 2.137. We can try to improve it further with more guesses for C. Next, we change C from 0.6 to 0.7, and the error becomes -7.863.
This time the calculated value overshoots the true value, so the error turns negative. We have gone past the point of minimum error. Comparing magnitudes, C = 0.6 (error = 2.137) was a better guess than C = 0.7 (error = -7.863).
So why not change C in smaller steps? The size of that step is what we call the learning rate. This time we change C from 0.6 to 0.61 rather than to 0.7.
C = 0.61 gives a smaller error of 1.137, which is better than C = 0.6 (error = 2.137).
With C = 0.61, the calculated value is only 1.137 away from the correct value of 62.137.
This is the idea behind the gradient descent algorithm: it adjusts the value step by step to find the minimum error.
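The manual guesses above can be replayed in a few lines of Python. This is only a sketch of the arithmetic in the example. The names KILOMETRES, TRUE_MILES and error_for are made up for illustration and do not come from the original program.

```python
# Known reference point from the example: 100 km is 62.137 miles.
KILOMETRES = 100
TRUE_MILES = 62.137


def error_for(c):
    """Return truth minus the value calculated with the constant c."""
    return TRUE_MILES - KILOMETRES * c


# The four guesses tried in the text, in the same order.
for c in (0.5, 0.6, 0.7, 0.61):
    print(f"C = {c}: error = {error_for(c):.3f}")
# The errors are 12.137, 2.137, -7.863 and 1.137.
```

The sign of the error tells us in which direction to move C, and its size tells us how far we are from the true value.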
Let us convert the above scenario into a Python program. We initialize all the variables the program requires, and we define the method kilo_mile, which takes the constant C as a parameter.
In the code below, we define only the stop conditions and the maximum number of iterations. As mentioned, the code stops either when the maximum number of iterations has been reached or when the error becomes smaller than the precision. As a result, the constant automatically settles at about 0.6213, which leaves only a minor error. Gradient descent works in the same way.
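One way such a program might look is sketched here. Only the name kilo_mile and the two stop conditions come from the text. The helper fit_constant, the starting value, the learning rate, the precision and the iteration limit are assumptions chosen for illustration, and the update rule is the gradient of the squared error with respect to C.

```python
# Known reference point from the example: 100 km is 62.137 miles.
KILOMETRES = 100.0
TRUE_MILES = 62.137


def kilo_mile(c):
    """Convert the fixed distance to miles using the constant c."""
    return KILOMETRES * c


def fit_constant(c=0.5, learning_rate=1e-5, precision=1e-3, max_iterations=1000):
    """Adjust c until the error is within precision (all defaults are assumed)."""
    iterations = 0
    error = TRUE_MILES - kilo_mile(c)
    # Stop when the error is small enough or the iteration budget is used up.
    while abs(error) >= precision and iterations < max_iterations:
        # Gradient of 0.5 * error**2 with respect to c is -error * KILOMETRES,
        # so stepping against the gradient adds learning_rate * error * KILOMETRES.
        c += learning_rate * error * KILOMETRES
        error = TRUE_MILES - kilo_mile(c)
        iterations += 1
    return c, error, iterations


c, error, iterations = fit_constant()
print(f"C = {c:.5f}, error = {error:.5f}, iterations = {iterations}")
```

With these assumed settings, each step shrinks the error by a constant factor, so the loop ends on the precision condition long before the iteration limit is reached.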
Gradient Descent in Python
We import the required packages along with the Sklearn built-in datasets. Then we set the learning rate and the number of iterations, as shown in the image below:
The image above shows the sigmoid function. Now we express it in mathematical form, as shown in the image below. We also load the Sklearn built-in dataset, which has two features and two centers.
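A minimal sketch of this setup is given here. The values of learning_rate and iterations are placeholders, since the original values are not stated in the text, and the function name sigmoid is assumed.

```python
import numpy as np

# Hyperparameters (placeholder values, not taken from the original code).
learning_rate = 0.01
iterations = 100


def sigmoid(x):
    """Sigmoid activation: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


print(sigmoid(0.0))  # 0.5
```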
Now we can inspect the values of X and its shape. The shape shows that X has 1000 rows and two columns, as we set before.
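The text does not name the dataset function. A likely candidate, given "two features and two centers", is make_blobs from sklearn.datasets, so the sketch below assumes it. The random_state value is also an assumption, added only to make the output repeatable.

```python
from sklearn.datasets import make_blobs

# 1000 samples with two features, drawn around two cluster centres.
# random_state is an arbitrary choice so that runs are repeatable.
X, y = make_blobs(n_samples=1000, n_features=2, centers=2, random_state=42)

print(X[:5])     # first five rows of the feature matrix
print(X.shape)   # (1000, 2)
print(y.shape)   # (1000,)
```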
We append a column of ones to X so that the bias can be learned as a trainable weight, as shown below. The shape of X is now 1000 rows and three columns.
We also reshape y so that it has 1000 rows and one column, as shown below:
We also define the weight matrix, taking its size from the shape of X, as shown below:
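These three preparation steps could be written as follows. To keep the sketch self-contained, it uses random stand-in arrays with the same shapes as the dataset in the text. The small random initialization of the weights is an assumption, because the text only says that the weight matrix is sized from the shape of X.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data with the same shapes as in the text (values are random).
X = rng.normal(size=(1000, 2))
y = rng.integers(0, 2, size=1000)

# Append a column of ones so the bias is learned as one more weight.
X = np.c_[X, np.ones(X.shape[0])]

# Turn y into a column vector so it lines up with the predictions.
y = y.reshape(-1, 1)

# One weight per column of X: two features plus the bias (assumed init).
W = rng.normal(scale=0.01, size=(X.shape[1], 1))

print(X.shape, y.shape, W.shape)  # (1000, 3) (1000, 1) (3, 1)
```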
Next, we define the derivative of the sigmoid. This function assumes that its input has already been passed through the sigmoid activation function shown earlier.
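Because the derivative of the sigmoid can be written as s * (1 - s), where s is the sigmoid output, the function only needs the activated value. A short sketch, with the assumed names sigmoid and sigmoid_deriv:

```python
import numpy as np


def sigmoid(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_deriv(s):
    """Derivative of the sigmoid, where s is already sigmoid(x)."""
    return s * (1.0 - s)


# The slope is steepest at x = 0, where sigmoid(0) = 0.5.
print(sigmoid_deriv(sigmoid(0.0)))  # 0.25
```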
Then we loop until we reach the number of iterations that we set earlier. In each iteration, we compute the predictions by passing the inputs through the sigmoid activation function, calculate the error, and calculate the gradient used to update the weights, as shown below in the code. We also save the loss at every epoch to a history list so that we can display the loss graph.
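Putting the pieces together, the training loop might look like the sketch below. It assumes make_blobs as the dataset, a mean squared error loss, a gradient averaged over the samples, and placeholder hyperparameters. None of these details are confirmed by the text, so treat it as an illustration of the steps rather than the original code.

```python
import numpy as np
from sklearn.datasets import make_blobs


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_deriv(s):
    # s is expected to be a sigmoid output already.
    return s * (1.0 - s)


learning_rate = 0.01  # assumed value
iterations = 100      # assumed value

# Assumed dataset: 1000 samples, two features, two centres.
X, y = make_blobs(n_samples=1000, n_features=2, centers=2, random_state=42)
X = np.c_[X, np.ones(X.shape[0])]  # bias column
y = y.reshape(-1, 1)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(X.shape[1], 1))  # assumed small random init

loss_history = []
for epoch in range(iterations):
    predictions = sigmoid(X @ W)   # forward pass
    error = predictions - y        # difference from the labels
    loss = np.mean(error ** 2)     # mean squared error for this epoch
    loss_history.append(loss)

    # Chain rule: d(loss)/dW is proportional to X^T (error * sigmoid').
    # The constant factor 2 is folded into the learning rate.
    gradient = X.T @ (error * sigmoid_deriv(predictions)) / len(X)
    W -= learning_rate * gradient  # step against the gradient

    if (epoch + 1) % 10 == 0:
        print(f"epoch {epoch + 1}: loss = {loss:.4f}")
```

The loss_history list can then be passed to a plotting library to draw the loss curve.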
Now we can see the loss at every epoch, and it is decreasing.
The value of the error keeps reducing as the training proceeds. This is the gradient descent algorithm at work.