In this final video on intuition for backprop, let's take a look at how the computation graph works on a larger neural network example. Here's the network we will use, with a single hidden layer with a single hidden unit that outputs a1, which feeds into the output layer that outputs the final prediction a2. To make the math more tractable, I'm going to continue to use just a single training example with inputs x = 1, y = 5, and these will be the parameters of the network. Throughout, we're going to use the ReLU activation function g(z) = max(0, z). Forward prop in this network looks like this. As usual, a1 = g(w1 * x + b1), and it turns out w1 * x + b1 will be positive, so we're in the max(0, z) = z part of the activation function. So a1 is just equal to w1 * x + b1, which is 2 times 1 (that's w1 times x) plus 0 (that's b1), which is equal to 2. Then similarly, a2 = g(w2 * a1 + b2), which is just w2 times a1 plus b2, again because we're in the positive part of the ReLU activation function. That's 3 times 2 plus 1, which equals 7. Finally, we'll use the squared error cost function, so J(w, b) = 1/2 (a2 - y)^2 = 1/2 (7 - 5)^2, which is 1/2 of 2 squared, which is just equal to 2. So let's take this calculation that we just did and write it down in the form of a computation graph. To carry out the computation step by step, the first thing we need to do is take w1 and multiply it by x. So we have w1 feeding into a computation node that computes w1 times x, and I'm going to call this a temporary variable t1. Next, we compute z1, which is t1 + b1, so we also have this input b1 over here. Then a1 = g(z1): we apply the activation function, and we end up, again, with this value here, 2. Next, we have to compute t2, which is w2 times a1, and with w2 = 3 that gives us this value, which is 6. Then z2, which is this quantity here: we add b2 to t2, and that gives us 7. And finally, we apply the activation function g.
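The forward-prop steps above can be sketched in a few lines of Python. This is a minimal sketch of the lecture's worked example; the intermediate names t1, z1, a1, t2, z2, a2 follow the nodes of the computation graph.

```python
# Forward prop through the computation graph, one node at a time.
# Values x = 1, y = 5, w1 = 2, b1 = 0, w2 = 3, b2 = 1 are from the lecture.

def relu(z):
    """ReLU activation: g(z) = max(0, z)."""
    return max(0.0, z)

def forward(x, y, w1, b1, w2, b2):
    t1 = w1 * x               # first multiplication node
    z1 = t1 + b1              # add the hidden-layer bias
    a1 = relu(z1)             # hidden-layer activation
    t2 = w2 * a1              # second multiplication node
    z2 = t2 + b2              # add the output-layer bias
    a2 = relu(z2)             # output activation: the prediction
    J = 0.5 * (a2 - y) ** 2   # squared error cost
    return t1, z1, a1, t2, z2, a2, J

print(forward(1.0, 5.0, 2.0, 0.0, 3.0, 1.0))
# (2.0, 2.0, 2.0, 6.0, 7.0, 7.0, 2.0)
```

Running it reproduces the numbers from the video: a1 = 2, a2 = 7, and a cost J = 2.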
We still end up with 7. And lastly, J is 1/2 (a2 - y)^2, which gives us 2, which was this cost function here. So this is how you take the step-by-step computations for a larger neural network and write them in a computation graph. You've already seen, in the last video, the mechanics of how to carry out backprop, so I'm not going to go through the step-by-step calculations here. But if you were to carry out backprop, the first thing you'd do is ask: what is the derivative of the cost function J with respect to a2? If you calculate that, it turns out to be 2, so we'll fill that in here. The next step is to ask: what's the derivative of the cost J with respect to z2? Using the derivative we computed previously, you can figure out that this also turns out to be 2, because if z2 goes up by epsilon, you can show that, for the current setting of all the parameters, a2 will go up by epsilon, and therefore J will go up by 2 times epsilon. So this derivative is equal to 2, and so on, step by step. We can then find that the derivative of J with respect to b2 is also equal to 2, the derivative with respect to t2 is equal to 2, and so on and so forth, until eventually you've computed the derivative of J with respect to all the parameters w1, b1, w2, and b2. And that's backprop. Again, I didn't go through every single mechanical step of backprop, but it's basically the process you saw in the previous video. Let me just double-check one of these examples. We saw here that the derivative of J with respect to w1 is equal to 6. What this is predicting is that if w1 goes up by epsilon, J should go up by roughly 6 times epsilon. Let's step through the math and see if that really is true. These are the calculations that we did, again. If w1, which was 2, were to go up by epsilon to 2.001, then a1 becomes, let's see, instead of 2, 2.001 as well. So a2 is 3 times 2.001 plus 1, which gives us 7.003.
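The step-by-step backprop pass that the video skips over can be written out explicitly. This is a sketch that walks the same computation graph right to left, applying the chain rule at each node; the forward values (a1 = 2, a2 = 7) and the fact that both ReLU units are in their positive, slope-1 region come from the lecture.

```python
# Reverse pass through the computation graph, one chain-rule step per node.
x, y = 1.0, 5.0
w1, b1, w2, b2 = 2.0, 0.0, 3.0, 1.0
a1, a2 = 2.0, 7.0            # forward-prop values from the lecture

dJ_da2 = a2 - y              # J = 1/2 (a2 - y)^2, so dJ/da2 = 2
dJ_dz2 = dJ_da2 * 1.0        # ReLU is active, da2/dz2 = 1     -> 2
dJ_db2 = dJ_dz2              # z2 = t2 + b2                    -> 2
dJ_dt2 = dJ_dz2              # z2 = t2 + b2                    -> 2
dJ_dw2 = dJ_dt2 * a1         # t2 = w2 * a1                    -> 4
dJ_da1 = dJ_dt2 * w2         # t2 = w2 * a1                    -> 6
dJ_dz1 = dJ_da1 * 1.0        # ReLU is active again            -> 6
dJ_db1 = dJ_dz1              # z1 = t1 + b1                    -> 6
dJ_dt1 = dJ_dz1              # z1 = t1 + b1                    -> 6
dJ_dw1 = dJ_dt1 * x          # t1 = w1 * x                     -> 6

print(dJ_dw1, dJ_db1, dJ_dw2, dJ_db2)   # 6.0 6.0 4.0 2.0
```

Notice that each parameter's derivative reuses quantities already computed to its right in the graph, which is exactly why one backward sweep suffices.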
And if a2 is 7.003, then J becomes 1/2 (7.003 - 5)^2, which is 2.003 squared over 2, which turns out to be equal to 2.0060045. Ignoring some of the extra digits, you see from this little calculation that if w1 goes up by 0.001, J has gone up from 2 to roughly 2.006, so 6 times as much. And so the derivative of J with respect to w1 is indeed equal to 6. The backprop procedure thus gives you a very efficient way to compute all of these derivatives, which you can then feed into the gradient descent algorithm, or the Adam optimization algorithm, to train the parameters of your neural network. Again, the reason we use backprop for this is that it's a very efficient way to compute all the derivatives of J: with respect to w1, with respect to b1, with respect to w2, and with respect to b2. I did just illustrate how we could bump w1 up by a little bit and see how much J changes, but that was a left-to-right calculation, and we would have to repeat it for each parameter, one parameter at a time: increase w1 by 0.001 to see how that changes J, increase b1 by a little bit to see how that changes J, and so on for every parameter, one at a time. That becomes a very inefficient calculation: if you have N nodes in your computation graph and P parameters, this procedure ends up taking roughly N times P steps. Whereas with backprop we got all four of these derivatives in N + P, rather than N times P, steps. That makes a huge difference in practical neural networks, where the number of nodes and the number of parameters can be really large. So, that's the end of the videos for this week. Thanks for sticking with me through the end of these optional videos. And I hope that you now have an intuition for, when you use a programming framework like TensorFlow to train a neural network,
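The left-to-right check described above can be automated as a finite-difference estimate. This is a sketch of that check, not part of training: bumping w1 by epsilon = 0.001 and re-running forward prop should move J from 2 to roughly 2.006.

```python
# Finite-difference check of dJ/dw1, exactly as in the lecture:
# bump w1 by epsilon, rerun forward prop, and compare costs.

def relu(z):
    return max(0.0, z)

def cost(w1, b1, w2, b2, x=1.0, y=5.0):
    a1 = relu(w1 * x + b1)
    a2 = relu(w2 * a1 + b2)
    return 0.5 * (a2 - y) ** 2

eps = 0.001
J0 = cost(2.0, 0.0, 3.0, 1.0)         # 2.0
J1 = cost(2.0 + eps, 0.0, 3.0, 1.0)   # ~2.0060045

print((J1 - J0) / eps)   # close to 6, matching backprop's dJ/dw1
```

Note that this one call pair only checks a single parameter; repeating it for every parameter is the N times P procedure the lecture warns about, which is why it is used only as a sanity check, not as the training algorithm.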
What's actually happening under the hood, and how it's using the computation graph to efficiently compute derivatives for you. Many years ago, before the rise of frameworks like TensorFlow and PyTorch, researchers had to manually use calculus to compute the derivatives of the neural networks they wanted to train: they would write down the neural network by hand, use calculus to derive the derivatives, and then implement a bunch of equations that they had laboriously derived on paper in order to implement backprop. In modern programming frameworks, you can instead specify forward prop and have the framework take care of backprop for you. Thanks to the computation graph and these techniques for automatically carrying out derivative calculations, sometimes called autodiff, for automatic differentiation, this process of researchers manually using calculus to take derivatives is no longer really done. At least, I've not had to do this for many years now myself, because of autodiff. So, many years ago, the bar for the amount of calculus you had to know in order to use neural networks used to be higher. But because of automatic differentiation algorithms, usually based on the computation graph, you can now implement a neural network and have the derivatives computed for you more easily than before. So maybe, with the maturing of neural networks, the amount of calculus you need to know in order to get these algorithms to work has actually gone down, and that's been encouraging for a lot of people. And so, that's it for the videos for this week. I hope you enjoy the labs, and I look forward to seeing you next week.
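To make the autodiff idea concrete, here is a toy sketch of what such frameworks do under the hood. The class and method names here are purely illustrative, not any real framework's API: each operation records its inputs and its local derivative as forward prop runs, building the computation graph, and backward() then sweeps the graph once from right to left.

```python
# Toy reverse-mode autodiff: the graph is built during forward prop,
# then one backward sweep computes every dJ/d(parameter).

class Value:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # upstream nodes in the graph
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def relu(self):
        return Value(max(0.0, self.data), (self,),
                     (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Topologically order the graph, then sweep once right to left,
        # so each node's gradient is complete before being propagated.
        topo, visited = [], set()
        def build(v):
            if id(v) not in visited:
                visited.add(id(v))
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local

# The lecture's network: x = 1, y = 5, w1 = 2, b1 = 0, w2 = 3, b2 = 1.
x = Value(1.0)
w1, b1 = Value(2.0), Value(0.0)
w2, b2 = Value(3.0), Value(1.0)
a1 = (w1 * x + b1).relu()
a2 = (w2 * a1 + b2).relu()
diff = a2 - 5.0
J = diff * diff * 0.5
J.backward()

print(w1.grad, b1.grad, w2.grad, b2.grad)   # 6.0 6.0 4.0 2.0
```

The single backward sweep visits each node once, which is the N + P behavior from the lecture, and it recovers the same derivative dJ/dw1 = 6 that we verified by hand.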