You've seen how in TensorFlow you can specify a neural network architecture to compute the output y as a function of the input x, and also specify a cost function, and TensorFlow will then automatically use backpropagation to compute derivatives and use gradient descent or Adam to train the parameters of the neural network. The backpropagation algorithm, which computes derivatives of your cost function with respect to the parameters, is a key algorithm in neural network learning. But how does it actually work? In this and the next few optional videos, we'll take a look at how backpropagation computes derivatives. These videos are completely optional, and they do go a little bit into calculus. If you're already familiar with calculus, I hope you enjoy these videos, but if not, that's totally fine. We'll build up from the very basics of calculus to make sure you have all the intuition you need to understand how backpropagation works. Let's take a look.

I'm going to use a simplified cost function, J of w equals w squared. Normally the cost function is a function of the parameters w and b, but for this simplified example, let's just say J of w equals w squared and ignore b. Let's say the value of the parameter w is equal to 3, so J of w is equal to 9, which is 3 squared. Now, if we were to increase w by a tiny amount, say Epsilon, which I'm going to set to 0.001, how does the value of J of w change? If we increase w by 0.001, then w becomes 3 plus 0.001, which is 3.001, and J of w, which is w squared, is now 3.001 squared, which is 9.006001. What we see is that if w goes up by 0.001 (I'm going to use this up arrow here to denote that w goes up by 0.001, where 0.001 is this small value Epsilon), then J of w goes up by roughly 6 times as much, 6 times 0.001. This isn't exact: J actually goes up not to 9.006 but to 9.006001. It turns out, though, that if Epsilon were infinitesimally small, and by infinitesimally small I mean very, very small (Epsilon here is pretty small, but not infinitesimally small; think 0.0000, lots of zeros, followed by a 1), then this approximation becomes more and more accurate. In this example, what we see is that if w goes up by Epsilon, then J goes up by roughly 6 times Epsilon. In calculus, we would say that the derivative of J of w with respect to w is equal to 6. All this means is that if w goes up by a tiny little amount, J of w goes up six times as much.

What if Epsilon were to take on a different value? What if Epsilon were 0.002? In this case, w would be 3 plus 0.002, and w squared becomes 3.002 squared, which is 9.012004. What we conclude is that if w goes up by 0.002, then J of w goes up by roughly 6 times 0.002: it goes up to roughly 9.012, and this 0.012 is roughly 6 times 0.002. That, again, is a little bit off; there's an extra 0.000004 here, because Epsilon is not quite infinitesimally small. Once again, we see this six-to-one ratio between how much w goes up and how much J of w goes up. That's why the derivative of J of w with respect to w is equal to six, and the smaller Epsilon is, the more accurate this becomes. By the way, feel free to pause the video and try this calculation yourself with other values of Epsilon. The key is that so long as Epsilon is pretty small, the ratio of how much J of w goes up to how much w goes up should be 6 to 1. Feel free to try it out yourself with other values of Epsilon and check that this really holds true.
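If you'd rather try that check in a few lines of Python than on a calculator, here is a minimal sketch, my own illustration rather than code from the video, with the function name J and the value of epsilon chosen just for this example:

```python
# A minimal sketch (my own illustration, not code from the video): check
# that J(w) = w**2 rises by roughly 6 * epsilon when w rises from 3.

def J(w):
    return w ** 2

w = 3
epsilon = 0.001

increase_in_J = J(w + epsilon) - J(w)   # 9.006001 - 9 = 0.006001
ratio = increase_in_J / epsilon         # roughly 6

print(increase_in_J, ratio)             # about 0.006001 and 6.001
# Try epsilon = 0.002 or even smaller values: the ratio stays close to 6,
# and it gets closer to exactly 6 as epsilon shrinks.
```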
This leads us to an informal definition of the derivative: whenever w goes up by a tiny amount Epsilon, that causes J of w to go up by k times Epsilon for some number k. In our example just now, k was equal to six. Then we say that the derivative of J of w with respect to w is equal to k, which was 6 in the example just now. You might remember that when implementing gradient descent, you repeatedly use this rule to update the parameter w_j: w_j is updated to w_j minus Alpha times the derivative of J with respect to w_j, where, as usual, Alpha is the learning rate. What does gradient descent do? Notice that if the derivative is small, then this update makes only a small change to the parameter w_j, whereas if the derivative term is large, it results in a big change to the parameter w_j. This makes sense because it is essentially saying that if the derivative is small, then changing w doesn't make a big difference to the value of J, so let's not bother making a huge change to w_j. But if the derivative is large, then even a tiny change to w_j can make a big difference in how much you can decrease the cost function J of w. In that case, let's make a bigger change to w_j, because doing so will actually make a big difference to how much we can reduce the cost function J.

Let's take a look at a few more examples of derivatives. What you saw in the example just now was that if w equals 3, then J of w equals w squared equals 9, and if w goes up by Epsilon, by 0.001, then J of w becomes J of 3.001, which is 9.006001. In other words, J has gone up by about 0.006, which is 6 times 0.001, or 6 times Epsilon, which is why the derivative of J of w with respect to w is equal to 6. Let's look at what the derivative will be for other values of w. Take w equals 2. In this case, J of w, which is w squared, is now equal to 4, and if w goes up by 0.001, then J of w becomes J of 2.001, which is 4.004001. So J of w has gone up from four to this value over here, which is roughly four times Epsilon bigger than four, which is why the derivative is now four: w going up by Epsilon has caused J of w to go up four times as much. Again, there's an extra 0.000001 here, because the approximation isn't exact when Epsilon is not infinitesimally small.

Let's look at another example. What if w were equal to negative 3? J of w, which is w squared, is still equal to 9, because negative 3 squared is 9. If w were to go up by Epsilon again, then you now have w equals negative 2.999, since w is negative 3 plus 0.001, and the square of negative 2.999 is 8.994001. Notice that here, J of w has gone down by about 0.006, which is six times Epsilon. What we have in this example is that J starts off at 9 but has now gone down (notice this down arrow here) by 6 times Epsilon, or equivalently it has gone up by negative 6 times Epsilon. That's why the derivative in this case is equal to negative 6: w going up by Epsilon causes J of w to go up by negative 6 times Epsilon when Epsilon is small. Another way to visualize this is to plot the function J of w, so that the horizontal axis is w and the vertical axis is J of w. Then when w is equal to 3, J of w is equal to 9; when w is negative 3, J of w is also equal to 9; and when w is 2, J of w is equal to 4.
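If you'd like to reproduce the numbers for w equals 3, 2, and negative 3 yourself, here is a small sketch, again my own illustration rather than code from the video, that estimates each derivative numerically and shows the gradient descent step it would drive; the learning rate alpha of 0.1 is a hypothetical choice for illustration:

```python
# A small sketch (my own illustration): estimate dJ/dw numerically at
# several values of w, and show the gradient descent step each one drives.
# The learning rate alpha = 0.1 is a hypothetical choice for illustration.

def J(w):
    return w ** 2

def estimate_derivative(J, w, epsilon=0.001):
    # Ratio of how much J goes up to how much w goes up.
    return (J(w + epsilon) - J(w)) / epsilon

alpha = 0.1
for w in [3, 2, -3]:
    derivative = estimate_derivative(J, w)
    w_updated = w - alpha * derivative   # gradient descent: w := w - alpha * dJ/dw
    print(w, round(derivative, 3), round(w_updated, 4))

# Prints roughly:
#  3   6.001   2.3999
#  2   4.001   1.5999
# -3  -5.999  -2.4001
```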
Let me make an observation that may be relevant if you've taken a calculus class before. If you haven't, what I say in the next 60 seconds may not make sense, but don't worry about it; you won't need it to fully follow the rest of these videos. If you've taken a class in calculus at some point, you may recognize that the derivative corresponds to the slope of a line that just touches the function J of w at a given point, say where w equals 3. The slope of that line at that point, which is this height over this width, turns out to be equal to 6 when w equals 3; the slope of this line turns out to be 4 when w equals 2; and the slope of this line turns out to be negative 6 when w equals negative 3. It turns out that in calculus, the slopes of these lines correspond to the derivative of the function. But if you haven't taken a calculus class before and haven't seen this slope concept, don't worry about it.

Now, there's one last observation I want to make before moving on, which is that in all three of these examples, J of w is the same function, J of w equals w squared, but the derivative of J of w depends on w: when w is three, the derivative is six; when w is two, the derivative is four; and when w is negative 3, the derivative is negative 6. It turns out that if you're familiar with calculus (and again, it's totally fine if you're not), calculus allows us to calculate the derivative of J of w with respect to w as 2 times w. In a little bit, I'll show you how you can use Python to compute these derivatives yourself using a nifty Python package called SymPy. Because calculus tells us that the derivative of J of w equals w squared is 2w, the derivative when w is three is 2 times 3, when w is two it's 2 times 2, and when w is negative 3 it's 2 times negative 3; this value of w times 2 gives you the derivative.

Let's go through just a few more examples before we wrap up. For these examples, I'm going to set w equals 2. You saw on the last slide that if J of w is w squared, then the derivative, I said, would be 2 times w, which is 4, and if w goes up by 0.001, this being Epsilon, J of w goes up by roughly 4 times Epsilon. Let's look at a few other functions. What if J of w is equal to w cubed? In this case, w cubed, 2 cubed, would be equal to 8. Or what if J of w is just equal to w? Here, J of w would be equal to 2. Or what if J of w were 1 over w? In this case, 1 over w, 1 over 2, would be 1/2 or 0.5. What is the derivative of J of w with respect to w when the cost function J of w is w cubed, or w, or 1 over w?

Let me show you how you can compute these derivatives yourself using a Python package called SymPy. Let me first import SymPy. What I'm going to do is tell SymPy that I'm going to use J and w as symbols for computing derivatives. For our first example, we had the cost function J equal to w squared; notice how SymPy actually renders it in this nifty font here as well. If we then use SymPy to take the derivative of J with respect to w, we do it as follows, and SymPy tells you this derivative is 2w. Let me actually store this in a variable, dJ_dw, and print it out: there's 2w. If you want to plug a value of w into this expression and evaluate it, you can use the subs method to substitute w equals 2 into the expression. That gives you the value of four, which is why when w equals 2, we saw the derivative of J was equal to 4. Let's look at some other examples. What if J were w cubed? Then the derivative becomes 3 times w squared.
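For reference, here is a rough sketch of what that SymPy session might look like; the exact notebook code isn't shown in this transcript, and the variable name dJ_dw is my own choice:

```python
# A rough sketch of the SymPy steps described above (my own reconstruction).
import sympy

J, w = sympy.symbols('J, w')     # declare J and w as symbols

J = w ** 2                       # first example: J = w^2
dJ_dw = sympy.diff(J, w)         # symbolic derivative of J with respect to w
print(dJ_dw)                     # prints 2*w
print(dJ_dw.subs([(w, 2)]))      # substitute w = 2, prints 4

J = w ** 3                       # next example: J = w^3
print(sympy.diff(J, w))          # prints 3*w**2
```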
It turns out from calculus, and this is what SymPy is calculating for us, that if J is w cubed, then the derivative of J with respect to w is 3w squared. Depending on what w is, the value of the derivative changes as well; if we plug in w equal to 2, we get 12 in this case. What if J is equal to w? In this case, the derivative is just equal to 1. And for our final example, what if J equals 1 over w? In this case, the derivative turns out to be negative 1 over w squared, which is negative 1 over 4. What I'm going to do is take the derivatives we have worked out. Remember, for w squared it was 2w, for w cubed it was 3w squared, for w it is just 1, and for 1 over w it is negative 1 over w squared. Let's copy these back to our other slide. What SymPy, or really calculus, showed us is that if J of w is w cubed, the derivative is 3w squared, which is equal to 12 when w equals 2; when J of w equals w, the derivative is just equal to 1; and when J of w is 1 over w, the derivative is negative 1 over w squared, which is negative 1/4 when w equals 2.

Now let's check whether these expressions that we got from SymPy are correct. Let's try increasing w by Epsilon and see what happens to J of w. Feel free to pause the video and check this math on your own calculator if you want. In this case, J of w, which is 2.001 cubed, becomes 8.012006001, so J has gone up from 8 to roughly 8.012; it's gone up by roughly 12 times Epsilon, and thus the derivative is indeed 12. Or if J of w equals w, then when w increases by Epsilon, J of w, which is just w, is now 2.001, so it's gone up by 0.001, which is exactly the value of Epsilon. So J of w has gone up by 1 times Epsilon, and the derivative is indeed equal to 1. Notice that here the change is exactly Epsilon, even though Epsilon isn't infinitesimally small. In our last example, if J of w equals 1 over w and w goes up by Epsilon, then J of w is 1 over 2.001, which turns out to be approximately 0.49975, with some extra digits that are truncated; this is 0.5 minus 0.00025. J of w started off at 0.5 and has gone down by 0.00025, and this 0.00025 is 0.25 times Epsilon. It has gone down by this amount, or equivalently it has gone up by negative 0.25 times Epsilon. We see that if w goes up by Epsilon, J of w goes up by negative 1/4, or negative 0.25, times Epsilon, which is why the derivative in this case is negative 1/4.

I hope that with these examples you have a sense of what the derivative of J of w with respect to w means. It's just this: if w goes up by Epsilon, how much does J of w go up? It goes up by some constant k times Epsilon, and that constant k is the derivative. The value of k depends both on what the function J of w is and on the value of w.
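If you'd like to reproduce these last three checks in code, here is one more small sketch, my own illustration rather than code from the video, comparing the numerical ratios against the calculus derivatives at w equals 2:

```python
# One more sketch (my own illustration): compare the numerical ratio
# (J(w + eps) - J(w)) / eps against the calculus derivative at w = 2.

checks = [
    (lambda w: w ** 3, lambda w: 3 * w ** 2),   # J = w^3,  dJ/dw = 3w^2
    (lambda w: w,      lambda w: 1),            # J = w,    dJ/dw = 1
    (lambda w: 1 / w,  lambda w: -1 / w ** 2),  # J = 1/w,  dJ/dw = -1/w^2
]

w, eps = 2, 0.001
for J, dJ_dw in checks:
    numerical = (J(w + eps) - J(w)) / eps
    print(round(numerical, 4), dJ_dw(w))

# Prints roughly:
# 12.006   12
# 1.0      1
# -0.2499  -0.25
```

The small mismatches, like 12.006 versus 12 and negative 0.2499 versus negative 0.25, are again just a consequence of Epsilon not being infinitesimally small.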
Before we wrap up this video, I want to briefly touch on the notation used to write derivatives that you may see in other texts. If J of w is a function of a single variable, say w, then mathematicians will sometimes write the derivative as d/dw of J of w; notice that this notation uses the regular lowercase letter d. In contrast, if J is a function of more than one variable, then mathematicians will sometimes use the squiggly alternative d, the partial derivative symbol, to denote the derivative of J with respect to one of the parameters w_i. To my mind, this distinction between the regular letter d and the stylized partial derivative symbol makes little sense, but for historical reasons, calculus texts will use these two different notations depending on whether J is a function of a single variable or a function of multiple variables. For practical purposes, I think this notational convention just over-complicates things in a way that isn't really necessary. For this class, I'm going to use this partial derivative notation everywhere, even when there's just a single variable. In fact, for most of our applications, the function J is a function of more than one variable, so this other notation, which is sometimes called the partial derivative notation, is actually the correct notation almost all the time, because J usually has more than one variable. I hope that using this notation throughout these lectures simplifies the presentation and makes derivatives a little bit easier to understand. In fact, this notation is the one you've been seeing in the videos leading up to now. For conciseness, instead of writing out the full expression, you will sometimes also see it shortened as the derivative, or partial derivative, of J with respect to w_i, or written like this. These are just simplified, abbreviated forms of the same expression.

I hope that gives you a sense of what derivatives are: if w goes up by a little bit, by Epsilon, how much does J of w change as a consequence? Next, let's take a look at how you can compute derivatives in a neural network. To do so, we need to take a look at something called a computation graph. Let's go take a look at that in the next video.