In order to apply anomaly detection, we're going to need to use the Gaussian distribution, which is also called the normal distribution. When you hear me say either Gaussian distribution or normal distribution, they mean exactly the same thing. If you've heard of the bell-shaped distribution, that also refers to the same thing. But if you haven't heard of the bell-shaped distribution, that's fine too. Let's take a look at what the Gaussian or normal distribution is. Say x is a number, and x is a random number, sometimes called a random variable, meaning x can take on random values. Suppose the probability of x is given by a Gaussian or normal distribution with mean parameter Mu and variance Sigma squared. What that means is that the probability of x looks like a curve that goes like this. The center or the middle of the curve is given by the mean Mu, and the width of this curve is given by the parameter Sigma. Technically, Sigma is called the standard deviation, and the square of Sigma, or Sigma squared, is called the variance of the distribution. This curve here shows what p of x, or the probability of x, is. If you've heard of the bell-shaped curve, this is that bell-shaped curve, because a lot of classic bells, say in towers, were shaped like this with the bell clapper hanging down here, and so the shape of this curve is vaguely reminiscent of the shape of the large bells that you will still find in some old buildings today. Better looking than my hand-drawn one, there's a picture of the Liberty Bell. Indeed, the top of the Liberty Bell is vaguely bell-curve shaped. If you're wondering what p of x really means, here's one way to interpret it. It means that if you were to get, say, 100 numbers drawn from this probability distribution, and you were to plot a histogram of those 100 numbers, you might get a histogram that looks like this: vaguely bell-shaped.
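To make that histogram interpretation concrete, here's a small Python sketch (not from the lecture) that draws 100 values from a Gaussian with Mu equals 0 and Sigma equals 1 and buckets them into a coarse histogram. The random seed and the bin count are arbitrary choices for illustration.

```python
import numpy as np

# A sketch (not from the lecture): draw 100 values from a Gaussian with
# mean mu = 0 and standard deviation sigma = 1, then bucket them into a
# coarse histogram. The counts pile up near the mean, which is the rough
# bell shape described above.
rng = np.random.default_rng(seed=0)  # seeded so the sketch is reproducible
samples = rng.normal(loc=0.0, scale=1.0, size=100)

counts, bin_edges = np.histogram(samples, bins=8)
print(counts)  # the largest counts sit in the middle bins, near the mean
```

With more samples and finer bins, the histogram's shape approaches the smooth bell curve, which is exactly the limiting picture described next.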
What this curve on the left indicates is not what you get with just 100 examples, or 1,000, or a million, or a billion. But if you had a practically infinite number of examples, and you were to draw a histogram of that practically infinite number of examples with very fine histogram bins, then you'd end up with essentially this bell-shaped curve here on the left. The formula for p of x is given by this expression: p of x equals 1 over square root of 2 Pi times Sigma, times e to the negative x minus Mu squared, where Mu is the mean parameter, divided by 2 Sigma squared. Pi here is about 3.14159, or roughly 22 over 7, the ratio of a circle's circumference to its diameter. For any given value of Mu and Sigma, if you were to plot this function as a function of x, you get this type of bell-shaped curve that is centered at Mu, with the width of the bell-shaped curve determined by the parameter Sigma. Now let's look at a few examples of how changing Mu and Sigma will affect the Gaussian distribution. First, let me set Mu equals 0 and Sigma equals 1. Here's my plot of the Gaussian distribution with mean Mu equals 0 and standard deviation Sigma equals 1. Notice that this distribution is centered at zero and that its standard deviation Sigma is equal to 1. Now, let's reduce the standard deviation Sigma to 0.5. If you plot the Gaussian distribution with Mu equals 0 and Sigma equals 0.5, it now looks like this. Notice that it's still centered at zero because Mu is zero, but it's become a much thinner curve because Sigma is now 0.5. Recall that Sigma, the standard deviation, is 0.5, whereas Sigma squared, also called the variance, is equal to 0.5 squared, or 0.25. You may have heard that probabilities always have to sum up to one, so the area under the curve is always equal to one, which is why when the Gaussian distribution becomes skinnier, it has to become taller as well. Let's look at another value of Mu and Sigma.
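As a sketch, the density formula above can be written directly in Python. The function name gaussian_pdf and the sample inputs are just illustrative choices, not from the lecture.

```python
import numpy as np

# A minimal sketch of the density formula from the lecture:
# p(x) = 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)**2 / (2 * sigma**2))
def gaussian_pdf(x, mu, sigma):
    coeff = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The curve peaks at x = mu, and halving sigma doubles the peak height:
# the "skinnier curves get taller" effect described above.
print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))   # peak height with sigma = 1
print(gaussian_pdf(0.0, mu=0.0, sigma=0.5))   # twice as tall with sigma = 0.5
```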
Now, I'm going to increase Sigma to 2, so the standard deviation is 2 and the variance is 4. This creates a much wider distribution because Sigma here is now much larger, and because it's a wider distribution, it's become shorter as well, since the area under the curve still equals 1. Finally, let's try changing the mean parameter Mu, and I'll leave Sigma equal to 0.5. In this case, the center of the distribution Mu moves over here to the right, but the width of the distribution is the same as the one on top because the standard deviation is 0.5 in both of these cases on the right. This is how different choices of Mu and Sigma affect the Gaussian distribution. When you're applying this to anomaly detection, here's what you have to do. You are given a dataset of m examples, and here x is just a number. Here's a plot of a training set with 11 examples. What we have to do is try to estimate what a good choice is for the mean parameter Mu, as well as for the variance parameter Sigma squared. Given a dataset like this, it would seem that a Gaussian distribution with a center here and a standard deviation like that might be a pretty good fit to the data. The way you would compute Mu and Sigma squared mathematically is this: our estimate for Mu will be just the average of all the training examples, that is, 1 over m times the sum from i equals 1 through m of the values of your training examples. The value we will use to estimate Sigma squared will be the average of the squared difference between each training example and the Mu that you just estimated here on the left. It turns out that if you implement these two formulas in code, with this value for Mu and this value for Sigma squared, then you pretty much get the Gaussian distribution that I hand-drew on top. This will give you a choice of Mu and Sigma for a Gaussian distribution, so that it looks like the 11 training examples might have been drawn from this Gaussian distribution.
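Here's a minimal sketch of that estimation step in Python. The 11-point dataset is made up for illustration; it is not the one plotted in the lecture.

```python
import numpy as np

# A sketch of the estimation step: given m training values x_1..x_m,
# mu = (1/m) * sum(x_i) and sigma^2 = (1/m) * sum((x_i - mu)**2).
# This 11-point dataset is made up, not the lecture's.
x_train = np.array([4.9, 5.1, 5.0, 4.8, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9, 5.0])

m = len(x_train)
mu = x_train.sum() / m                    # mean estimate
sigma2 = ((x_train - mu) ** 2).sum() / m  # variance estimate (1/m, not 1/(m-1))
print(mu, sigma2)
```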
If you've taken an advanced statistics class, you may have heard that these formulas for Mu and Sigma squared are technically called the maximum likelihood estimates for Mu and Sigma. Some statistics classes will tell you to use the formula 1 over m minus 1 instead of 1 over m. In practice, using 1 over m or 1 over m minus 1 makes very little difference. I always use 1 over m, but there are some other properties of dividing by m minus 1 that some statisticians prefer. If you didn't follow that, don't worry about it. All you need to know is that if you set Mu according to this formula and Sigma squared according to this formula, you'd get a pretty good estimate of Mu and Sigma, and in particular, you get a Gaussian distribution that is a plausible model for the probability distribution that your training examples came from. You can probably guess what comes next. If you were to get an example over here, then p of x is pretty high, whereas if you were to get an example way over here, then p of x is pretty low. That's why we would consider an example over here okay, not really anomalous, because it's a lot like the other ones, whereas an example way over here is pretty unusual compared to the examples we've seen, and therefore more anomalous, because p of x, which is the height of this curve, is much lower over here on the left compared to this point over here, closer to the middle. Now, we've done this only for when x is a number, which corresponds to having just a single feature for your anomaly detection problem. For practical anomaly detection applications, you will usually have many features, two or three or some even larger number n of features.
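Putting the pieces together, here's a sketch (not from the lecture) of how a fitted density could be used to flag low-probability examples. The values of Mu, Sigma, and the threshold epsilon here are made-up illustrative choices.

```python
import numpy as np

# A sketch of the final step: score each value with the fitted density
# and flag it as anomalous when p(x) falls below a small threshold epsilon.
def gaussian_pdf(x, mu, sigma):
    coeff = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

mu, sigma = 5.0, 0.5  # pretend these were estimated from training data
epsilon = 0.02        # threshold chosen by hand for illustration

for x in [5.1, 3.0]:
    p = gaussian_pdf(x, mu, sigma)
    label = "anomaly" if p < epsilon else "okay"
    print(f"x = {x}: p(x) = {p:.4f} -> {label}")
```

A value near the center of the fitted curve gets a high p of x and is labeled okay, while a value far out in the tail gets a tiny p of x and is flagged, which is exactly the intuition described above.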
Let's take what you saw for a single Gaussian and use it to build a more sophisticated anomaly detection algorithm that can handle multiple features. Let's go do that in the next video.