How does PCA work? Say you have a dataset with two features, x_1 and x_2. Initially, your data is plotted or represented using the axes x_1 and x_2. But you want to replace these two features with just one feature. How can you choose a new axis, let's call it the z-axis, that is somehow a good feature for capturing, or representing, the data? Let's take a look at how PCA does this. Here's a dataset with five training examples. Remember, this is an unsupervised learning algorithm, so we just have x_1 and x_2; there is no label y. An example here like this may have coordinates x_1 equals 10 and x_2 equals 8. If we don't want to use the x_1, x_2 axes, how can we pick some different axis with which to capture what's in the data, or with which to represent the data? One note on preprocessing: before applying the next few steps of PCA, the features should first be normalized to have zero mean, and I've already done that here. If the features x_1 and x_2 take on very different scales, for example, if you remember our housing example, where x_1 was the size of a house in square feet and x_2 was the number of bedrooms, then x_1 could be 1,000 or a couple of thousand, whereas x_2 is a small number. If the features take on very different scales, then you should also perform feature scaling before applying the next few steps of PCA. So assume the features have been normalized to have zero mean, meaning you subtract the mean from each feature, and maybe apply feature scaling as well so the ranges are not too far apart. What does PCA do next? To examine what PCA does, let me remove the x_1 and x_2 axes so that we're just left with the five training examples. This dot here represents the origin; it still marks the position of zero on this plot. What we have to do now with PCA is pick one axis, instead of the two axes that we had previously, with which to capture what's important about these five examples. If we were to choose this axis to be our new z-axis, which happens to be the same as the x_1 axis just for this example, then what we're saying is that for this example, we're going to capture just this value, this coordinate on the z-axis. For the second example, we're going to capture this value, and then this one will capture this value, and so on for all five examples. Another way of saying this is that we're going to take each of these examples and project it down to a point on the z-axis. The word project refers to the fact that you're taking this example and bringing it to the z-axis using this line segment that's at a 90-degree angle to the z-axis. This little box here is used to denote that this line segment is at 90 degrees to the z-axis. The term project just means you're taking a point and finding its corresponding point on the z-axis using this line segment that's at 90 degrees. Picking this direction as the z-axis is not a bad choice, but there are some even better choices. This choice isn't too bad because when you project your examples onto the z-axis, you still capture quite a lot of the spread of the data. These five points here are pretty spread apart, so you're still capturing a lot of the variation, or a lot of the variance, in the original dataset. By that I mean these five points, the projections of the data onto the z-axis, are quite spread apart, so the variance or variation among them is decently large. What that means is we're still capturing quite a lot of the information in the original five examples.
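To make this concrete, here's a minimal NumPy sketch of that idea: mean-normalize the features, project the examples onto a candidate z-axis with a dot product, and measure how spread out the projections are. Only the point (10, 8) comes from the example above; the other four points and the two candidate directions are made up for illustration.

```python
import numpy as np

# Five hypothetical examples with features x_1 and x_2; only the point (10, 8)
# comes from the example above, the rest are made up for illustration.
X = np.array([[10.0, 8.0],
              [2.0, 1.0],
              [4.0, 3.0],
              [7.0, 6.0],
              [5.0, 4.0]])

# Mean normalization: subtract each feature's mean so the data has zero mean.
X = X - X.mean(axis=0)

# Two made-up candidate length-1 vectors for the z-axis.
z_candidates = {
    "x_1 direction": np.array([1.0, 0.0]),
    "diagonal direction": np.array([1.0, 1.0]) / np.sqrt(2),
}

# Project the examples onto each candidate axis (one dot product per example)
# and measure how spread out the projections are.
for name, w in z_candidates.items():
    z = X @ w                      # z-coordinate of each example on this axis
    print(name, "variance of projections:", z.var())
```

The direction whose projections have the larger variance is the better one-number summary of the data, which is exactly the kind of comparison the rest of this video walks through.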
Let's look at some other possible choices for the axis z. Here's another choice. This is actually not a great choice, but if I were to choose this as my z-axis, then if I take those same five examples and project them down to the z-axis, I end up with these five points. You notice that, compared to the previous choice, these five points are quite squished together. The amount they differ from each other, their variance or variation, is much less. What this means is that with this choice of z, you're capturing much less of the information in the original dataset, because you've partially squished all five examples together. Let's look at one last choice, which is if I choose this to be the z-axis. This is actually a better choice than the previous two that we saw, because if we take the data's projections onto the z-axis, we find that these dots over here are actually quite far apart. We're capturing a lot of the variation, a lot of the information, in the original dataset, even though we're now using just one coordinate or one number to represent or capture each of the training examples, instead of two numbers or two coordinates, x_1 and x_2. In the PCA algorithm, this axis is called the principal component. It is the z-axis that, when you project the data onto it, gives you the largest possible amount of variance. If you were to reduce the data to one axis or to one feature, this principal component is actually a good choice, and this is what PCA will do. If you want to reduce the data to a one-dimensional feature, then it will choose this principal component. Let me show you a visualization of how different choices of the axis affect the projection. Here we have 10 training examples, and as we slide this slider here, which you can play with in one of the optional labs yourself, the angle of the z-axis changes. What you're seeing on the left is each of the examples projected via that short line segment at 90 degrees to the z-axis. Here on the right is that projection of the data, meaning the z-coordinate of each of these 10 examples. You notice that when I set the axis to about here, the points are quite squished together, so this captures less of the variation of the original data. Whereas if I set the z-axis, say, to this, then these points vary much more, and this captures much more of the information in the original dataset. That's why the principal component corresponds to setting the z-axis to about here. This is the choice that PCA would make if you asked it to reduce the data to one dimension. A machine learning library like scikit-learn, which you'll hear more about in the next video, can help you automatically find the principal component. But let's dig a little bit deeper into how that works. Here are my x_1 and x_2 axes. Here's one training example with coordinates 2 on the x_1 axis and 3 on the x_2 axis. Let's say that PCA has found this direction for the z-axis. What I'm drawing here, this little arrow, is a length-1 vector pointing in the direction of this z-axis that PCA will choose, or that we have chosen. It turns out this length-1 vector is the vector 0.71, 0.71, rounded off a bit; it's actually 0.707 and then a bunch of other digits. Given this example with coordinates 2, 3 on the x_1, x_2 axes, how do we project this example onto the z-axis? It turns out the formula for doing so is to take the dot product between the vector 2, 3 and this vector 0.71, 0.71. If you do that, 2, 3 dot product with 0.71, 0.71 turns out to be 2 times 0.71 plus 3 times 0.71, which is equal to 3.55. What that means is the distance from the origin to this point over here is 3.55, which means that if we were to use one number to represent or to capture this example, that one number is 3.55.
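Here is that projection as a one-liner in NumPy. The exact length-1 vector is 1/sqrt(2), roughly 0.707, in each coordinate, so the un-rounded answer is about 3.54; the 3.55 above comes from rounding the vector to 0.71, 0.71.

```python
import numpy as np

# The training example at (2, 3) and the length-1 vector along the z-axis.
x = np.array([2.0, 3.0])
w = np.array([1.0, 1.0]) / np.sqrt(2)    # approximately [0.707, 0.707]

# Projecting the example onto the z-axis is a dot product.
z = np.dot(x, w)
print(z)   # about 3.54 (3.55 when the rounded 0.71 values are used)
```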
So far, we have talked about how to use PCA to reduce data down to one dimension, or down to one number. We did so by finding the principal component, also sometimes called the first principal component. In this example, we found this as the first axis. It turns out that if you were to pick a second axis, the second axis will always be at 90 degrees to the first axis. If you were to choose even a third axis, then the third axis will be at 90 degrees to the first and the second axes. By the way, in mathematics, being at 90 degrees is sometimes called perpendicular; the term perpendicular just means at 90 degrees. Mathematicians will sometimes say the second axis, z_2, is at 90 degrees, or is perpendicular, to the first axis, z_1. If you choose additional axes, they're also at 90 degrees, or perpendicular, to z_1 and z_2 and to any other axes that PCA chooses. If you had 50 features and wanted to find three principal components, then if that's the first axis, the second axis will be at 90 degrees to it, and the third axis will also be at 90 degrees to the first and the second axes. Now, one question I'm often asked is: how is PCA different from linear regression? It turns out PCA is not linear regression; it's a totally different algorithm. Let me explain why. With linear regression, which is a supervised learning algorithm, you have data x and y. Here's a dataset where the horizontal axis is the feature x and the vertical axis is the label y. With linear regression, you're trying to fit a straight line so that the predicted value is as close as possible to the ground truth label y. In other words, you're trying to minimize the length of these little line segments, which are in the vertical direction, aligned with the y-axis. In contrast, in PCA there is no ground truth label y. You just have unlabeled data, x_1 and x_2, and furthermore, you're not trying to fit a line that uses x_1 to predict x_2. Instead, the algorithm treats x_1 and x_2 equally. We're trying to find this axis z such that, when you project the data onto z, these little line segments end up small. In linear regression, there is one number, y, which is given very special treatment: we're always measuring the distance between the fitted line and y, which is why these distances are measured just in the direction of the y-axis. Whereas in PCA, you can have a lot of features, x_1, x_2, maybe all the way up to x_50 if you have 50 features, and all 50 features are treated equally. We're just trying to find an axis z so that when the data is projected onto the axis z using these line segments, you still retain as much of the variance of the original data as possible. I know that when I plot these things in two dimensions, with just two features, which is all I can draw on a flat computer monitor, these arrows may look a little bit similar. But when you have more than two features, which is usually the case, the difference between what linear regression and PCA do is very large. These algorithms are used for totally different purposes and give you very different answers. Linear regression is used to predict a target output y, whereas PCA tries to take a lot of features, treat them all equally, and reduce the number of axes needed to represent the data well.
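Picking up the earlier example of 50 features and three principal components, here's a small sketch of how scikit-learn's PCA finds those axes; the data here is just random numbers for illustration. The rows of components_ are the length-1 axis vectors, and you can check that they are perpendicular to each other.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 200 examples with 50 features (random values, purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Ask PCA for three principal components.
pca = PCA(n_components=3)
Z = pca.fit_transform(X)          # each example is now represented by 3 numbers
print(Z.shape)                    # (200, 3)

# Each row of components_ is a length-1 axis; the axes are mutually perpendicular,
# so the matrix of pairwise dot products is (approximately) the 3x3 identity.
axes = pca.components_
print(np.round(axes @ axes.T, 6))
```

Note that scikit-learn's PCA subtracts the mean for you before finding the axes, though any additional feature scaling is still up to you.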
Coming back to the comparison with linear regression, it turns out that maximizing the spread of these projections corresponds to minimizing the lengths of these line segments, the distances the points have to move to be projected down onto z. To illustrate the difference between linear regression and PCA in another way, if you have a dataset that looks like this, all linear regression can do is fit a line that looks like that. Whereas if your dataset looks like this, PCA will choose this to be the principal component. So you should use linear regression if you're trying to predict the value of y, and you should use PCA if you're trying to reduce the number of features in your dataset, say, to visualize it. Finally, before we wrap up this video, there's one more thing you can do with PCA. Recall this example, which was at coordinates 2, 3. We found that if you project it onto the z-axis, you end up with 3.55. One thing you could do is ask: if you have an example where z equals 3.55, given just this one number z, 3.55, can we try to figure out what the original example was? It turns out there's a step in PCA called reconstruction, which tries to go from this one number, z equals 3.55, back to the original two numbers, x_1 and x_2. It turns out you don't have enough information to get back x_1 and x_2 exactly, but you can try to approximate them. In particular, the formula is: you take this number 3.55, which is z, and multiply it by the length-1 vector that we had just now, which is 0.71, 0.71. This turns out to be 2.52, 2.52, which is this point over here. We can approximate the original training example, which was at coordinates 2, 3, with this new point, which is at 2.52, 2.52. The difference between the original point and the projected point is this little line segment here. In this case it's not a bad approximation: 2.52, 2.52 is not that far from 2, 3. With just one number, we can get a reasonable approximation to the coordinates of the original training example. This is called the reconstruction step of PCA. To summarize, the PCA algorithm looks at your original data and chooses one or more new axes, z, or maybe z_1 and z_2, to represent your data, and then takes your original dataset and projects it onto that new axis or axes. This gives you a smaller set of numbers, which you can plot if you wish to visualize your data. You've now seen the math. Let's take a look at how you can implement this in code. In the next video, we'll look at how you can use PCA yourself using the scikit-learn library. Let's go on to the next video.
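As a small preview of that, here's the reconstruction step from this video in plain NumPy, using the same rounded numbers. In scikit-learn, the projection corresponds to PCA's transform method and the reconstruction to inverse_transform.

```python
import numpy as np

# z is the single number representing the example (2, 3) on the principal component,
# and w is the length-1 axis vector, rounded to 0.71, 0.71 as in this video.
z = 3.55
w = np.array([0.71, 0.71])

# Reconstruction: multiply z by the axis vector to approximate the original point.
x_approx = z * w
print(x_approx)   # roughly [2.52, 2.52], an approximation of the original (2, 3)
```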