In this video, we'll take a look at how you can use the scikit-learn library to implement PCA. These are the main steps. First, if your features take on very different ranges of values, you can perform pre-processing to scale the features so they take on comparable ranges of values. If you were looking at the features of different countries, those features take on very different ranges of values: GDP could be in trillions of dollars, whereas other features are less than 100. Feature scaling in applications like that would be important to help PCA find a good choice of axes for you. The next step is to run the PCA algorithm to "fit" the data and obtain two or three new axes, Z_1, Z_2, and maybe Z_3. Here I'm assuming you want two or three axes because you want to visualize the data in 2D or 3D. If you have an application where you want more than two or three axes, the PCA implementation can also give you more than two or three axes; it's just that it would then be harder to visualize. In scikit-learn, you will use the fit function, or the fit method, in order to do this. The fit function in PCA automatically carries out mean normalization; it subtracts out the mean of each feature, so you don't need to separately perform mean normalization. After running the fit function, you get the new axes, Z_1, Z_2, and maybe Z_3. In PCA, we also call these the principal components, where Z_1 is the first principal component, Z_2 the second principal component, and Z_3 the third principal component. After that, I would recommend taking a look at how much each of these new axes, or each of these new principal components, explains the variance in your data. I'll show a concrete example of what this means on the next slide, but this lets you get a sense of whether or not projecting the data onto these axes helps you retain most of the variability, or most of the information, in the original dataset. This is done using the explained_variance_ratio_ attribute. Finally, you can transform, meaning just project, the data onto the new axes, onto the new principal components, which you will do with the transform method. Then for each training example you have just two or three numbers, and you can plot those two or three numbers to visualize your data.

In detail, this is what PCA in code looks like. Here's the dataset X with six examples: X equals a NumPy array with the six examples over here. To run PCA to reduce this data from two numbers, X_1 and X_2, to just one number Z, you would run PCA and ask it to fit one principal component; n_components here is equal to one, and you fit PCA to X. pca_1 here is my notation for PCA with a single principal component, with a single axis. It turns out that if you were to print out pca_1.explained_variance_ratio_, this is 0.992. This tells you that in this example, when you choose one axis, it captures 99.2 percent of the variability, or of the information, in the original dataset. Finally, if you want to take each of these training examples and project it to a single number, you would then call the transform method, and this will output this array with six numbers corresponding to your six training examples. For example, the first training example, [1, 1], projected onto the Z-axis gives you this number, 1.383, and so on. So if you were to visualize this dataset using just one dimension, this would be the number I use to represent the first example, the second example is projected to be this number, and so on.
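To make these steps concrete, here is a minimal sketch of the one-component example in scikit-learn. The first example [1, 1], the 0.992 explained variance ratio, and the 1.383 projection are the values quoted above; the remaining five data points are an assumption, chosen to match the dataset shown in the optional lab.

```python
import numpy as np
from sklearn.decomposition import PCA

# Six two-dimensional training examples. The first example [1, 1] is quoted in
# the video; the other five points are assumed to match the optional lab.
X = np.array([[1, 1], [2, 1], [3, 2], [-1, -1], [-2, -1], [-3, -2]])

# Fit PCA with a single principal component Z_1. fit() also carries out
# mean normalization, so no separate step is needed for that.
pca_1 = PCA(n_components=1)
pca_1.fit(X)

# Fraction of the variance captured by Z_1 (about 0.992 for this data).
print(pca_1.explained_variance_ratio_)

# Project each example onto Z_1: one number per training example.
# The first example [1, 1] maps to roughly 1.383 (the sign can differ
# depending on the axis orientation scikit-learn picks).
X_trans_1 = pca_1.transform(X)
print(X_trans_1)
```

Plotting the single column of X_trans_1 along a number line gives the one-dimensional visualization described above.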
I hope you take a look at the optional lab, where you'll see that these six examples have been projected down onto this axis, onto this line, so all six examples now lie on a line that looks like this. The first training example, which was [1, 1], has been mapped to this point, which has a distance of 1.38 from the origin; that's why this is 1.38.

Just one more quick example. This data is two-dimensional, and we reduced it to one dimension. What if you were to compute two principal components? You start with two dimensions and also end up with two dimensions. This isn't that useful for visualization, but it might help us understand better how PCA, and the code for PCA, works. Here's the same code, except that I've changed n_components to two: I'm going to ask the algorithm to find two principal components. If you do that, pca_2.explained_variance_ratio_ becomes [0.992, 0.008]. What that means is that Z_1, the first principal component, still explains 99.2 percent of the variance, and Z_2, the second principal component, or the second axis, explains 0.8 percent of the variance. These two numbers add up to one because this data is two-dimensional, so the two axes, Z_1 and Z_2, together explain 100 percent of the variance in the data. If you were to transform, or project, the data onto the Z_1 and Z_2 axes, this is what you get: the first training example is now mapped to these two numbers, corresponding to its projection onto Z_1 and Z_2, and the second example, projected onto Z_1 and Z_2, becomes these two numbers. If you were to reconstruct the original data, roughly, this is Z_1 and this is Z_2: the first training example, which was [1, 1], has a distance of 1.38 along the Z_1 axis, which is this number, and a distance of 0.29 along the Z_2 axis, which is this distance, and the reconstruction actually looks exactly the same as the original data. Because if you "reduce" two-dimensional data to two-dimensional data, you're not really reducing it at all; there is no approximation, and you can get back to your original dataset from the projections onto Z_1 and Z_2. This is what the code to run PCA looks like, and a sketch of this two-component version also appears below. I hope you take a look at the optional lab, where you can play with this more yourself. Also try varying the parameters and look at specific examples to deepen your intuition about how PCA works.

Before wrapping up, I'd like to share a little bit of advice for applying PCA. PCA is frequently used for visualization, where you reduce data to two or three numbers so you can plot it, like you saw in an earlier video with the data on different countries, so you can visualize different countries. There are some other applications of PCA that you may occasionally hear about; they used to be more popular maybe 10, 15, or 20 years ago, but much less so now. One of them is data compression. For example, say you have a database of lots of different cars and you have 50 features per car, but it's just taking up too much space in your database, or maybe transmitting 50 numbers over the Internet just takes too long. Then one thing you could do is reduce these 50 features to a smaller number of features; it could be 10 features, with 10 axes or 10 principal components. You can't visualize 10-dimensional data that easily, but this is 1/5 of the storage space, or maybe 1/5 of the network transmission cost, needed.
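Here is the sketch of the two-principal-component version described above, again assuming the same six-example dataset. The inverse_transform method is one way to carry out the reconstruction step, and the signs of the projections may differ depending on the component orientation scikit-learn chooses.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same assumed six-example dataset as before.
X = np.array([[1, 1], [2, 1], [3, 2], [-1, -1], [-2, -1], [-3, -2]])

# Ask PCA for two principal components, Z_1 and Z_2.
pca_2 = PCA(n_components=2)
pca_2.fit(X)

# Roughly [0.992, 0.008]: Z_1 explains 99.2% of the variance, Z_2 the rest,
# and together they explain 100% of it.
print(pca_2.explained_variance_ratio_)

# Each example is now described by its coordinates on Z_1 and Z_2; for the
# first example [1, 1] these are about 1.38 and 0.29 in magnitude.
X_trans_2 = pca_2.transform(X)
print(X_trans_2)

# Mapping back from (Z_1, Z_2) to the original features recovers the data
# exactly, since no dimensions were dropped.
X_reconstructed = pca_2.inverse_transform(X_trans_2)
print(X_reconstructed)
```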
Many years ago I saw PCA used for this kind of data compression more often, but today, with modern storage able to hold pretty large datasets and modern networking able to transmit more data faster than ever before, I see this used much less often as an application of PCA. Another application of PCA that again used to be more common maybe 10 or 20 years ago, but much less so now, is using it to speed up training of a supervised learning model. The idea is, if you had 1,000 features, and having 1,000 features makes your supervised learning algorithm run slowly, maybe you can reduce them to 100 features using PCA; then your dataset is basically smaller and your supervised learning algorithm may run faster. This used to make a difference in the running time of some of the older generations of learning algorithms, such as support vector machines; it would speed up a support vector machine. But it turns out that with modern machine learning algorithms, algorithms like deep learning, this doesn't actually help that much, and it's much more common to just take the high-dimensional dataset and feed it into, say, your neural network, rather than run PCA, because PCA has some computational cost as well. You may still hear about this in some research papers, but I don't really see this done much anymore. The most common thing that I use PCA for today is visualization, where I find it very useful for reducing high-dimensional data so I can visualize it.

Thanks for sticking with me through to the end of the optional videos for this week. I hope you enjoyed learning about PCA, and that when you get a new dataset you find it useful for reducing the dimension of the dataset to two or three dimensions so you can visualize it and hopefully gain new insights into your data. It's helped me many times to understand my own datasets, and I hope that you find it equally useful as well. Thanks for watching these videos, and I look forward to seeing you next week.