In the last video, you saw how if you have features for each movie, such as features x_1 and x_2 that tell you how much is this a romance movie and how much is this an action movie. Then you can use basically linear regression to learn to predict movie ratings. But what if you don't have those features, x_1 and x_2? Let's take a look at how you can learn or come up with those features x_1 and x_2 from the data. Here's the data that we had before. But what if instead of having these numbers for x_1 and x_2, we didn't know in advance what the values of the features x_1 and x_2 were? I'm going to replace them with question marks over here. Now, just for the purposes of illustration, let's say we had somehow already learned parameters for the four users. Let's say that we learned parameters w^1 equals 5 and 0 and b^1 equals 0, for user one. W^2 is also 5, 0 b^2, 0. W^3 is 0, 5 b^3 is 0, and for user four W^4 is also 0, 5 and b^4 0, 0. We'll worry later about how we might have come up with these parameters, w and b. But let's say we have them already. As a reminder, to predict music j's rating on movie i, we're going to use w^j dot product, the features of x_i plus b^j. To simplify this example, all the values of b are actually equal to 0. Just to reduce a little bit of writing, I'm going to ignore b for the rest of this example. Let's take a look at how we can try to guess what might be reasonable features for movie one. If these are the parameters you have on the left, then given that Alice rated movie one, 5, we should have that w^1.x^1 should be about equal to 5 and w^2.x^2 should also be about equal to 5 because Bob rated it 5. W^3.x^1 should be close to 0 and w^4.x^1 should be close to 0 as well. The question is, given these values for w that we have up here, what choice for x_1 will cause these values to be right? Well, one possible choice would be if the features for that first movie, were 1, 0 in which case, w^1.x^1 will be equal to 5, w^2.x^1 will be equal to 5 and similarly, w^3 or w^4 dot product with this feature vector x_1 would be equal to 0. What we have is that if you have the parameters for all four users here, and if you have four ratings in this example that you want to try to match, you can take a reasonable guess at what lists a feature vector x_1 for movie one that would make good predictions for these four ratings up on top. Similarly, if you have these parameter vectors, you can also try to come up with a feature vector x_2 for the second movie, feature vector x_3 for the third movie, and so on to try to make the algorithm's predictions on these additional movies close to what was actually the ratings given by the users. Let's come up with a cost function for actually learning the values of x_1 and x_2. By the way, notice that this works only because we have parameters for four users. That's what allows us to try to guess appropriate features, x_1. This is why in a typical linear regression application if you had just a single user, you don't actually have enough information to figure out what would be the features, x_1 and x_2, which is why in the linear regression contexts that you saw in course 1, you can't come up with features x_1 and x_2 from scratch. But in collaborative filtering, is because you have ratings from multiple users of the same item with the same movie. That's what makes it possible to try to guess what are possible values for these features. Given w^1, b^1, w^2, b^2, and so on through w^n_u and b^n_u, for the n subscript u users. If you want to learn the features x^i for a specific movie, i is a cost function we could use which is that. I'm going to want to minimize squared error as usual. If the predicted rating by user j on movie i is given by this, let's take the squared difference from the actual movie rating y,i,j. As before, let's sum over all the users j. But this will be a sum over all values of j, where r, i, j is equal to I. I'll add a 1.5 there as usual. As I defined this as a cost function for x^i. Then if we minimize this as a function of x^i you be choosing the features for movie i. So therefore all the users J that have rated movie i, we will try to minimize the squared difference between what your choice of features x^i results in terms of the predicted movie rating minus the actual movie rating that the user had given it. Then finally, if we want to add a regularization term, we add the usual plus Lambda over 2, K equals 1 through n, where n as usual is the number of features of x^i squared. Lastly, to learn all the features x1 through x^n_m because we have n_m movies, we can take this cost function on top and sum it over all the movies. Sum from i equals 1 through the number of movies and then just take this term from above and this becomes a cost function for learning the features for all of the movies in the dataset. So if you have parameters w and b, all the users, then minimizing this cost function as a function of x1 through x^n_m using gradient descent or cellular algorithm, this will actually allow you to take a pretty good guess at learning good features for the movies. This is pretty remarkable for most machine learning applications the features had to be externally given but in this algorithm, we can actually learn the features for a given movie. But what we've done so far in this video, we assumed you had those parameters w and b for the different users. Where do you get those parameters from? Well, let's put together the algorithm from the last video for learning w and b and what we just talked about in this video for learning x and that will give us our collaborative filtering algorithm. Here's the cost function for learning the features. This is what we had derived on the last slide. Now, it turns out that if we put these two together, this term here is exactly the same as this term here. Notice that sum over j of all values of i is that r,i,j equals 1 is the same as summing over all values of i with all j where r,i,j is equal to 1. This summation is just summing over all user movie pairs where there is a rating. What I'm going to do is put these two cost functions together and have this where I'm just writing out the summation more explicitly as summing over all pairs i and j, where we do have a rating of the usual squared cost function and then let me take the regularization term from learning the parameters w and b, and put that here and take the regularization term from learning the features x and put them here and this ends up being our overall cost function for learning w, b, and x. It turns out that if you minimize this cost function as a function of w and b as y as x, then this algorithm actually works. Here's what I mean. If we had three users and two movies and if you have ratings for these four movies, but not those two, over here does, is it sums over all the users. For user 1 has determined the cost function for this, for user 2 has determined the cost function for this, for user 3 has determined the cost function for this. We're summing over users first and then having one term for each movie where there is a rating. But an alternative way to carry out the summation is to first look at movie 1, that's what this summation here does, and then to include all the users that rated movie 1, and then look at movie 2 and have a term for all the users that had rated movie 2. You see that in both cases we're just summing over these four areas where the user had rated the corresponding movie. That's why this summation on top and this summation here are the two ways of summing over all of the pairs where the user had rated that movie. How do you minimize this cost function as a function of w, b, and x? One thing you could do is to use gradient descent. In course 1 when we learned about linear regression, this is the gradient descent algorithm you had seen, where we had the cost function J, which is a function of the parameters w and b, and we'd apply gradient descent as follows. With collaborative filtering, the cost function is in a function of just w and b is now a function of w, b, and x. I'm using w and b here to denote the parameters for all of the users and x here just informally to denote the features of all of the movies. But if you're able to take partial derivatives with respect to the different parameters, you can then continue to update the parameters as follows. But now we need to optimize this with respect to x as well. We also will want to update each of these parameters x using gradient descent as follows. It turns out that if you do this, then you actually find pretty good values of w and b as well as x. In this formulation of the problem, the parameters of w and b, and x is also a parameter. Then finally, to learn the values of x, we also will update x as x minus the partial derivative respect to x of the cost w, b, x. I'm using the notation here a little bit informally and not keeping very careful track of the superscripts and subscripts, but the key takeaway I hope you have from this is that the parameters to this model are w and b, and x now is also a parameter, which is why we minimize the cost function as a function of all three of these sets of parameters, w and b, as well as x. The average we derived is called collaborative filtering, and the name collaborative filtering refers to the sense that because multiple users have rated the same movie collaboratively, given you a sense of what this movie maybe like, that allows you to guess what are appropriate features for that movie, and this in turn allows you to predict how other users that haven't yet rated that same movie may decide to rate it in the future. This collaborative filtering is this gathering of data from multiple users. This collaboration between users to help you predict ratings for even other users in the future. So far, our problem formulation has used movie ratings from 1- 5 stars or from 0- 5 stars. A very common use case of recommended systems is when you have binary labels such as that the user favors, or like, or interact with an item. In the next video, let's take a look at a generalization of the model that you've seen so far to binary labels. Let's go see that in the next video.