A good way to develop a content-based filtering algorithm is to use deep learning. The approach you'll see in this video is the way that many important commercial state-of-the-art content-based filtering algorithms are built today. Let's take a look.

Recall that in our approach, given a feature vector describing a user, such as age, gender, country, and so on, we have to compute a vector v_u, and similarly, given a vector describing a movie, such as the year of release, the stars in the movie, and so on, we have to compute a vector v_m. In order to compute v_u, we're going to use a neural network. This first neural network is what we'll call the user network. Here's an example of a user network that takes as input the list of features of the user, x_u, so the age, the gender, the country of the user, and so on. Then, using a few layers, say dense neural network layers, it will output the vector v_u that describes the user. Notice that in this neural network, the output layer has 32 units, so v_u is actually a list of 32 numbers. Unlike most of the neural networks we were using earlier, the final layer is not a layer with one unit; it's a layer with 32 units.

Similarly, to compute v_m for a movie, we can have a movie network that takes as input the features of the movie and, through a few layers of a neural network, outputs v_m, the vector that describes the movie. Finally, we'll predict the rating of this user on that movie as the dot product v_u . v_m. Notice that the user network and the movie network can hypothetically have different numbers of hidden layers and different numbers of units per hidden layer; only the two output layers need to have the same dimension.

In the description you've seen so far, we were predicting the 1-5 or 0-5 star movie rating. If instead we had binary labels, where y indicates whether the user liked or favored an item, then you can also modify this algorithm: instead of outputting v_u . v_m, you apply the sigmoid function to that, g(v_u . v_m), and use this to predict the probability that y^(i,j) is 1. To flesh out this notation, we can also add superscripts i and j if we want to emphasize that this is the prediction by user j on movie i.

I've drawn the user network and the movie network here as two separate neural networks, but it turns out that we can actually draw them together in a single diagram, as if they were a single neural network. This is what it looks like. On the upper portion of this diagram, we have the user network, which takes as input x_u and ends up computing v_u. On the lower portion, we have what was the movie network, which takes as input x_m and ends up computing v_m. These two vectors are then combined by a dot product; the dot here represents the dot product, and this gives us our prediction.

Now, this model has a lot of parameters: each layer of these neural networks has the usual set of neural network parameters. So how do you train all the parameters of both the user network and the movie network? What we're going to do is construct a cost function J, which is going to be very similar to the cost function you saw in collaborative filtering. Assuming that you do have some data of some users having rated some movies, we're going to sum, over all pairs (i, j) where you have labels, that is where r(i,j) = 1, the squared difference between the prediction v_u^(j) . v_m^(i) and the actual rating y^(i,j).
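To make this architecture concrete, here is a minimal sketch in TensorFlow/Keras of the two networks joined by a dot product. The layer sizes, feature counts, and learning rate here are hypothetical choices for illustration, not values prescribed by the lecture:

```python
import tensorflow as tf
from tensorflow import keras

num_user_features = 14   # hypothetical size of x_u
num_item_features = 16   # hypothetical size of x_m

# User network: maps user features x_u to the 32-number vector v_u.
user_NN = keras.Sequential([
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(32),
])

# Movie network: maps movie features x_m to the 32-number vector v_m.
item_NN = keras.Sequential([
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(32),
])

input_user = keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)

input_item = keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)

# The predicted rating is the dot product v_u . v_m.
output = keras.layers.Dot(axes=1)([vu, vm])

model = keras.Model([input_user, input_item], output)

# Squared-error cost over the (i, j) pairs where r(i, j) = 1;
# the training data would contain only those labeled pairs.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
              loss=keras.losses.MeanSquaredError())

# For binary labels y, you could instead wrap the dot product in a
# sigmoid and train with binary cross-entropy:
#   output = keras.layers.Activation('sigmoid')(output)
#   model.compile(..., loss=keras.losses.BinaryCrossentropy())
```

Note that both towers end in a 32-unit layer with no activation, matching the requirement that the two output layers share the same dimension so their dot product is defined.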
The way we would train this model is as follows: depending on the parameters of the neural networks, you end up with different vectors for the users and for the movies. What we'd like to do is train the parameters of the neural networks so that you end up with vectors for the users and for the movies that result in small squared error in the predictions you get out. To be clear, there's no separate training procedure for the user and movie networks; this cost function J is used to train all the parameters of both the user and the movie networks. We're going to judge the two networks according to how well v_u and v_m predict y^(i,j), and with this cost function, we're going to use gradient descent or some other optimization algorithm to tune the parameters of the neural networks so as to make the cost function J as small as possible. If you want to regularize this model, we can also add the usual neural network regularization term to encourage the networks to keep the values of their parameters small.

It turns out that after you've trained this model, you can also use it to find similar items. This is akin to what we saw with collaborative filtering, where the learned features helped you find similar items as well. Let's take a look. v_u^(j) is a vector of length 32 that describes a user j who has features x_u^(j). Similarly, v_m^(i) is a vector of length 32 that describes a movie with features x_m^(i). Given a specific movie, what if you want to find other movies similar to it? Well, the vector v_m^(i) describes movie i. To find other movies similar to it, you can look for other movies k such that the squared distance between the vector describing movie k and the vector describing movie i, ||v_m^(k) - v_m^(i)||^2, is small. This expression plays a role similar to what we had previously with collaborative filtering, where we talked about finding a movie with features x^(k) similar to the features x^(i). Thus, with this approach, you can also find items similar to a given item.

One final note: this can be pre-computed ahead of time. By that I mean, you can run a compute server overnight to go through the list of all your movies and, for every movie, find similar movies to it, so that tomorrow, if a user comes to the website and is browsing a specific movie, you already have pre-computed the 10 or 20 most similar movies to show to the user at that time. The fact that you can pre-compute ahead of time what's similar to a given movie will turn out to be important later when we talk about scaling up this approach to a very large catalog of movies. A small code sketch of this similar-items search follows below.

That's how you can use deep learning to build a content-based filtering algorithm. You might remember, when we were talking about decision trees and the pros and cons of decision trees versus neural networks, I mentioned that one of the benefits of neural networks is that it's easier to take multiple neural networks and put them together to make them work in concert to build a larger system. What you just saw was actually an example of that, where we took a user network and a movie network, put them together, and then took the inner product of the outputs. This ability to put two neural networks together is how we've managed to come up with a more complex architecture that turns out to be quite powerful.
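As a concrete illustration of the similar-items search and the overnight pre-computation just described, here is a minimal NumPy sketch. It assumes you have already run the trained movie network over the whole catalog to get a hypothetical array v_m of movie vectors; the function name and the choice of k = 10 are illustrative, not part of the lecture:

```python
import numpy as np

def top_k_similar(v_m, i, k=10):
    """Return the indices of the k movies closest to movie i.

    v_m is assumed to be an (n_movies, 32) array of vectors, one per
    movie, produced by the trained movie network.
    """
    # Squared distance ||v_m^(k) - v_m^(i)||^2 to every movie.
    d2 = np.sum((v_m - v_m[i]) ** 2, axis=1)
    d2[i] = np.inf                 # exclude the movie itself
    return np.argsort(d2)[:k]      # indices of the k closest movies

# Pre-compute overnight: the most similar movies for every movie i,
# ready to serve instantly when a user browses movie i tomorrow.
# similar = {i: top_k_similar(v_m, i) for i in range(v_m.shape[0])}
```

This brute-force scan is fine for a modest catalog; the scaling issues it runs into on very large catalogs are exactly what the next video addresses.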
One note: if you're implementing these algorithms in practice, I find that developers often end up spending a lot of time carefully designing the features needed to feed into these content-based filtering algorithms. If you end up building one of these systems commercially, it may be worth spending some time engineering good features for this application as well. In terms of these applications, one limitation of the algorithm as we've described it is that it can be computationally very expensive to run if you have a large catalog with many different movies you may want to recommend. In the next video, let's take a look at some of the practical issues and at how you can modify this algorithm to make it scale to working on even very large item catalogs. Let's go see that in the next video.