In this video, you'll learn how you can construct vectors based on a co-occurrence matrix. Specifically, depending on the task you are trying to solve, you can choose from several possible designs. You will also see how you can encode a word or a document as a vector. Let me show you how you can do this. To get a vector space model using a word-by-word design, you will make a co-occurrence matrix and extract vector representations for the words in your corpus. You'll be able to get a vector space model using a word-by-document design with a similar approach. Finally, I'll show you how, in a vector space, you can find relationships between word and document vectors, also known as their similarity.

The co-occurrence of two different words is the number of times that they appear together in your corpus within a certain word distance k. For instance, suppose that your corpus has the following two sentences. The row of the co-occurrence matrix corresponding to the word data, with a k value equal to 2, would be populated with the following values. For the column corresponding to the word simple, you'd get a value equal to 2, because data and simple co-occur in the first sentence within a distance of one word, and in the second sentence within a distance of two words. The row of the co-occurrence matrix corresponding to the word data would look like this if you consider its co-occurrence with the words simple, raw, like, and I. In this case, the vector representation of the word data would be equal to [2, 1, 1, 0]. With a word-by-word design, you get a representation with n entries, where n is between one and the size of your entire vocabulary. The first sketch below shows this counting procedure in code.

For a word-by-document design, the process is quite similar. In this case, you count the number of times that words from your vocabulary appear in documents that belong to specific categories. For instance, you could have a corpus consisting of documents on different topics, like entertainment, economy, and machine learning. Here, you'd count the number of times that your words appear in the documents that belong to each of the three categories. In this example, suppose that the word data appears 500 times in documents from your corpus related to entertainment, 6,620 times in economy documents, and 9,320 times in documents related to machine learning. The word film appears in each document category 7,000, 4,000, and 1,000 times, respectively. Can you get a sense of where this is going already?

Once you've constructed the representations for multiple sets of documents or words, you get your vector space. Let's take the matrix from the last example. Here, you could take the representations for the words data and film from the rows of the table. Instead, I'll take the representation for each category of documents by looking at the columns. The vector space will have two dimensions: the number of times that the words data and film appear in each type of document. For the entertainment corpus, you'd have the vector [500, 7000]; for the economy category, [6620, 4000]; and for the machine learning category, [9320, 1000]. The second sketch below builds this matrix and reads off the column vectors. Note that in this space, it is easy to see that the economy and machine learning documents are much more similar to each other than either is to the entertainment category; the third sketch below checks this numerically.

Coming up soon, you'll make comparisons between vector representations using the cosine similarity and the Euclidean distance, in order to get the angle and distance between them.
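Here's a minimal Python sketch of the word-by-word counting, referenced above as the first sketch. The two example sentences are hypothetical stand-ins for the ones shown on screen, picked only because they reproduce the counts described in this video.

```python
from collections import defaultdict

def cooccurrence_row(corpus, target, k=2):
    """Count how often each word appears within k words of `target`."""
    counts = defaultdict(int)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            # Window of up to k words on either side of the target.
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + k + 1]
            for neighbor in window:
                counts[neighbor] += 1
    return counts

# Hypothetical sentences consistent with the counts in the transcript.
corpus = ["I like simple data", "I prefer simple raw data"]
row = cooccurrence_row(corpus, "data", k=2)
print([row[w] for w in ["simple", "raw", "like", "i"]])  # [2, 1, 1, 0]
```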
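The second sketch builds the small word-by-document matrix from the example and reads one vector per category off its columns. The variable names are my own; only the counts come from the video.

```python
import numpy as np

# Word-by-document counts from the example: one row per word,
# one column per category (entertainment, economy, machine learning).
counts = np.array([
    [500,  6620, 9320],   # "data"
    [7000, 4000, 1000],   # "film"
])

# Reading the columns gives one 2-D vector per document category.
entertainment, economy, machine_learning = counts.T
print(entertainment)     # [ 500 7000]
print(economy)           # [6620 4000]
print(machine_learning)  # [9320 1000]
```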
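Finally, the third sketch is a quick numeric peek at the claim that the economy and machine learning categories are the most similar pair. It uses the cosine similarity and Euclidean distance that the upcoming videos cover in detail; the cosine helper here is a plain implementation of the formula, not a course library function.

```python
import numpy as np

# Category vectors read off the columns of the word-by-document matrix.
entertainment = np.array([500, 7000])
economy = np.array([6620, 4000])
machine_learning = np.array([9320, 1000])

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(economy, machine_learning), 2))        # 0.91 -- small angle
print(round(cosine(entertainment, economy), 2))           # 0.58
print(round(cosine(entertainment, machine_learning), 2))  # 0.18 -- large angle

# The Euclidean distances tell the same story: economy and
# machine learning are the closest pair of categories.
print(round(float(np.linalg.norm(economy - machine_learning))))  # 4036
print(round(float(np.linalg.norm(entertainment - economy))))     # 6816
```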
So far, you've seen how to get vector spaces with two different designs, word-by-word and word-by-document, by counting either the co-occurrence of words with other words or the occurrence of words in the documents of a corpus. I've also shown you that in vector spaces you can determine relationships between types of documents, such as their similarity. Now you're becoming more and more familiar with these vector spaces. You've seen several possible designs that you can use to solve a specific task. You've also seen how you can encode words or tweets as vectors. In the next video, you'll learn about a new similarity metric that will allow you to compare two vectors. This similarity metric is known as the Euclidean distance.