In this unit, we'll talk about how the assembler handles symbols and translates them into binary code. The general problem that we are facing is writing an assembler for the Hack machine language, and the development strategy that we adopted was one of delaying the treatment of symbols to a later stage. Well, this later stage has arrived, and we now have to deal with symbols explicitly. I want to remind you that we already know how to handle white space and instructions, and therefore we now want to focus on symbols. As usual, I use red ink to highlight all the symbols in this particular example, and as you can see, there are quite a few of them. I think I mentioned it before: this is very typical in programming in general and in machine language programming in particular. The more symbols you have, the more expressive and easy to read the program is, and therefore we encourage programmers to use symbols, which means that the assembler has to work hard to resolve these symbols into binary code. But once we do it, we achieve something quite remarkable, because symbolic programming is a very nice abstraction that significantly improves the quality of your program and the quality of life of the programmer. Okay, so what kinds of symbols do we have in Hack programs? As it turns out, they fall into three distinct categories. First of all, we have variable symbols. Variable symbols represent memory locations where the programmer wants to maintain values that typically change in the course of executing the program. Now, this sentence is a little bit misleading, because the programmer doesn't care at all where these variables are actually located in memory. As far as the programmer is concerned, he or she is going to say something like @sum, or @x, or @y, and let the assembler worry about where to store or represent these variables in memory.
Once again, this is a very nice abstraction that the assembler delivers to the practicing programmer. So this is one kind of symbol that we have to worry about. Another kind is label symbols, which represent destinations of goto commands. In the example that we are facing here, we have three such symbols: LOOP, STOP, and END. And finally, we may have all sorts of predefined symbols, like virtual registers, screen, keyboard, and so on and so forth. In the example that we have here, we happen to have two predefined symbols, which are R0 and R1. So, if you want to deal with symbols, these are the three kinds of symbols that you have to handle. Let me begin to describe what to do with every one of them, and we'll start from the end, with predefined symbols. According to the Hack language specification, the language features 23 predefined symbols, and here you see all of them. How do you translate such a symbol into binary code? Well, first of all, we have to realize that these symbols come into play only in the context of A-instructions. So the only place where you can see such a symbol popping up in a program is an instruction like @preDefinedSymbol. How do you translate this instruction into binary? Well, you simply replace the predefined symbol with its corresponding value, which is a decimal number. Now the only thing that remains is to translate the resulting @decimalNumber instruction into binary, and this is something that we discussed in the previous unit. So, case closed, we know how to handle predefined symbols. The next category of symbols that we have to deal with is symbols that denote labels. Label symbols are used to denote destinations in the program that I may want to jump to using goto commands. They are declared in a very specific way, using a pseudo-command that consists of round parentheses with the name of the label in the middle.
I've used XXX to stand for this label, which can be any sequence of characters. These things are called pseudo-commands because they don't generate any code. When we translate the program into binary, we don't translate the label declarations, and that's why they're called pseudo-commands. We'll talk about this further later on. So we declare label symbols using the agreed-upon parentheses. And you might ask yourself, why parentheses? Why not a label followed by a colon, or something like this? Well, when Noam and I designed this language, we decided to use this syntax for reasons that will become clear later on. Now, once we declare a symbol using round parentheses, the meaning of this declaration is that from now on, whenever we see XXX in the program, we mean to replace it with the address of the memory location that contains the instruction following the declaration. In order to make sense of what I just said, I have to keep track of instruction numbers, and that's what I do next. If you look at the left-hand side of the slide, you will see that I have labeled every instruction in the program with a running number that starts with zero. Notice that I'm skipping empty lines, because empty lines are white space, which is tossed away, and I'm also skipping the pseudo-instructions. So label declarations like (LOOP), (STOP), and (END) are not counted, so to speak. Now, once I have these line numbers in front of me, or once I remember them somewhere in my memory, I see that I can relate LOOP to the number 4, STOP to the number 18, and END to the number 22. So I can basically generate this association and keep it in the back of my mind. From now on, whenever I see an instruction like @LOOP, I actually mean @4, because we want to go to instruction number four, and so on and so forth for all the other label symbols in the program. So how do you deal with an @labelSymbol instruction?
Well, all you have to do is look up the value of this symbol, which you figured out before, as I just explained, and replace the symbol with that value. What remains is an @value instruction, where value is a decimal number, and we know how to deal with this, because we discussed it in the previous unit. So that's how you deal with symbols that represent labels. The last category of symbols that we have to deal with is symbols that represent variables. Now, as you recall, when you wrote programs in the Hack symbolic language, you had this fantastic ability to create and use as many symbolic variables as you wanted. This is one of the most important abstractions in programming. And someone has to pay the price, so to speak. Someone has to implement this abstraction, and as you can imagine, the agent that implements this abstraction is the assembler. How do we do it? Well, according to the Hack language specification, any symbol that appears in the program which is not predefined and is not declared elsewhere by a label declaration is considered a variable. We have two such variables in the example in front of us, and they're called i and sum. The program, as you can see, begins with four lines of code that basically declare and initialize these two variables to 1 and 0, respectively. It's not terribly interesting, but I'm just trying to add some color to this discussion. All right, so we know how the programmer expresses variables. How do we handle these variables if we are the assembler? Well, each such variable is assigned a unique memory address, starting with 16. You may ask yourself, why 16? Well, this is a decision that Noam and I made when we developed this language and this assembler. It's not exactly an arbitrary decision, as you will see later on, but for now you can treat it as one. So variables are assigned to memory from address 16 onward.
All right, so with that in mind, we have only two variables in this program, i and sum, and the values of these variables are going to be 16 and 17. Now, like any other symbol in a Hack program, the only context in which such a symbol can come into play is an A-instruction. So we may see instructions like @variableName. We will definitely see them, because otherwise why did we declare these variables to begin with? We want to act on these variables, and we do it using A-instructions followed by C-instructions. So, whenever we see such an instruction, how do we translate it? Well, all you have to do is the following. If this variable appears in the program for the first time, then you allocate it to a memory address, starting from address 16 onward. If you see this variable popping up later in the program, you simply look up the value that you assigned to it before. What remains is an @decimalValue instruction, which we already know how to handle. So this is how we handle symbols that represent variables. Now, I think that you will agree with me that handling all these different kinds of symbols is a major headache. What can we possibly do to make this task simpler? Well, exactly for this reason, computer scientists have invented an artifact called a symbol table. The symbol table is a very simple and powerful data structure that enables you to store and use symbol-value pairs, and I can populate the table with as many symbol-value pairs as I please. So, when I write an assembler, I construct an empty symbol table, and then I begin to populate it with all the symbols that I encounter in the program that I am supposed to translate. And that's also how I'm going to explain to you the structure and the use of the symbol table. I'm going to do it constructively.
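To make the idea of symbol-value pairs concrete, here is a minimal sketch in Python, where the table is simply a dictionary. The entries are the ones mentioned for the running example (R0, R1, LOOP, STOP, END, i, sum); the function name `translate_a_instruction` is an illustrative choice, not something defined by the course, and the sketch assumes the instruction is well formed.

```python
# A symbol table as a plain dictionary of symbol -> value pairs.
# The entries are the ones mentioned for the running example.
symbol_table = {
    "R0": 0, "R1": 1,                   # predefined symbols
    "LOOP": 4, "STOP": 18, "END": 22,   # label symbols
    "i": 16, "sum": 17,                 # variable symbols
}

def translate_a_instruction(line, table):
    """Translate '@symbol' or '@number' into a 16-bit binary string.

    Hypothetical helper: assumes the line is a well-formed A-instruction
    and that any symbol it uses is already in the table.
    """
    operand = line.strip()[1:]          # drop the leading '@'
    value = int(operand) if operand.isdigit() else table[operand]
    return format(value, "016b")        # zero-padded to 16 binary digits

print(translate_a_instruction("@LOOP", symbol_table))  # 0000000000000100
print(translate_a_instruction("@sum", symbol_table))   # 0000000000010001
```

Once the table holds the right pairs, every symbolic A-instruction reduces to a lookup followed by the decimal-to-binary conversion from the previous unit.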
I'm going to describe how we actually build the symbol table in order to deal with the example that we see here. And by the way, the same will hold for any other Hack program that you will see in the future. So how do you create this symbol table? Well, the first thing that you do is construct an empty symbol table. Then you populate it with all the predefined symbols which are specified in the language. In the case of the Hack assembly language, we have 23 such pairs. Simply add them to the table one by one. And you do this, by the way, before you even touch the source program and before you start any translation. That's how you initialize the symbol table. Now what do you do next? The next thing that you do is march through the entire text file that constitutes your source assembly program, and the only thing that you do is look for label declarations. Look for lines of code that begin with a left parenthesis. Once you have such a line of code, you know that if the code was well written, if it contains no errors, it must be the beginning of a label declaration. Now, as you do this, you also keep track of how many lines you have read so far. And, as I explained previously, you count only real instructions, skipping label declarations and white space. So, if you have this count in mind, when you encounter LOOP, for example, you should know that LOOP corresponds to 4. Then you go on, ignoring everything else until you hit the next label declaration, which is STOP. You consult your counter, you see that STOP corresponds to 18, and so on and so forth. Based on this scanning, you can continue to build the symbol table. At the end of this process, you have added to the symbol table all the symbols that represent goto destinations in your program, the label symbols. We call this process the first pass. So the assembler that we are going to develop is going to be a two-pass assembler.
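The initialization and the first pass just described can be sketched as follows, in Python. This is a sketch under stated assumptions, not the reference implementation: the function names `new_symbol_table` and `first_pass` are made up for illustration, the source is assumed to be error free, and `//` comments are stripped along with white space. The short program at the bottom is a made-up fragment, not the program on the slide.

```python
def new_symbol_table():
    """Return a table pre-loaded with the 23 predefined Hack symbols."""
    table = {"SP": 0, "LCL": 1, "ARG": 2, "THIS": 3, "THAT": 4,
             "SCREEN": 16384, "KBD": 24576}
    for n in range(16):                      # virtual registers R0..R15
        table[f"R{n}"] = n
    return table

def first_pass(lines, table):
    """Record each (LABEL) with the number of the next real instruction."""
    instruction_count = 0
    for raw in lines:
        line = raw.split("//")[0].strip()    # drop comments and white space
        if not line:
            continue                         # empty line: not counted
        if line.startswith("("):
            # Label declaration: generates no code, so it is not counted.
            # Assumes well-formed input that ends with ')'.
            table[line[1:-1]] = instruction_count
        else:
            instruction_count += 1           # a real A- or C-instruction
    return table

# Hypothetical fragment, numbered the way the slide numbers instructions.
program = [
    "@i",        # 0
    "M=1",       # 1
    "(LOOP)",    # not counted
    "@i",        # 2
    "D=M",       # 3
    "",          # not counted
    "@END",      # 4
    "D;JGT",     # 5
    "@LOOP",     # 6
    "0;JMP",     # 7
    "(END)",     # not counted
    "@END",      # 8
    "0;JMP",     # 9
]
table = first_pass(program, new_symbol_table())
print(table["LOOP"], table["END"])  # 2 8
```

Note that the variable `i` is deliberately not in the table after this pass; the first pass only looks for label declarations.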
In the first pass, we extract from the program all the label symbols, and in the second pass, which I'm going to discuss next, we're going to extract all the variable symbols. So here's what we do in the second pass. Once we finish the first pass, we start once again to scan the entire program from beginning to end. And whenever we see a symbol which does not appear in the symbol table, we know that it's a variable. So we add it to the table, and we assign to it the values 16, 17, 18, 19, and so on, for as many variables as you have in the program. And when we encounter these variables later on in the program, we always look them up in the symbol table. If we find them in the symbol table, well, they are declared already and we can use them. If we don't find them, we can conclude that it's a new variable and we simply add it to the symbol table. That's the basic logic of constructing and utilizing a symbol table. So how do you actually use it? Well, to resolve a symbol, you look it up in the table, you retrieve its value, and you put it into the instruction. What you get is the meaning of this symbol according to the symbol table. Now, before we go on, I'd like to emphasize that the symbol table is an auxiliary data structure that the assembler needs in order to carry out the translation process. Once we finish the translation process, we can throw away the symbol table. We don't need it anymore. So we maintain it as long as the assembler is processing the program, and then we can toss it away. All right. So, with that in mind, we can now describe the overall assembly process. We have reached a point where we can actually lay out the algorithm according to which the assembler can be developed. So, here we go. First of all, we do some initialization.
We construct an empty symbol table and we get ready to process the input file. We add the predefined symbols to the symbol table, and then we go to work. We do a first pass: we go through the entire input file and search for lines that begin with a left parenthesis, and we add the pairs (xxx, address) to the symbol table as we go along. Then we do a second pass. In the second pass, we take care of the variable declarations, as I explained earlier. At the same time, if we have an instruction which is a C-instruction, we simply translate it into binary code. And if we have an A-instruction that refers to a symbol that was entered into the table previously, we look it up in the table, we replace the symbol with its value, and so on. We take the binary code that we generated, we write it into the output file, and that's it. We have completed translating the program from symbolic to binary. Now, I intentionally didn't read out all the details of this algorithm, because I discussed them in various ways in the previous units and in this unit. So you're welcome to stop the video, consult this algorithm, and convince yourself that it actually delivers the required translation. If you do all this, and if you actually implement it in some programming language, or on a piece of paper if you want (you can even teach a human being to be an assembler), you have accomplished the task of translating Hack symbolic code into Hack binary code. And that's the end of developing our assembler. In the next units, we're going to talk about actually building such an assembler using a programming language like Java or Python. Or, if you're not a programmer, or if you don't have previous knowledge of programming, we will also give you an option to develop an assembler without using a programming language. So stay tuned, and we'll discuss this in the subsequent units.
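As a starting point for such an implementation, the second pass might be sketched like this in Python. Everything named here is an assumption made for illustration: `second_pass` is a hypothetical function, and `translate_c` is a caller-supplied stand-in for the C-instruction translation covered in the previous unit. The table passed in is assumed to already contain the predefined symbols and the labels found in the first pass; the demo table holds only the two entries it needs.

```python
def second_pass(lines, table, translate_c):
    """Emit one 16-bit string per real instruction.

    Symbols missing from the table are treated as new variables and
    are allocated consecutive addresses starting at 16.
    """
    next_address = 16
    output = []
    for raw in lines:
        line = raw.split("//")[0].strip()   # drop comments and white space
        if not line or line.startswith("("):
            continue                        # white space and labels emit no code
        if line.startswith("@"):
            operand = line[1:]
            if operand.isdigit():
                value = int(operand)        # plain decimal constant
            else:
                if operand not in table:    # first occurrence: a new variable
                    table[operand] = next_address
                    next_address += 1
                value = table[operand]
            output.append(format(value, "016b"))
        else:
            output.append(translate_c(line))  # assumed C-instruction translator
    return output

# Demo: the four opening lines that initialize i and sum, then the LOOP label.
# The stand-in translator knows only the two C-instructions used here.
known_c = {"M=1": "1110111111001000", "M=0": "1110101010001000"}
table = {"R0": 0, "LOOP": 4}
program = ["@i", "M=1", "@sum", "M=0", "(LOOP)", "@i", "@LOOP"]
binary = second_pass(program, table, known_c.__getitem__)
print(table["i"], table["sum"])  # 16 17
print(binary[-1])                # 0000000000000100
```

The table is only needed while this function runs; once the binary lines are written to the output file, it can be discarded, as mentioned earlier.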