In this unit, we're going to describe a proposed architecture for an assembler. You've already seen the basic logic of what an assembler needs to do, but now we need to translate that into a real program and think about its architecture. Now, the truth is that an assembler is a pretty simple program, and those of you who already have significant programming experience can probably skip this unit quite safely and just go ahead and write the assembler. Those of you with less programming experience may want to look at our suggestions for a reasonable architecture and reasonable components of an assembler. Again, this is only a recommendation: you may want to look at it and then do whatever you feel most comfortable with. So let us see what types of modules will probably be useful when you write an assembler, the types of operations that certainly need to be done. Three of them can be identified very easily, so let's talk about them. The first is the parser, something that is able to read a file, get the different commands in it, and break them into parts. That's reading and parsing commands. Another part of the code will have to understand how to convert a mnemonic, an assembly language command or part of one, into actual machine code. So it will have to understand that mapping, and that's a separate part that needs to live somewhere in your assembler. And of course, there's another part that needs to handle symbols: the symbol table itself. So let's talk a bit about these three components, which we think of as classes. Once we've talked about them, you'll see that when you write the complete program using these components, there is not much left to be done.
So let's start with the first part, reading and parsing commands. The most important thing to think about is what we don't need to worry about in this part. The whole point of the parser, the component that reads and parses commands, is that it only needs to be able to read the input and break it into its parts. It doesn't need to understand anything else. It doesn't need to understand what the commands mean. It doesn't need to understand how they're translated to machine language. It doesn't need to understand the symbols or their addresses. The only thing it needs to understand is the format of the input language and how it breaks into different components. So let us now see what it does need to do. Well, first of all, it needs to be able to start reading a file with a given name. In particular, if you implement it as a class in some programming language, you may want a constructor that accepts the file name and opens the file for reading. So you will need to know how to read text files in the language that you're implementing in. The second thing is that once you've started reading a file, you should be able, every time, to get the next command in the file. What does getting the next command mean in terms of the methods that this part would have to provide? Well, you'll probably first have to know whether you've finished the file already: are there more commands, or have you reached the end of the file? And then you'll need to be able to read the next command in the file into some kind of string.
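Here is a minimal sketch, in Python, of the file-reading side of the parser just described. The class name `Parser` and the method names `has_more_commands` and `advance` are hypothetical choices for illustration. The sketch also assumes that blank lines and `//` comments should be skipped, which follows the Hack assembly format rather than anything said in this unit.

```python
class Parser:
    """Reads an assembly file and hands out one command at a time."""

    def __init__(self, file_name):
        # Open the file for reading and keep only lines that hold a command.
        self._commands = []
        with open(file_name, "r") as source:
            for line in source:
                # Assumption: '//' starts a comment; drop it and trim spaces.
                text = line.split("//", 1)[0].strip()
                if text:
                    self._commands.append(text)
        self._next = 0        # index of the next command to hand out
        self.current = None   # the command most recently read

    def has_more_commands(self):
        """True while there is at least one command left to read."""
        return self._next < len(self._commands)

    def advance(self):
        """Read the next command into self.current."""
        self.current = self._commands[self._next]
        self._next += 1
```

A caller would loop with `while parser.has_more_commands(): parser.advance()` and look at `parser.current` each time. Reading the whole file up front is a simplification that suits small assembly programs.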
So what that entails is basically that you will need some kind of string handling capability, and you will need to know how to read a file line by line, or something like that, in your language. The third thing you will have to know how to do is to break the command that you just read into its components. First of all, you need to know what kind of command it is. In our language, we have A-commands and we have C-commands. There are also pseudo-commands, if you wish, that define labels. These are not really commands in the sense of being translated to machine language, but the assembler will have to know that here we have a label definition so that it can handle it. So that will be a third kind of command, a pseudo-command if you wish, that our assembler will have to recognize. Once we know the kind of command, we will probably want to give the rest of the program easy access to its different parts. So for example, in a C-command, we'll probably want to give access to the destination part of the command, to the computation part, and to the jump part. Similarly, in a label or an A-command, we'll just want to give our user access to the actual string, which may be a symbol or may be an actual number. For example, a small set of accessor methods, in a Java-like language, would give you this type of access. But of course, the important thing is that you get the pieces of information separately, in whatever way you wish to provide them. The next thing that we have to do is translate each one of the components of the command into the actual machine code, into the binary code.
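The command-splitting operations described above could be sketched in Python as follows. All function and constant names here are hypothetical, and the sketch assumes each command string has already had whitespace and comments removed.

```python
A_COMMAND, C_COMMAND, L_COMMAND = "A", "C", "L"


def command_type(command):
    """Classify a cleaned command string."""
    if command.startswith("@"):
        return A_COMMAND
    if command.startswith("(") and command.endswith(")"):
        return L_COMMAND          # label pseudo-command, e.g. (LOOP)
    return C_COMMAND


def symbol(command):
    """Return the symbol or number of an A-command or label."""
    kind = command_type(command)
    if kind == A_COMMAND:
        return command[1:]        # drop the leading '@'
    if kind == L_COMMAND:
        return command[1:-1]      # drop the parentheses
    raise ValueError("C-commands have no symbol: " + command)


def dest(command):
    """Destination part of a C-command, or '' if there is none."""
    return command.split("=", 1)[0] if "=" in command else ""


def comp(command):
    """Computation part of a C-command."""
    rest = command.split("=", 1)[1] if "=" in command else command
    return rest.split(";", 1)[0]


def jump(command):
    """Jump part of a C-command, or '' if there is none."""
    return command.split(";", 1)[1] if ";" in command else ""
```

Note that none of these functions knows what `D` or `JMP` means. They only know where the `@`, `=`, `;` and parentheses sit in the input format, which is exactly the division of labor suggested for the parser.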
Now recall the specification of the language: each part of the assembly language command has a separate part inside the bits of the machine language command that corresponds to it. And this is basically what we need in order to do this translation. Now again, the important point is what we don't need to worry about. We don't need to worry at all about the way that our mnemonics, the pieces of the command, were obtained. We only get a destination, for example D, or a computation, for example M+1, and we need to be able to translate that very short string into the code. We don't need to worry about how we got it, what part of the input line it is, and so on. So this is a very well-defined and small task, and the way to do it was really already specified. How do we, for example, translate the destination part into the three bits that specify it? Well, there was a little table that already appeared in the previous lecture that specified that translation. For every string that can appear as the destination, we know what the bits are. So we just need to be able to do the translation that's written in the table. Similarly, there was another table to specify the jump part. And yet another table, the biggest one, to specify how you take the computation and translate it into the seven bits, one of them called a and the other six called c, that specify the computation in the machine language. And recall again that a C-command in our machine language actually starts with three 1 bits. So we also need to put in the three 1 bits, and voila, we have our machine language code. So let us see an example of how to use the parser object and the code object to actually do that translation.
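As a rough illustration, the three translation tables can be written as Python dictionaries. The table contents follow the Hack machine language specification. The names `DEST`, `JUMP`, `COMP` and `encode_c` are hypothetical, and building the `M` computations from the `A` ones is just one convenient way to avoid typing the table twice.

```python
# Destination mnemonic -> 3 bits. '' means the result is not stored.
DEST = {"": "000", "M": "001", "D": "010", "MD": "011",
        "A": "100", "AM": "101", "AD": "110", "AMD": "111"}

# Jump mnemonic -> 3 bits. '' means no jump.
JUMP = {"": "000", "JGT": "001", "JEQ": "010", "JGE": "011",
        "JLT": "100", "JNE": "101", "JLE": "110", "JMP": "111"}

# Computation mnemonic -> six c bits, for the a=0 half of the table.
_COMP_A = {"0": "101010", "1": "111111", "-1": "111010",
           "D": "001100", "A": "110000", "!D": "001101",
           "!A": "110001", "-D": "001111", "-A": "110011",
           "D+1": "011111", "A+1": "110111", "D-1": "001110",
           "A-1": "110010", "D+A": "000010", "D-A": "010011",
           "A-D": "000111", "D&A": "000000", "D|A": "010101"}

# Full table: a bit followed by the six c bits. Every computation that
# uses A has a twin that uses M, with the same c bits and a=1.
COMP = {}
for mnemonic, bits in _COMP_A.items():
    COMP[mnemonic] = "0" + bits
    if "A" in mnemonic:
        COMP[mnemonic.replace("A", "M")] = "1" + bits


def encode_c(dest, comp, jump):
    """Concatenate 111, the comp bits, the dest bits and the jump bits."""
    return "111" + COMP[comp] + DEST[dest] + JUMP[jump]
```

For instance, `encode_c("D", "M+1", "")` looks up each short string separately and glues the results together, without knowing anything about the input line they came from.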
So our parser object presumably gives us access to each one of the parts, for example the computation, destination, and jump parts, so we take each one of them and ask the parser to give us the string that corresponds to it. Once we have these three different strings, we go to the code object and ask it to translate each one of them separately according to the tables it holds. So now we have the machine code for each one of these parts. To put them together, we simply concatenate all of them, with the three 1s appearing on the left. And so we have a simple piece of code that gives us the complete translation of our original string, which held an assembly language command, into a string that now holds binary bits. Let us now move on to the third part that we've already identified, the symbol table: the actual table that keeps the association between symbol names and their addresses. Again, the important thing is what the symbol table doesn't have to understand. It doesn't need to understand anything about the machine language or the assembly language, or about what the symbols mean. The only thing that our symbol table has to do is maintain the association between a symbol and a memory address. That is it, without understanding any more than that. So, in particular, what kinds of operations must our symbol table be able to handle? Well, we need to be able to create a new, empty symbol table. We need to be able to add a symbol and address pair to the table. And we need to be able to look up a symbol in the table and see whether it exists there and, if so, what its address is. These are basically the operations that we need to be able to do. So that's very, very simple, and most languages already have a class that does this kind of thing for us, and we'll probably just need to use it.
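In Python, the ready-made class that does this kind of job is the built-in `dict`. The thin wrapper below is only a sketch of the three operations just listed, and the class and method names are hypothetical.

```python
class SymbolTable:
    """Keeps the association between symbol names and addresses."""

    def __init__(self):
        # Create a new, empty table.
        self._table = {}

    def add_entry(self, symbol, address):
        """Add a (symbol, address) pair to the table."""
        self._table[symbol] = address

    def contains(self, symbol):
        """Does the table hold this symbol?"""
        return symbol in self._table

    def get_address(self, symbol):
        """Return the address associated with the symbol."""
        return self._table[symbol]
```

The table knows nothing about labels, variables, or machine code. Whether a given symbol is a label or a variable is decided entirely by the code that calls `add_entry`.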
But let us look a little bit at how we are going to use the symbol table. Here is the basic logic of handling the symbols in our program. We will probably want to start by creating an empty symbol table. Then, if you remember, in the Hack assembly language there are a bunch of predefined symbols, such as KBD and so on. So the first thing we'll probably want to do in our program is add all these predefined symbols to the table. Then, as we go on reading the program and translating it, we'll probably want to add labels and variables to the symbol table whenever we see them. That way we have an up-to-date table all the time in our program. Whenever we see an A-command with a symbol in it, we look up the symbol in the table and get a direct translation to a number that we can then use. So the only thing that we need to look at a little more carefully is how we put the symbols, the labels and the variables, into the table. In the case of a label, whenever we see a label command, a name inside parentheses, we need to add that label as a symbol to the symbol table. And what is the address that's associated with it? We need to remember the current address, the address at which we're going to put the next command in the file. That is the address that's going to be associated with the label, and this is what we need to add to the symbol table. Now, two comments are in order here. The first is that we need to know where we are in the program, so we need to keep maintaining a running counter, a running line number if you wish, specifying the address that the next command is going to be put into.
Because this is exactly what is going to tell us what the current address is. The second comment is that, if you recall, we probably want to have a first pass, because there could be forward references to labels. We want to enter all the labels into the symbol table in a first pass, before we actually start using them. Now for variables, the situation is slightly different. Whenever we see an A-command with a symbol that we don't recognize yet, then first of all we know it's not a label, because the labels were all already entered in the first pass. So if it's not in the symbol table, it means that we have a new variable name, and we need to allocate it the next available address. And recall that the addresses that we allocate to variables start at the number 16, and then keep on going: 17, 18, and so on. So whenever we see a new symbol in an A-command, what we need to do is allocate a new place for that symbol and enter it into the table. So up to this point, we've described three basic components that our assembler will have to use. And now let's see how we use them. What is the overall logic of our assembler? Well, we start by initializing: we need to initialize our parser so that it opens the file, and we need to initialize the symbol table, first empty and then holding all the predefined symbols. Then we need to have our first pass, where we go over the program, pay attention only to labels, and enter them into the symbol table. Then we have to restart reading the program from the beginning: read a command and translate it, read a command and translate it. As we do that, we need to keep on entering new symbols into the table, only variable symbols now, because all the label symbols were already entered in the first pass. And then our main loop is very simple.
You read a command from the input. If it's an A-command and it has a symbol, we need to translate that symbol into an address, as we've already described. If it's a C-command, we need to split it into the three parts that a C-command has and translate each one of them into binary code. And then we have the binary code to write out as our output for that line in the file. And we keep on doing that until the program is over. And that's basically a simple piece of code that's left to write once we have the three modules that we've previously described. So that finishes our suggestion for the basic architecture of an assembler. And what we're going to describe in the next unit is the basic mechanics of doing the project: how to test it, how to submit it, and so on.
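To tie the pieces together, here is a compact Python sketch of the two-pass logic described in this unit. It is only one possible arrangement. To keep it short and self-contained, it works on a list of already cleaned command strings instead of a file, uses a plain `dict` as the symbol table, and takes a hypothetical `translate_c` function as a parameter to stand in for the parser and code modules. The predefined symbols and the 16-bit form of an A-command follow the Hack specification.

```python
def predefined_symbols():
    """A fresh symbol table holding the Hack predefined symbols."""
    table = {"SP": 0, "LCL": 1, "ARG": 2, "THIS": 3, "THAT": 4,
             "SCREEN": 16384, "KBD": 24576}
    for i in range(16):
        table["R%d" % i] = i          # R0 .. R15
    return table


def assemble(commands, translate_c):
    """Translate cleaned command strings into 16-bit binary strings.

    translate_c is a caller-supplied function (hypothetical here) that
    turns one C-command string into its 16 bits.
    """
    symbols = predefined_symbols()

    # First pass: only labels matter. A label gets the address of the
    # next real command, so label lines do not advance the counter.
    address = 0
    for command in commands:
        if command.startswith("("):
            symbols[command[1:-1]] = address
        else:
            address += 1

    # Second pass: translate, allocating variables from address 16 on.
    next_variable = 16
    output = []
    for command in commands:
        if command.startswith("("):
            continue                   # labels produce no code
        if command.startswith("@"):
            name = command[1:]
            if name.isdigit():
                value = int(name)
            else:
                if name not in symbols:    # not a label: a new variable
                    symbols[name] = next_variable
                    next_variable += 1
                value = symbols[name]
            output.append(format(value, "016b"))
        else:
            output.append(translate_c(command))
    return output
```

Because all labels are entered before the second pass begins, a forward reference such as `@END` appearing before `(END)` resolves to the label's address instead of being mistaken for a new variable.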