To finish out the reinforcement learning formalism, instead of looking at something as complicated as a helicopter or a robot dog, we can use a simplified example that's loosely inspired by the Mars rover. This is adapted from an example due to Stanford professor Emma Brunskill and one of my collaborators, Jagriti Agrawal, who has written code that is actually controlling the Mars rover right now, and who also helped me talk through and develop this example. Let's take a look.
In this application, the rover can be in any of six positions, as shown by the six boxes here. The rover might start off, say, in the fourth box shown here. The position of the Mars rover is called the state in reinforcement learning, and I'm going to call these six positions state 1, state 2, state 3, state 4, state 5, and state 6, so the rover is starting off in state 4.
Now, the rover was sent to Mars to carry out different science missions. It can go to different places to use its sensors, such as a drill, a radar, or a spectrometer, to analyze the rock at different places on the planet, or go to different places to take interesting pictures for scientists on Earth to look at. In this example, state 1 here on the left has a very interesting surface that scientists would love for the rover to sample. State 6 also has a pretty interesting surface that scientists would quite like the rover to sample, but not as interesting as state 1. We would rather carry out the science mission at state 1 than at state 6, but state 1 is further away. The way we will reflect state 1 being potentially more valuable is through the reward function.
The reward at state 1 is 100, and the reward at state 6 is 40. The rewards at all of the other states in between I'm going to write as zero, because there's not as much interesting science to be done at states 2, 3, 4, and 5. On each step, the rover gets to choose one of two actions: it can either go to the left or go to the right. The question is, what should the rover do? In reinforcement learning, we pay a lot of attention to the rewards because that's how we know if the robot is doing well or poorly.
Let's look at some examples of what might happen if the robot were to go left, starting from state 4. Initially, starting from state 4, it will receive a reward of zero, and after going left, it gets to state 3, where it again receives a reward of zero. Then it gets to state 2, receives a reward of zero, and finally gets to state 1, where it receives a reward of 100.
For this application, I'm going to assume that when it gets to either state 1 or state 6, the day ends. In reinforcement learning, we sometimes call this a terminal state, and what that means is that after the robot gets to one of these terminal states, it gets the reward at that state, but then nothing happens after that. Maybe the robot runs out of fuel or runs out of time for the day, which is why it only gets to enjoy either the 100 or the 40 reward, and then that's it for the day. It doesn't get to earn additional rewards after that.
Now, instead of going left, the robot could also choose to go to the right, in which case from state 4 it would first have a reward of zero, then it would move right and get to state 5, have another reward of zero, and then it would get to the other terminal state on the right, state 6, and get a reward of 40.
But always going left and always going right aren't the only options. One thing the robot could do is start from state 4 and decide to move to the right.
It goes from state 4 to state 5, gets a reward of zero in state 4 and a reward of zero in state 5, and then maybe it changes its mind and decides to start going to the left, in which case it will get a reward of zero at state 4, at state 3, and at state 2, and then the reward of 100 when it gets to state 1. In this sequence of actions and states, the robot is wasting a bit of time. So this maybe isn't such a great way to take actions, but it is one choice that the algorithm could pick. Hopefully it won't pick this one.
To summarize, at every time step, the robot is in some state, which I'll call S, and it gets to choose an action, and it also enjoys some reward, R of S, that it gets from that state. As a result of this action, it gets to some new state, S prime. As a concrete example, when the robot was in state 4 and it took the action go left, it enjoyed the reward of zero associated with state 4, and it wound up in a new state, state 3. When you learn about specific reinforcement learning algorithms, you'll see that these four things, the state, the action, the reward, and the next state, which is what happens basically every time you take an action, will be the core elements of what reinforcement learning algorithms look at when deciding how to take actions. Just for clarity, the reward here, R of S, is the reward associated with the current state. This reward of zero is associated with state 4 rather than with state 3.
That's the formalism of how a reinforcement learning application works. In the next video, let's take a look at how we specify exactly what we want the reinforcement learning algorithm to do. In particular, we'll talk about an important idea in reinforcement learning called the return. Let's go on to the next video to see what that means.
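
If it helps to see the formalism written down, here is a minimal Python sketch of the six-state rover example. It only encodes what was described above: the rewards of 100, 0, 0, 0, 0, and 40, the two actions, the terminal states 1 and 6, and the (S, action, R of S, S prime) tuple recorded at each step. The names `REWARDS`, `TERMINAL_STATES`, `step`, and `rollout` are illustrative assumptions, not part of any library or of the original example's code.

```python
# Illustrative sketch of the six-state rover example; every name here is hypothetical.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # R(s) for each state
TERMINAL_STATES = {1, 6}                           # the day ends in these states
LEFT, RIGHT = "left", "right"


def step(state, action):
    """Return the next state s' reached by taking `action` in `state`."""
    if state in TERMINAL_STATES:
        raise ValueError("the day has ended; no moves from a terminal state")
    if action not in (LEFT, RIGHT):
        raise ValueError(f"unknown action: {action!r}")
    return state - 1 if action == LEFT else state + 1


def rollout(start_state, actions):
    """Follow `actions` from `start_state`.

    Returns the list of (s, a, R(s), s') tuples, plus the reward collected
    in the last state reached. The reward in each tuple belongs to the
    state the rover is leaving, not the one it arrives in.
    """
    transitions = []
    state = start_state
    for action in actions:
        next_state = step(state, action)
        transitions.append((state, action, REWARDS[state], next_state))
        state = next_state
    return transitions, REWARDS[state]


# The three walks discussed above, all starting from state 4.
go_left, final_left = rollout(4, [LEFT, LEFT, LEFT])
go_right, final_right = rollout(4, [RIGHT, RIGHT])
detour, final_detour = rollout(4, [RIGHT, LEFT, LEFT, LEFT, LEFT])

print(go_left[0])    # (4, 'left', 0, 3)
print(final_left)    # 100
print(final_right)   # 40
print(len(detour))   # 5
```

The first tuple, (4, 'left', 0, 3), is the concrete example from the summary: the reward of zero goes with state 4, and state 3 is the next state. The detour ends with the same reward of 100 as going straight left, but takes five steps instead of three.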