So, regression is a somewhat complicated object. We've got a bunch of points, we've got a regression line. It's very nice if we can summarize things in a simple way. And when it comes to a linear regression, there are two very commonly used single number summaries. These two one number summaries are called R squared and Root Mean Squared Error, and Root Mean Squared Error is often given the acronym RMSE. Now R squared is the number that measures the proportion of variability in Y explained by the regression model. It turns out simply to be the square of the correlation between Y and X, but it has a nicer interpretation than a straightforward correlation: it is interpreted as the proportion of variability in Y explained by X. And so all other things being equal, you'd typically prefer a higher R squared over a lower one, because you're explaining more variability. RMSE is a different one number summary from a regression, and what RMSE is doing for you is measuring the standard deviation of the residuals. The residuals, remember, are the vertical distances from the points to the least squares, or fitted, line, and the standard deviation is a measure of spread. So it's telling you how much spread there is about the line in the vertical direction. I would often informally call that the noise in the system, and so RMSE is a measure of the noise in the system. What I've shown you in the table at the bottom of this slide are the calculations of R squared and RMSE for the three datasets that we've had a look at: the diamonds dataset, the fuel economy dataset and the production time dataset. R squared is frequently reported on a percentage basis. We've got a 98% R squared for the diamonds dataset; that's because there was a very strong linear association going on there. In the fuel economy dataset it was 77%, and in the production time dataset it was only sitting at 26%.
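As a minimal sketch of these two summaries, here is how R squared and RMSE fall out of a least squares fit. The data below are made-up stand-ins, not the actual diamonds, fuel economy, or production time datasets from the slides:

```python
import numpy as np

# Hypothetical (x, y) pairs standing in for one of the lecture's datasets.
x = np.array([0.15, 0.18, 0.20, 0.22, 0.25, 0.28, 0.30])
y = np.array([300.0, 410.0, 480.0, 560.0, 670.0, 780.0, 860.0])

# Fit the least squares line y = b0 + b1 * x.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# R squared: proportion of variability in y explained by the regression.
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# For a simple regression, R squared equals the squared correlation of x and y.
corr_squared = np.corrcoef(x, y)[0, 1] ** 2

# RMSE: the standard deviation of the residuals, in the units of y.
rmse = np.sqrt(np.mean(residuals**2))
```

Note the identity the lecture mentions: `r_squared` and `corr_squared` come out the same, but only for a simple (one-X) regression with an intercept.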
Now, one of the things you have to be careful about with R squared is that there is no magic value for it. It's not as if R squared has to be above a certain number for the regression model to be useful. You want to think of R squared much more as a comparative benchmark than an absolute one. So if I'm comparing two regression models for the same data, all other things being equal, I'm typically going to prefer the one with the higher R squared. But just because I've got a model with an R squared of, say, 5% or 10% doesn't necessarily mean that that model isn't going to be useful in practice; it is, though, a useful comparison metric. Now the other number, Root Mean Squared Error, I've calculated for the three examples here, and it's 32, 4.23, and, somewhat coincidentally, 32 for the production time dataset. One key difference between R squared and RMSE is the units of measurement. R squared, because it's a proportion, has no units associated with it at all, so it's easier to compare in that sense. RMSE certainly does have units, because it's the standard deviation of the residuals, and the residuals are the distances from point to line in the vertical direction; the vertical direction is the Y variable direction, so RMSE carries the units of Y. So for the diamonds dataset, that RMSE of roughly 32 is $32. For the fuel economy dataset, the RMSE of 4.23 is 4.23 gallons per thousand miles in the city, to be formal about it. And the 32 for the production time dataset is an RMSE of 32 minutes. These two one number summaries are frequently reported with a regression, and most software will calculate them automatically as soon as you run your regression model, for example within a spreadsheet environment. All other things being equal, we like higher values of R squared, because we're explaining more variability, and we like lower values of Root Mean Squared Error.
If there's a low standard deviation of the residuals around the regression line, that's tantamount to saying that the residuals are small, and the points are therefore close to the regression line, which is what we like. So those are the two one number summaries that accompany most regression models. Now perhaps the most useful thing you can do with Root Mean Squared Error is to use it as an input into what we call a prediction interval. Remember that when you have uncertainty in a process, you don't just want to give a forecast; you want to give some range of uncertainty about that forecast. That's just so much more useful in practice. And with suitable assumptions, we can use Root Mean Squared Error to come up with a prediction interval for a new observation. So here's our assumption: at a fixed value of X, the distribution of points about the true regression line follows a normal distribution. Another module has discussed the normal distribution, and this is one of the places where normality assumptions are very common in a regression context. What we're assuming is that the distribution of the points about the true regression line is normal. We'll talk about checking that in just a minute, but let's work with it as an assumption. Furthermore, that normal distribution is centered on the regression line. You can see the assumption being shown to you in the graphic on the page. Note there's no data here, because we're positing a true model, so to speak: there's a true regression line, and at any particular value of X there's a distribution about it. Let's take the left hand normal distribution. Say we took lots and lots of diamonds that weighed 0.15 of a carat. What do we expect their distribution to look like around the regression line?
We expect the distribution of the prices to be normally distributed, with the center of the normal distribution sitting on top of the regression line, and we believe that's true for any value of X. That's what one of the standard assumptions of a regression model involves. Furthermore, we're going to assume that all of these normal distributions around the true line have the same standard deviation. That's often termed the constant variance assumption, and with that assumption we can estimate the common spread of the points about the line in the vertical direction with RMSE. So RMSE will be our estimate of the noise in the system, and with this assumption of normality, it's estimating the standard deviation associated with the normal distribution that captures the spread of the points around the true regression line. So on this slide I've introduced an important assumption behind regression: the normality of the points about the regression line. Now, we know about Root Mean Squared Error as an estimate of the spread of the points about the regression line, and furthermore we believe, or at least we assume, that that spread is normally distributed. What can we do with that information? Well, here is what we do with it: we can put it together to come up with what we term an approximate 95% prediction interval for a new observation. So I'm going to present to you a rule of thumb that comes out of a regression, but you've got to be careful with this rule of thumb. You can only use it within the range of the data. If you are extrapolating forecasts outside the range of the data, don't use this rule of thumb; but at least within the range of the data, it's extremely useful. With the normality assumption, and overlaying the Empirical Rule, which was discussed in a separate module, we can construct, within the range of the data, an approximate 95% prediction interval for a new observation. So the idea is that somebody comes to me with a new diamond.
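One way to see the assumption at work is to simulate it. The sketch below generates data exactly as the model posits, normal noise with constant standard deviation around a true line, and then checks that the RMSE of a least squares refit recovers that standard deviation. The coefficients echo the diamonds example from later in the lecture, but the data themselves are simulated, not real:

```python
import numpy as np

rng = np.random.default_rng(0)

# The model assumption: at each x, y is normal, centered on the true line,
# with a constant standard deviation (here sigma = 32, echoing the
# diamonds RMSE; the line's coefficients are illustrative).
true_b0, true_b1, sigma = -260.0, 3721.0, 32.0
x = rng.uniform(0.15, 0.35, size=2000)
y = true_b0 + true_b1 * x + rng.normal(0.0, sigma, size=x.size)

# Refit by least squares; the RMSE should come out close to sigma,
# which is what lets us use RMSE as our estimate of the noise.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
rmse = np.sqrt(np.mean(residuals**2))
```

With a couple of thousand simulated points, `rmse` lands within a fraction of a unit of the true sigma of 32.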
A diamond that wasn't used in the calculation of the regression line. They've got a new diamond, they give it to me, and they say, it weighs 0.25 of a carat. What do you think it's going to go for? What do you think the price is going to be? I could use the prediction interval to give a range of feasible values. The 95% prediction interval is the forecast, which means go up to the regression line and read off the value, and then plus or minus twice the Root Mean Squared Error. That plus or minus twice the Root Mean Squared Error is coming straight out of the Empirical Rule: the 2 is there because we want a 95% prediction interval, and the RMSE is our estimate of the standard deviation of the underlying normal distribution. So this interval really captures one of the key goals of a regression, which is to provide uncertainty with our forecast. Not just a forecast, but an uncertainty range associated with that forecast. So with the normality assumption and Root Mean Squared Error, you're in a position, at least within the range of the data, to get a sense of the precision of forecasts coming out of the model. Let's have a look at that idea for the diamonds dataset. For the diamonds dataset, the RMSE was equal to 32, and the normality assumption says that, at least within the range of the collected data, for diamonds similar to the set used in the regression analysis, the width of an approximate 95% prediction interval for a new observation is plus or minus twice the Root Mean Squared Error. 2 times 32 is 64, so this model is able to price diamonds, using a 95% prediction interval, to within about plus or minus $64. That's the calculation that is done at the bottom of the slide. Working it out, if a diamond weighs 0.25 of a carat, I put 0.25 into the regression equation: that's -260 + 3721 x 0.25.
That's my forecast, or prediction, and then I do plus or minus twice the Root Mean Squared Error, which here is plus or minus 64, and I get a range of feasible values: somewhere between $606 and $734. And that really captures the essence of what these probabilistic models are able to do for you that you couldn't do with a deterministic model: a range of uncertainty. So there's the 95% prediction interval. We've now seen a 95% prediction interval. Remember that it relied on a normality assumption for the noise in the system, for the spread of the points about the regression line; we're assuming that was normally distributed. Now, when you make assumptions, part of the modeling process should be to think carefully about those assumptions and make a call on whether or not they seem reasonable. So always check your assumptions if you can. One way that I could check this normality assumption is to take the residuals from the regression. Those residuals, remember, are the vertical distances from the points to the regression line. I can have a look at those residuals and see whether or not they seem to at least approximately follow a normal distribution, and that's what I'm showing you on this particular slide. I've saved all of the residuals from the regression and I've plotted them using a histogram. The histogram displays the shape of the distribution of the underlying raw data, and we assumed that the spread of the points about the regression line was normal. Let's have a look and see what actually happens in practice. So I look at this distribution and I have to say, it's approximately bell-shaped; it's approximately normally distributed. And the way that I would articulate this is to say that, by looking at the residuals, I see no strong evidence against the normality assumption.
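The diamonds interval arithmetic from the lecture takes only a few lines. The fitted line and RMSE here are exactly the numbers quoted on the slide:

```python
# Diamonds fit from the lecture: price = -260 + 3721 * weight, RMSE = 32.
b0, b1, rmse = -260.0, 3721.0, 32.0
new_weight = 0.25  # carats; must lie within the range of the data

point_forecast = b0 + b1 * new_weight  # read off the regression line: 670.25
lower = point_forecast - 2 * rmse      # Empirical Rule: minus 2 RMSE -> 606.25
upper = point_forecast + 2 * rmse      # plus 2 RMSE -> 734.25

print(f"forecast ${point_forecast:.2f}, "
      f"approx. 95% prediction interval ${lower:.2f} to ${upper:.2f}")
```

Rounding to whole dollars gives the $606 to $734 range quoted above.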
I'd be very reticent to say, yes, there's a perfect normal distribution going on here, because I actually never believe that models are exactly right. But I do believe that some of them are useful, and I can validate that usefulness by checking assumptions. It's an incredibly important part of the model building process to state and then check your assumptions, and sometimes we call that process of checking the assumptions diagnostics. So when possible, you should look at model diagnostics. If I assume normality, then I should have a look at the appropriate data, the appropriate data here being the residuals, and assess whether or not normality appears to be a reasonable assumption. In summary, by looking at this histogram, I see something that is pretty bell-shaped, approximately normally distributed. So I don't feel that there's any strong evidence against the normality assumption, and I would proceed with it.
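The diagnostic described above, save the residuals and inspect their histogram, can be sketched as follows. The data are simulated stand-ins for the diamonds dataset (which isn't reproduced here), so the histogram is bell-shaped by construction; with real data, this is the point where a skewed or lumpy shape would count as evidence against the normality assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in data; coefficients echo the diamonds example.
x = rng.uniform(0.15, 0.35, size=500)
y = -260.0 + 3721.0 * x + rng.normal(0.0, 32.0, size=x.size)

# Fit the line and save the residuals (vertical distances to the fit).
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Crude text histogram of the residuals; look for a roughly symmetric,
# bell-shaped pattern centered near zero.
counts, edges = np.histogram(residuals, bins=10)
for c, left in zip(counts, edges):
    print(f"{left:8.1f} | " + "#" * (c // 5))
```

In practice you'd draw this with your software's histogram tool rather than text; the point is simply that the residuals, not the raw Y values, are the right thing to look at.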