Lesson 6.2: Simple Linear Regression

Supplementary Notes 6.2

Linear Regression Model

In Lesson 6.1, we measured the strength of a linear relationship for a scatter of points that was linear in form. In this lesson, we go further and develop methods for finding an equation for the line that “best” fits the scatter. We’ll then use this line as a model (called the linear model) to predict values of the response variable (y) from values of the explanatory variable (x). The linear model in this context is also known as a Least Squares Regression model or a Simple Linear Regression (SLR) model.

Scatterplot best fit line
Figure 1: Scatterplot showing “best fit” line

Least Squares Interpretation of “Best Fit”

In a real-life statistical application, the data points in the scatterplot will seldom, if ever, line up perfectly along a straight line. So, the question becomes: How should we choose the line through the scatter that, in some sense, fits the data best?

The least-squares criterion is the most commonly used method for choosing the best fitting straight line. The least-squares criterion says to choose the line that makes the sum of the squared residuals as small as possible. A residual is just the difference between the observed y-value (y) and the corresponding predicted y-value (\hat{y}, pronounced “y-hat”) on the line.

3-point scatterplot
Figure 2: A scatterplot with three points, their residuals (Residual 1, 2, and 3), and a line of “best fit” through the points.

For the simple three-point scatterplot in Figure 2, the least squares criterion says to choose the line that makes (Residual 1)² + (Residual 2)² + (Residual 3)² as small as possible.

In symbols, the least squares criterion says to choose the values of b0 and b1 for the equation \hat{y}=b_0+b_1x (called the least squares line) that make \sum(y-\hat{y})^2 as small as possible.
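To make the criterion concrete, here is a minimal Python sketch (using three hypothetical data points, not data from this lesson) that compares the sum of squared residuals for an arbitrary guessed line with the sum for the least squares line found by numpy.

    import numpy as np

    # Three hypothetical data points, just to make the idea concrete
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 5.0, 4.0])

    def sum_squared_residuals(b0, b1):
        """Sum of squared residuals for the candidate line y-hat = b0 + b1*x."""
        y_hat = b0 + b1 * x
        return np.sum((y - y_hat) ** 2)

    # An arbitrary guessed line...
    print(sum_squared_residuals(2.0, 0.5))   # about 4.5

    # ...versus the least squares line (numpy returns the slope first, then the intercept)
    b1, b0 = np.polyfit(x, y, deg=1)
    print(b0, b1)                            # about 1.67 and 1.0
    print(sum_squared_residuals(b0, b1))     # about 2.67 -- the smallest value possible

No other line can do better than the least squares line on this criterion, which is exactly what “best fit” means here.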

Statistical software like jamovi can automatically calculate the y-intercept, b0, and the slope, b1, for the least squares line as illustrated in the following example.

Example: Find the Least Squares Regression Line Using jamovi

When the least squares criterion is used to determine the line of best fit through a scatter of points, the line is usually called the regression line (for historical reasons).

With jamovi, it is easy to find the equation of the regression line for a dataset like the data on used Toyota Corollas we considered in Supplementary Notes 6.1.

  • Download the data from corolla [CSV file] and open it in jamovi.
  • Select the Data tab and double-click the header for each variable column to change the Measure type for each of the variables from Nominal to Continuous.
  • Select Analyses > Regression > Linear Regression.
  • Move price to the Dependent Variable box.
  • Move age to the Covariates box.

The jamovi software returns the following output (Fig. 3) along with additional output we’ll consider later:

simple linear regression - corolla data
Figure 3: Simple linear regression output for the Toyota Corolla price data

The numbers in the “Estimate” column give us the y-intercept and slope for the regression line for the Corolla dataset: \widehat{price}=14507-785\,age.

This regression line gives the predicted price of a ten-year-old Corolla as \widehat{price}=14507-785(10)=$6657.

There is some rounding error in this calculation, which we can avoid by using jamovi to predict the price directly. In the Linear Regression dialog, click Save and select Predicted values. Then click the Data tab to find the predicted price of a ten-year-old Corolla to be $6655.595 (without rounding error).

The one ten-year-old Corolla in the dataset has a price of $7300, so the regression line is predicting a slightly lower price. Perhaps this suggests that the price for this Corolla is a little higher than for a typical ten-year-old Corolla in this market.

For a seven-year-old Corolla, the predicted price from the regression line is \widehat{price}=14507-785(7)=$9012. To find the predicted price without rounding error, go to the Data tab and type “7” into the first empty cell for the age variable. Then jamovi should fill in the predicted price in the corresponding cell of the Predicted values column: $9011.143.
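Here is a small Python sketch of that rounding effect. The “full precision” coefficients below (about 14507.42 and –785.18) are inferred from the two jamovi predictions quoted above, so treat them as approximations rather than jamovi’s exact output.

    # Coefficients rounded as in the text vs. approximate full-precision values
    b0_rounded, b1_rounded = 14507, -785
    b0_full, b1_full = 14507.42, -785.183     # inferred from the jamovi predictions above

    for age in (10, 7):
        print(age,
              b0_rounded + b1_rounded * age,        # 6657 and 9012
              round(b0_full + b1_full * age, 2))    # about 6655.59 and 9011.14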

Drawing the Regression Line Using jamovi

  • Select Analyses > Exploration > scatr > Scatterplot.
  • Move the variable age to the X-Axis box and price to the Y-Axis box.
  • Under Regression Line select Linear.

The jamovi software draws a scatterplot with age along the x-axis and price along the y-axis, and the regression line drawn through the point cloud:

scatterplot with regression line - corolla data
Figure 4: Scatterplot with regression line for Toyota Corolla data

Interpreting the Slope and y-Intercept of the Regression Line

What does the slope of the regression line tell us?

The slope of the regression line tells us how much the y-variable changes (on average) for a one-unit increase in the x-variable. The units for the slope are “y-units per x-units.”

In the Corolla example, the regression line model is \widehat{price}=14507-785\,age.

The slope of –785 “dollars per year” tells us that the price of a used Toyota Corolla drops $785 (on average) for each additional year of age for vehicles aged between one and 17 years. Of course, the $785 decrease in price per year is not a certainty. It is just a reasonable prediction obtained from the regression model.

What does the y-intercept of the regression line tell us?

The y-intercept of the regression line gives us the predicted value of the y-variable corresponding to an x-variable value of 0. But we must be careful here since in a particular application, a value of x = 0 may fall outside the x-value data range, and we shouldn’t use the regression line to make predictions outside the x-value data range.

In the Corolla example, the y-intercept is $14,507 and would represent the predicted price of a new Corolla. But notice that the Corolla dataset started at x = 1, so a prediction at x = 0 is extrapolating outside the x-values of the dataset. This is a questionable prediction since the relationship between price and age may be different for new Corollas.

An example of a more extreme invalid extrapolation would be to use the regression line to predict the price of a 20-year-old Corolla: \widehat{price}=14507-785(20)=$-1193.

Clearly the negative predicted price of –$1193 is absurd (and very insulting for owners of fine 20-year-old Corollas!).

For the Corolla data, the regression model was built for ages (x-values) between one year and 17 years. Using this model to predict prices for ages outside this interval is risky (and likely inappropriate).

Finding the Least Squares Regression Line by Hand

In practice, statistical software like jamovi is used to find the equation of the regression line, but hand calculation using the least squares formulas for the slope and y-intercept can give us some insight into how the regression line works.

  • Least Squares Regression Line: \hat{y}=b_0+b_1x.
  • Slope: b_1=r\dfrac{s_y}{s_x}.
  • y-Intercept: b_0=\overline{y}-b_1\overline{x}.

Example: Run Time by Age

The scatter in Figure 5 suggests a linear relationship between the age of a female athlete and the time to run 10 km.

scatterplot - 10k run times
Figure 5: Scatterplot shows the run time (minutes) by age (years) of female athletes

Here are the summary statistics for this dataset:

  • Age mean: \overline{x}=32.33 years
  • Age SD: s_x=5.92 years
  • Run time mean: \overline{y}=34.00 mins
  • Run time SD: s_y=2.90 mins
  • Correlation: r=0.92

Given the above summary statistics, it is easy to find the equation of the regression line.

  • Slope: b_1=r\dfrac{s_y}{s_x}=0.92\dfrac{2.90}{5.92}=0.45
  • y-Intercept: b_0=\overline{y}-b_1\overline{x}=34.00-0.45(32.33)=19.45
  • Regression Line Equation: \widehat{runtime}=19.45+0.45\,age

This regression line model predicts that the 10 km run time for female athletes in this age range increases on average by 0.45 minutes for each year of age (between 25 and 40 years old). In this example, the y-intercept has no natural meaning since it is ridiculous to predict a run time for an athlete of age 0!
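As a quick check on the hand calculation, here is a minimal Python sketch that plugs the summary statistics above into the slope and intercept formulas (the small difference from 19.45 arises only because the hand calculation rounded the slope to 0.45).

    # Summary statistics for the run time example
    x_bar, s_x = 32.33, 5.92    # age: mean and SD (years)
    y_bar, s_y = 34.00, 2.90    # run time: mean and SD (minutes)
    r = 0.92

    b1 = r * s_y / s_x          # slope: about 0.45 minutes per year of age
    b0 = y_bar - b1 * x_bar     # intercept: about 19.43 minutes

    print(round(b1, 2), round(b0, 2))
    print(round(b0 + b1 * 30, 1))   # predicted run time for a 30-year-old: about 33 minutes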

Insight into the Least Squares Regression Line

Let’s use this run time by age example (above) to get some insight into the least squares regression line.

regression line - run time example
Figure 6: A scatterplot for run time (minutes) by age (years) data: (solid black line) regression line; (dashed line) perfect correlation (r = 1) line

For this example, the regression line is \widehat{runtime}=19.45+0.45\,age and its graph is shown as the solid line in the scatterplot (Fig. 6). Notice that it goes through the point (32.33, 34.00).

What’s special about this point?

Look back at the summary statistics to see that 32.33 is the mean age, \overline{x}, and 34.00 is the mean run time, \overline{y}. The point (32.33, 34.00) = (\overline{x}, \overline{y}) is called the point of averages. It’s not a fluke that this regression line goes through the point of averages. The least squares regression line will always pass through the point of averages (\overline{x}, \overline{y}). If you are algebraically inclined, you could prove this!
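In fact, the proof takes only one line: substitute x = \overline{x} into the least squares equation and use the intercept formula b_0=\overline{y}-b_1\overline{x}:

\hat{y}=b_0+b_1\overline{x}=(\overline{y}-b_1\overline{x})+b_1\overline{x}=\overline{y}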

What else can we learn about the regression line from this example?

It’s tempting to think that if we move 1 SD up from \overline{x} on the x-scale to 38.25 (32.33 + 5.92), the line would go up 1 SD from \overline{y} on the y-scale to 36.90 (34.00 + 2.90). But does it?

Check it out: at x = 38.25, the regression line predicts a run time of 19.45 + 0.45(38.25) = 36.66 minutes. This is close, but a little less than 36.90 minutes.

Why the discrepancy?

The answer lies in the formula for the slope of the regression line, b_1=r\dfrac{s_y}{s_x}. It says that for a “run” of s_x on the x-scale, the predicted y-value “rise” is only r times s_y on the y-scale. Since r = 0.92, the predicted vertical rise is slightly less than 1 SD on the y-scale. Since the correlation between the two variables is not perfect, we hedge a little bit when we make predictions away from the point of averages. The dashed line (Fig. 6) represents the perfect correlation (r = 1) line. The regression line (solid black line) is tipped down a little from the dashed line by the factor r (here r = 0.92).
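Numerically, the check works out as follows (the tiny difference from the 36.66 computed above is just rounding in the slope):

\overline{y}+r\,s_y=34.00+0.92(2.90)=36.67 \text{ minutes}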

Summary

The least squares regression line always goes through the point of averages (\overline{x}, \overline{y}), and for each s_x change on the x-scale, the predicted change on the y-scale is r\times s_y.

“Reading” the Residual Scatterplot

Sometimes it’s hard to say whether it is reasonable to fit a simple linear regression model for a particular dataset represented in a scatterplot. In many cases, the regression line is first calculated, and then a new scatterplot is created for the residuals. This new residual scatterplot is then used to determine whether the linear model assumptions are appropriate. For the linear model assumptions to be appropriate, the residual scatterplot should have no pattern at all to it. If it does have a pattern, then the linear model assumptions are questionable. Remember that the residuals are calculated as residual = observed y-value − predicted y-value = y-\hat{y}.

Here’s how to draw a residual scatterplot using jamovi for the used Toyota Corolla dataset.

Drawing a Residual Scatterplot Using jamovi

In the Linear Regression dialog, click Assumption Checks and select Residual plots. The jamovi software creates three scatterplots:

  • Residuals on the vertical axis and fitted values (predicted values) on the horizontal axis (Fig. 7).
    residual plot - corolla example
    Figure 7: Residual scatterplot (residuals vs fitted values) for used Toyota Corollas
  • Residuals on the vertical axis and the response variable (price) on the horizontal axis. Note: Ignore this plot; it is not relevant!
  • Residuals on the vertical axis and the predictor variable (age) on the horizontal axis (Fig. 8).
    residual plot - corolla example
    Figure 8: Residual scatterplot (residuals vs age) for used Toyota Corollas

The first (Fig. 7) and third (Fig. 8) residual plots produced by jamovi convey the same information, since in simple linear regression the fitted values are just a linear function of the predictor variable (age). So, in practice we only need to look at one of these residual plots. For example, the residual scatterplot in Figure 8, with age on the horizontal axis, shows no strong pattern, so the linear model seems appropriate for this dataset. There is one point with a residual close to 3,000, which lies a little way away from the other points. However, it’s not so far away as to be particularly worrisome.
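If you want to build a residual plot yourself outside jamovi, here is a minimal Python sketch. The age and price values shown are placeholders rather than the actual Corolla data; in practice you would load the full corolla CSV into the two arrays.

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder values -- load the corolla CSV into these arrays in practice
    age = np.array([1, 3, 5, 8, 10, 13, 17])
    price = np.array([13800, 12300, 10500, 8300, 7300, 4900, 1900])

    b1, b0 = np.polyfit(age, price, deg=1)   # least squares slope and intercept
    residuals = price - (b0 + b1 * age)      # residual = observed - predicted

    plt.scatter(age, residuals)              # residuals vs the predictor (age)
    plt.axhline(0, linestyle="--")           # reference line at residual = 0
    plt.xlabel("age (years)")
    plt.ylabel("residual (dollars)")
    plt.show()

A patternless cloud around the zero line is what we hope to see.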

Interpreting R2

The appropriateness of using a linear model can be judged by looking at the residual scatterplots as just discussed. But a linear model that is judged appropriate (because the residual scatterplots have no strong patterns) may or may not fit the data well. It all comes down to the amount of variation the data exhibit about the regression line.

scatterplots - strong and weak relationships
Figure 9: Two scatterplots: (left) a strong relationship with little scatter about the line; (right) a weak relationship with more scatter about the line

How should we measure the strength of the linear model?

Hopefully the correlation coefficient r springs to mind! The correlation coefficient r can certainly be used, but in the regression line context, we usually use the square of r (r2) because it tells us the percentage of the variation in the response variable (y) that has been accounted for by the linear model. Usually, r2 is written as R2 and is expressed as a percentage.
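For simple linear regression, R2 can be computed either as the square of r or directly from the residuals as 1 − (sum of squared residuals)/(total sum of squares about \overline{y}). Here is a minimal Python sketch, with hypothetical data, showing that the two routes agree.

    import numpy as np

    # Hypothetical, nearly linear data for illustration
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
    b1, b0 = np.polyfit(x, y, deg=1)
    residuals = y - (b0 + b1 * x)

    ss_res = np.sum(residuals ** 2)          # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
    r_squared = 1 - ss_res / ss_tot

    print(r ** 2, r_squared)                 # the two values agree (both close to 1 here)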

Example: Price of Used Toyota Corollas

You’ve likely already noticed that R2 automatically appears in the output whenever you do a linear regression using jamovi:

model fit measures - corolla example
Figure 10: Model fit measures for used Toyota Corollas: R2 is shown in the right column.

Thus, for the Corolla linear regression, R2 = 0.913 or 91.3%:

This tells us that about 91.3% of the variation in Corolla prices can be explained, or is accounted for, by the linear model taking the age of the car into account. Said another way, differences in the age of the car account for about 91.3% of the variation in the prices. Given the lack of strong patterns in the residual scatterplot and the relatively high value of R2, the linear model fits this dataset very well.

Example: Run Time by Age

For the run time example, R2 = 0.8475 or 84.75% (this value comes from the unrounded correlation coefficient, so it differs slightly from 0.92²). About 84.75% of the variation in the run times is accounted for by the linear model taking age into account. The linear model fits this dataset quite well.

Simple Linear Regression With a Categorical Predictor

Up to now, we’ve discussed the Simple Linear Regression (SLR) model in the context of a numerical response variable (y) and a numerical explanatory (or predictor) variable (x). It is also possible to incorporate a categorical predictor variable by using a “binary indicator variable.” Section 8.2.8 in the textbook works through an example of SLR with a categorical predictor.

Common Pitfalls in Linear Regression

So far, we have focussed on the mechanics of linear regression. Computers make the mechanics of finding the regression line equation for a given dataset a relatively simple task.

Computers have been programmed to find this “best fitting” linear model (in the least-squares sense) for the data, but is this model of any value? Hopefully it is, but it’s wrong to automatically assume that it is. There are circumstances where the “best fitting” linear model is at best misleading and at worst totally inappropriate.

The tricky job of correctly using and interpreting the calculated regression line requires a little regression “wisdom.”

Here’s a summary of the important “regression wisdom” topics and pitfalls to be familiar with:

  • Pattern changes
  • Extrapolation
  • Summary value regressions
  • Causation
  • Non-linear models
  • Outliers
  • High leverage points
  • Influential points

There are no new formulas or calculations for these topics, but to acquire the desired level of “regression wisdom,” it’s important to carefully go through the examples and exercises both here and in the textbook.

Pattern Changes in Scatterplots

Consider doing a correlation and regression line calculation for the BC birth rate data in Table 1:

Table 1: Birth Rate in British Columbia (BC live births per 1,000 population)
Year   Rate     Year   Rate     Year   Rate
1950 23.85 1968 16.82 1986 13.88
1951 24.10 1969 17.18 1987 13.64
1952 24.75 1970 17.32 1988 13.75
1953 25.44 1971 15.95 1989 13.63
1954 25.44 1972 15.42 1990 13.78
1955 25.44 1973 14.92 1991 13.44
1956 25.91 1974 14.92 1992 13.27
1957 26.14 1975 14.91 1993 12.87
1958 25.73 1976 14.53 1994 12.72
1959 25.51 1977 14.71 1995 12.34
1960 25.04 1978 14.72 1996 11.84
1961 23.69 1979 14.95 1997 11.21
1962 22.97 1980 15.19 1998 10.72
1963 22.06 1981 15.19 1999 10.36
1964 20.57 1982 15.40 2000 9.97
1965 18.74 1983 15.30 2001 9.90
1966 17.35 1984 15.47 2002 9.70
1967 16.91 1985 14.38 2003 9.72
Data source: (British Columbia Vital Statistics Agency, 2003)
  • Regression line: \widehat{livebirths}=627.253-0.3088\,year
  • Correlation: r = –0.9404
  • R2 = 0.8844

Looking only at the large magnitude of the correlation coefficient and the high value of R2, we could be fooled into thinking that the linear model is appropriate for this data. If we then use the regression line to predict the birth rate for the year 2010, we get 627.253 − 0.3088(2010) ≈ 6.57 live births per 1,000 population. This prediction seems a little on the low side.

Why should we be sceptical about the accuracy of this prediction?

  1. We have jumped into a linear regression calculation without first looking at the scatterplot to see if the linear model is appropriate.
  2. The year 2010 is outside our dataset, so we are extrapolating. Risky!

Figure 11 shows the scatterplot for this data.

scatterplot - bc births
Figure 11: Scatterplot of births per 1,000 population by year in BC

What patterns do we see in the scatterplot?

There is an overall downward trend to birth rate, but there are many sub-patterns in this dataset:

  • A “baby boom” in the 1950s
  • A rapidly falling birth rate through the early 1960s
  • A levelling-off in the 1970s through the mid-1980s
  • A slow downward trend beginning around 1985
  • A possible levelling-off in the 2000s

The scatterplot tells the story. A single overall linear model is not reasonable for this dataset: Each sub-era should get its own model.

So, what was the actual live birth rate for 2010?

It was 9.64, barely less than in 2003, and much higher than the value we got from the linear model (6.57, see the calculation above), based on data from 1950–2003.

Risks of Extrapolation

For many datasets, it’s natural to want to make predictions for x-values outside the actual x-interval of the dataset. Such predictions are called extrapolations. Extrapolations can be risky. Why? The basic risk is that there could be a fundamental change in the model as we move outside the x-interval of our dataset.

extrapolation
Figure 12: Extrapolation outside the dataset’s x-interval

For example, suppose for the BC birth rate data someone back in 1957 tried to predict the birthrate for 1965 using a regression line based on the 1950 to 1957 birthrate data. The scatterplot for the 1950 to 1957 data suggests that a linear model for this time interval is appropriate.

extrapolation - bc birthrate data
Figure 13: Extrapolation from BC birth rate data (British Columbia Vital Statistics Agency, 2003)
  • Regression line (1950 to 1957): \widehat{livebirths}=-606.265+0.3232\,year
  • Correlation: r = 0.9601
  • R2 = 0.9217

Using this line to predict the birthrate for 1965 gives -606.265 + 0.3232(1965) ≈ 28.82 live births per 1,000 population. Of course, this extrapolated prediction is way off the actual 1965 birthrate of 18.74 because, as history shows, the baby boom sharply declined in the early 1960s.

Now, to predict BC’s birthrate in year 2010, what model should we use?

There is no right or wrong answer here, but one reasonable approach would be to use the data from year 2000 onward. This gives 199.965 − 0.095(2010) ≈ 9.015 live births per 1,000 population as an extrapolated estimate. Seems plausible, and not far off the actual value of 9.64!
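As a check, here is a minimal Python sketch that fits the 2000–2003 values from Table 1; it reproduces the coefficients used in the prediction above.

    import numpy as np

    year = np.array([2000, 2001, 2002, 2003])
    rate = np.array([9.97, 9.90, 9.70, 9.72])   # BC live births per 1,000 (Table 1)

    b1, b0 = np.polyfit(year, rate, deg=1)
    print(round(b0, 3), round(b1, 3))           # about 199.965 and -0.095
    print(round(b0 + b1 * 2010, 3))             # extrapolated 2010 estimate: about 9.015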

Regression on Summary Data

Here’s a nicely “cooked” example designed to exaggerate the potentially misleading conclusions from a regression on summary data.

“Raw” Data

summary data example - raw data
Figure 14: Raw data for price (dollars) and age of vehicles (1, 2, 3, or 4 years old) 
  • Regression line: \widehat{price}=20943-1857\,age
  • Strength of the linear relationship: r ≈ – 0.73 and R2 ≈ 0.53

“Summary” Data

Now let’s redo this linear regression using the mean price for each of the four ages (instead of the individual car prices).

summary data example - summary data
Figure 15: Summary data using the mean price of vehicles by age (1, 2, 3, or 4 years old)
  • Regression line: \widehat{price}=21000-1900\,age
  • Strength of the linear relationship: r ≈ – 0.98 and R2 ≈ 0.96

The two regression lines are very similar, but there is a considerable difference in the implied strength of the linear relationship (a moderately strong linear relationship, r ≈ – 0.73, for the regression on the individual car prices vs. an almost perfect linear relationship, r ≈ – 0.98, for the regression on the mean car prices).

So, what is the issue?

Be wary of claims about strong linear relationships based on regressions on group means. Averaging removes the variability of the individual measurements, so the reduced scatter of the group means typically inflates the apparent strength of the linear relationship.
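Here is a tiny hypothetical illustration of that inflation (the prices below are made up for illustration; they are not the data behind Figures 14 and 15).

    import numpy as np

    # Three hypothetical prices (dollars) for each age (years)
    age = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
    price = np.array([19000, 21500, 23000, 16000, 18500, 20500,
                      14500, 16500, 19000, 12500, 15000, 17000])

    r_raw = np.corrcoef(age, price)[0, 1]           # about -0.79 for the individual cars

    ages = np.array([1, 2, 3, 4])
    mean_price = np.array([price[age == a].mean() for a in ages])
    r_means = np.corrcoef(ages, mean_price)[0, 1]   # about -0.99 for the group means

    print(r_raw, r_means)

The underlying trend is the same, but the means alone make the relationship look far stronger than it is for individual cars.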

Risk of Interpreting Linear Models as Causal

When regression analysis reveals a strong linear relationship between two variables, it is very tempting to interpret the relationship as causal. However tempting, it is wrong to conclude that changes in the explanatory variable (x) are causing changes in the response variable (y) from the regression analysis alone.

Sometimes a causal interpretation is valid and clear from our understanding of the variables and the design of the study. For example:

  • x = number of alcoholic drinks consumed by a person
  • y = blood alcohol level of the person

Clearly, increasing the value of x causes an increase in the value of y.

Sometimes the causal interpretation is obviously wrong because it is clear that x and y are simply responding to another variable, called a lurking variable, that is not explicitly controlled for in the study. Remember this example from Supplementary Notes 6.1?

  • x = summertime daily number of visitors to Vancouver’s beaches
  • y = summertime daily volume of water drawn from Vancouver’s reservoirs

Here, it was clearly absurd to conclude a causal relationship because it was obvious that these two variables are each responding to a lurking variable: summertime daily temperature. Hot days cause higher values of both x and y, and cool days cause lower values of both x and y.

However, in many cases a discovered linear relationship between two variables is simply a starting point that leads to follow-up studies that have been properly designed to reveal whether the relationship is causal. In Lesson 1.2, we explored some of the techniques and issues related to designing experiments that can lead to valid causal conclusions. For example:

  • x = daily dosage level of some herbal extract
  • y = number of colds a person gets in a year

If the study data was collected from people who had decided on their own to try the herbal extract and who set their own dosage level, it would be wrong to conclude a causal relationship. We identified this type of study as an observational study.

However, if the study was designed in such a way that the researcher initially randomly assigned subjects to different dosage levels, a causal conclusion is possible (subject to other experimental design issues being met). We identified this type of study as an experimental study.

Possible Effects of Outliers

In statistical analyses, outlier or unusual points are always a concern because of the possibility that they can unduly distort the analysis. In regression analysis the effect of these points is very sneaky. Sometimes they produce dramatic changes in the proposed linear model; other times they don’t. In fact, in some cases they actually appear to strengthen the model!

The following terminology helps us characterize these unusual points.

  • Outlier point: A data point that stands apart from the rest of the point cloud in the scatterplot.
  • High leverage: A data point has “high leverage” if it has an x-value that is far away from the mean of the x-values.
  • Influential: A data point is “influential” if the slope of the regression line model changes considerably depending on whether the point is included or excluded from the analysis.

The following example provides a concrete feel for how individual points can affect the linear model. As we go through the various cases, confirm the regression and correlation results using jamovi.

Example: Understanding Outliers

For this small dataset of four points, we get the following linear model results:

outlier example - 1
Figure 16: Outlier example with regression line 
  • Download outliers [CSV file] and open the data in jamovi.
  • Go to the Data tab and double-click the column headers to change the variable measure types to Continuous.
  • Select Analyses > Exploration > scatr > Scatterplot, move x to X-axis, move y to Y-axis, and under Regression Line select Linear.
  • Select Analyses > Regression > Linear Regression, move y to Dependent Variable, move x to Covariates, and under Save select Residuals.
  • Regression line: \hat{y}=2.8+0.4x
  • Correlation: r = 0.316
  • Point of averages: (3, 4) since \overline{x}=12/4=3 and \overline{y}=16/4=4

Now let’s explore the effect on the linear model results by adding a fifth point to the dataset. Try to anticipate the effect of each alternative fifth point before actually reading the discussion and going through the calculations.
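If you would like to experiment outside jamovi, the following minimal Python sketch does the same bookkeeping. It assumes the four original points are (1, 2), (2, 6), (4, 2), and (5, 6), which reproduce the slope, intercept, correlation, and point of averages quoted above; check them against the actual outliers CSV before relying on them.

    import numpy as np

    # Assumed original four points (they reproduce slope 0.4, intercept 2.8, r = 0.316)
    x = [1.0, 2.0, 4.0, 5.0]
    y = [2.0, 6.0, 2.0, 6.0]

    def summarize(x, y):
        b1, b0 = np.polyfit(x, y, deg=1)    # least squares slope and intercept
        r = np.corrcoef(x, y)[0, 1]         # correlation coefficient
        print(f"line: y-hat = {b0:.2f} + {b1:.2f}x,  r = {r:.3f},  "
              f"point of averages = ({np.mean(x):.1f}, {np.mean(y):.1f})")

    summarize(x, y)                   # the original four points
    summarize(x + [8.0], y + [9.0])   # case 6 below: add the point (8, 9)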

  1. Add the point (3, 4):
    • What’s special about the point (3, 4)? It is the point of averages for the original scatter.
    • Go to the Data tab and type “3” in the first empty cell in the “x” column and type “4” in the first empty cell in the “y” column.
    • The scatterplot with fitted regression line and the regression model results should automatically update.
    • New regression line: \hat{y}=2.8+0.4x
    • New correlation: r = 0.316
    • New point of averages: (3, 4) \overline{x}=15/5=3 and \overline{y}=20/5=4
    outliers example - 2
    Figure 17: The new point (3, 4) (shown as a square) is not an outlier, does not have high leverage, and is not influential; (solid line) regression line. 

    Remember that the original regression line goes through the point of averages (3, 4), and by adding it as a data point, there is no change to the point of averages, the regression line, or the correlation. The size of the residual for this new point is 0 since it is right on the new (and old) regression line.

  2. Add the point (8, 6):
    • What’s special about the point (8, 6)? It’s an unusual point in that the x value of this point is considerably larger than the mean of the x values. However, it is perfectly consistent with the original linear model since the original regression line passes through the point (8, 6).
    • Go to the Data tab and change the fifth data point to “8” for “x” and “6” for “y.”
    • The scatterplot with fitted regression line and the regression model results should automatically update.
    • New regression line: \hat{y}=2.8+0.4x
    • New correlation: r = 0.5
    • New point of averages: (4, 4.4) \overline{x}=20/5=4 and \overline{y}=22/5=4.4
    outlier example - 3
    Figure 18: The new point (8, 6) (shown as a square) is an outlier with high leverage but it is not influential: (solid line) regression line, (X) new point of averages.  

    There is no change to the regression line model since the new point (8, 6) and the new point of averages (4, 4.4) fall exactly on the old regression line. However, the correlation coefficient has increased from r = 0.316 to r = 0.5. Based on the increase in r, this one new point seemingly has strengthened the linear relationship considerably. The size of the residual for this new point is 0 since it is right on the new (and old) regression line.

  3. Add the point (10, 7):
    • What’s special about the point (10, 7)? It’s an unusual point in that the x value of this point is considerably larger than the mean of the x values. However, it is consistent with the original linear model since the original regression line passes close to the point (10, 7).
    • Go to the Data tab and change the fifth data point to “10” for “x” and “7” for “y.”
    • The scatterplot with fitted regression line and the regression model results should automatically update.
    • New regression line: \hat{y}=2.740+0.423x
    • New correlation: r = 0.616
    • New point of averages: (4.4, 4.6) \overline{x}=22/5=4.4 and \overline{y}=23/5=4.6
    outlier example - 4
    Figure 19: The new point (10, 7) (shown as a square) is an outlier with high leverage but it is not influential: (solid line) new regression line, (dash line) old regression line, (X) new point of averages. 

    There is little change to the regression line model since the new point (10, 7) and the new point of averages (4.4, 4.6) are both nearly on the old regression line. However, the correlation coefficient has increased dramatically from r = 0.316 to r = 0.616. Based on the increase in r, this one new point seemingly has strengthened the linear relationship considerably. The size of the residual for this new point is small (0.033) since it is just above the new regression line.

  4. Add the point (3, 8):
    • What’s special about the point (3, 8)? It’s an unusual point in that the y value of this point is considerably larger than the mean of the y values, although the x value is right on the mean of the x values.
    • Go to the Data tab and change the fifth data point to “3” for “x” and “8” for “y.”
    • The scatterplot with fitted regression line and the regression model results should automatically update.
    • New regression line: \hat{y}=3.6+0.4x
    • New correlation: r = 0.236
    • New point of averages: (3, 4.8) since \overline{x}=15/5=3 and \overline{y}=24/5=4.8
    outlier example - 5
    Figure 20: The new point (3, 8) (shown as a square) is an outlier, not high leverage and not influential: (solid line) new regression line is above (dash line) old regression line, (X) new point of averages. 

    The point (3, 8) has “pulled” the regression line model upwards. Why? Remember that the regression line always goes through the point of averages, and the new point of averages is (3, 4.8), which is directly above the old point of averages (3, 4). The slope of the line remains unchanged. The correlation coefficient has decreased from r = 0.316 to r = 0.236, indicating a weaker linear relationship. The size of the residual for this new point is large (8 − 4.8 = 3.2), indicating that it is well above the new regression line.

  5. Add the point (1, 7):
    • What’s special about the point (1, 7)? It’s an unusual point in that the x value of this point is somewhat below the mean of the x values, and the y value is well above the mean of the y values.
    • Go to the Data tab and change the fifth data point to “1” for “x” and “7” for “y.”
    • The scatterplot with fitted regression line and the regression model results should automatically update.
    • New regression line: \hat{y}=4.7576-0.0606x
    • New correlation: r = –0.0457 (it must be negative because the slope of the line is negative).
    • New point of averages: (2.6, 4.6) \overline{x}=13/5=2.6 and \overline{y}=23/5=4.6
    outlier example - 6
    Figure 21: The new point (1, 7) (shown as square), (solid line) new regression line, (dash line) old regression line, (X) new point of averages. 

    The point (1, 7) has produced a dramatic change in the regression line model, tipping it downwards since the new point of averages (2.6, 4.6) is higher and to the left of the old point of averages (3, 4). Remember that the new regression line is supposed to fit the new scatter as well as it possibly can, plus it must go through the new point of averages. The correlation coefficient has decreased from r = 0.316 to r = –0.0457, indicating a very weak negative linear relationship. The size of the residual for this new point is moderately large (7 − 4.7 = 2.3), indicating that it is above the new regression line.

  6. Add the point (8, 9): 
    • What’s special about the point (8, 9)? It’s an unusual point in that both the x and y values of this point are considerably larger than the respective means of the dataset.
    • Go to the Data tab and change the fifth data point to “8” for “x” and “9” for “y.”
    • The scatterplot with fitted regression line and the regression model results should automatically update.
    • New regression line: \hat{y}=1.8+0.8x
    • New correlation: r = 0.730
    • New point of averages: (4, 5) \overline{x}=20/5=4 and \overline{y}=25/5=5
    outlier example - 7
    Figure 22: The new point (8, 9) (shown as square) is an outlier with high leverage and influence: (solid line) new regression line, (dash line) old regression line, (X) new point of averages.

    The point (8, 9) has produced a considerable change in the regression line model, tipping it upwards since it must pass through the new point of averages (4, 5). The correlation coefficient has increased dramatically from r = 0.316 to r = 0.730, indicating a moderately strong linear relationship. The size of the residual for this new point is quite small (9 − 8.2 = 0.8), indicating that it is just above the new regression line.

  7. Add the point (10, 0):
    • What’s special about the point (10, 0)? It’s an unusual point in that the x value is unusually large, and the y value is unusually small relative to the respective means of the dataset.
    • Go to the Data tab and change the fifth data point to “10” for “x” and “0” for “y.”
    • The scatterplot with fitted regression line and the regression model results should automatically update.
    • New regression line: \hat{y}=4.846-0.374x
    • New correlation: r = –0.489
    • New point of averages: (4.4, 3.2) \overline{x}=22/5=4.4 and \overline{y}=16/5=3.2
    outlier example - 8
    Figure 23: The new point (10, 0) (shown as square) is an outlier with high leverage and influence: (solid line) new regression line, (dash line) old regression line, (X) new point of averages.

    The point (10, 0) has produced a very dramatic change in the regression line model, tipping it downwards since it must pass through the new point of averages (4.4, 3.2). The correlation coefficient has changed dramatically from r = 0.316 to r = – 0.489, indicating a moderate negative linear relationship. The size of the residual for this new point is quite small (0 − 1.106 = –1.106), indicating that it is below the new regression line.

This example has given us a feel for how unusual points can affect the regression line and correlation coefficient. Keep in mind, though, that this example was based on a very small dataset. With small datasets, unusual points can have a much more pronounced effect on the model than with large datasets.

The following graphs (Fig. 24) summarize the main ideas and terminology.

different types of outliers
Figure 24: Different types of outliers: (top) outlier has high leverage, not influential and small residual; (centre) outlier has high leverage and influence and large residual; (bottom) outlier has no leverage or influence, large residual. 

Summary

When we detect unusual points in a regression analysis, we should run the analysis both with, and without, these unusual points. The subsequent discussion of the linear model should include comments about these unusual points and how these points affect the model:

  • Did the slope change much?
  • Did the y-intercept change much?
  • Did the correlation coefficient change much?
  • Do these points have large or small residuals?

References

British Columbia Vital Statistics Agency. (2003). Selected vital statistics and health status indicators annual report 2003. Ministry of Health Services, Government of British Columbia. http://www.vs.gov.bc.ca/stats/annual/2003/index.html

License


Introduction to Probability and Statistics Copyright © 2023 by Thompson Rivers University is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
