by Xiangyu Wang
with Greg Page
Leverage in Simple Linear Regression: Overview
In this article, we will discuss high leverage points in simple linear regression (SLR). Simply put, high leverage points in linear regression are those with extremely unusual independent variable values in either direction from the mean (large or small). Such points are noteworthy because they have the potential to exert considerable “pull”, or leverage, on the model’s best-fit line.
The mathematical formula used to calculate the leverage score for any particular input value in an SLR model is shown here:
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$
Further down in this post, we will show a step-by-step breakdown of how this formula can be used to calculate the leverage values for particular observations.
In Multiple Linear Regression (MLR), the general concept of leverage remains the same: high leverage observations are those with extreme input values, relative to the rest of the dataset. In MLR, however, each observation's inputs are a combination of several variables' values, so leverage is calculated with a different formula, based on the diagonal of the model's hat matrix.
The dataset used in this article, vehicles.csv, contains information about more than 400,000 used car listings in the United States, scraped from the website Craigslist. The dataset is available on Kaggle. Here, we will use it to build an SLR model, with price (measured in dollars) as the outcome, and odometer (measured in miles) as the input variable.
Building the SLR Model: Price as a Function of Odometer Mileage
library(tidyverse)  # attaches dplyr (filter, %>%) and tidyr (drop_na)
used_car <- read.csv('vehicles.csv')
Since linear regression modeling relies on complete cases of data, we started out by removing all rows that contained “NA” values for odometer, using the drop_na() function from tidyr:
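A minimal sketch of this step follows. It runs on a small, made-up data frame (toy_cars, a hypothetical stand-in for used_car) so that it can be run on its own; with the real data, the same drop_na() call would be applied to used_car.

```r
library(tidyr)

# Hypothetical stand-in for used_car: three listings, one with a missing odometer value
toy_cars <- data.frame(
  price    = c(8500, 12000, 4300),
  odometer = c(120000, NA, 185000)
)

# Drop only the rows where odometer is NA; NAs in other columns would be left alone
toy_cars_clean <- drop_na(toy_cars, odometer)

# Number of rows removed by the step
nrow(toy_cars) - nrow(toy_cars_clean)
```

Naming the column inside drop_na() matters: calling drop_na() with no column would discard every row that has an NA in any variable, which would remove far more data than the model needs to lose.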
The step shown above removed more than 55,000 rows, or approximately 12% of the original data.
Next, we filtered the data so that it only contained vehicles whose odometer readings were greater than 0 and less than 250,000 miles, and whose prices were positive values less than $40,000 (USD). While this cleared away approximately 13.6% of the remaining data, it helped us remove several erroneous values (the original dataset included a car with more than two billion miles, and another with a three billion dollar price tag!).
used_car <- used_car %>% filter(odometer > 0 & odometer < 250000 & price > 0 & price < 40000)
With those data preparation steps now complete, we built the car_model SLR model using the lm() function:
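A minimal sketch of that call is shown below, assuming lm() is given price as the outcome and odometer as the only predictor, as described earlier. Simulated data (sim_cars, with invented values) stands in for the real file so that the snippet runs by itself, and the name car_model_sim is used to avoid confusion with the article's car_model.

```r
set.seed(1)

# Simulated stand-in for the cleaned used_car data (all values are invented)
sim_cars <- data.frame(odometer = runif(500, min = 1, max = 250000))
sim_cars$price <- 25000 - 0.08 * sim_cars$odometer + rnorm(500, sd = 3000)

# Simple linear regression: price as a function of odometer
car_model_sim <- lm(price ~ odometer, data = sim_cars)

# Coefficient table, including the estimate and p-value for the odometer slope
summary(car_model_sim)
```

With the real data, the equivalent call would be car_model <- lm(price ~ odometer, data = used_car), followed by summary(car_model).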
From the summary function shown above, we can see that these variables have a statistically significant relationship. The negative coefficient on odometer indicates that the predicted resale price falls slightly with each additional mile driven, which fits with our intuitive expectation.
Identifying High Leverage Points using the broom package
The augment() function from the broom package creates a handy dataframe that contains record-by-record diagnostic information about our regression model:
library(broom)
model_data <- augment(car_model)
As you can see, augment() delivers quite a bit of information about each observation as it relates to the model. For this article, we will stay narrowly focused on leverage, which is shown here as .hat.
To see the highest leverage points, we can use the arrange() function, as follows:
model_data %>% arrange(desc(.hat))
To see where these values really come from, let’s revisit the formula that we showed at the beginning of this article:
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$
In the formula, the leverage value for a particular observation is represented by $h_i$. In the first term after the equals sign, $1/n$, the $n$ represents the number of observations used in the model. In the second term, the numerator $(x_i - \bar{x})^2$ is the squared difference between that observation's value and the mean value of the independent variable. Finally, the denominator of that fraction is the sum of the squared differences from the mean among all the observations in the entire dataset.
n <- nrow(used_car)
xbar <- mean(used_car$odometer)
sumsquares <- sum((used_car$odometer - xbar)^2)
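The final step, plugging these three quantities into the formula, might look like the sketch below. It uses simulated odometer and price values (invented for illustration) so that it runs on its own, and it checks the hand calculation against base R's hatvalues() function, which returns the same leverage scores that augment() reports as .hat.

```r
set.seed(42)

# Simulated stand-in for used_car (all values are invented for illustration)
sim_cars <- data.frame(odometer = runif(200, min = 1, max = 250000))
sim_cars$price <- 25000 - 0.08 * sim_cars$odometer + rnorm(200, sd = 3000)

n <- nrow(sim_cars)
xbar <- mean(sim_cars$odometer)
sumsquares <- sum((sim_cars$odometer - xbar)^2)

# Leverage for one particular odometer value, here the largest one in the data
x_i <- max(sim_cars$odometer)
h_i <- 1 / n + (x_i - xbar)^2 / sumsquares

# The same formula applied to every observation at once
h_manual <- 1 / n + (sim_cars$odometer - xbar)^2 / sumsquares

# Compare with the leverage values R computes from the fitted model
sim_model <- lm(price ~ odometer, data = sim_cars)
all.equal(h_manual, unname(hatvalues(sim_model)))
```

With the article's data, x_i would be set to 249961, and n, xbar, and sumsquares would come from used_car as computed above.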
Similarly, we could determine the leverage value for any other point in the dataset by substituting its odometer value for the one used above, which belongs to the car whose odometer reading was 249,961. From the formula, we can see that points whose values are farther from the mean will have higher leverage scores, and that points closer to the mean will have lower ones.
For any particular observation, the .hat value must fall between 1/n and 1. In any SLR model, the average leverage score across all the observations is (p+1)/n, where p is the number of predictors (p = 1 in SLR). Here, that’s 2/348134, or about 0.0000057.
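Both properties are easy to check numerically. The sketch below does so on simulated data (invented for illustration) rather than on the article's dataset, using base R's hatvalues() to obtain the leverage scores.

```r
set.seed(7)

# Simulated predictor and outcome (all values are invented for illustration)
x <- runif(1000, min = 1, max = 250000)
y <- 25000 - 0.08 * x + rnorm(1000, sd = 3000)

h <- hatvalues(lm(y ~ x))
n <- length(h)

mean(h)       # average leverage across all observations
2 / n         # (p + 1) / n with p = 1 predictor
range(h)      # smallest and largest leverage values
c(1 / n, 1)   # theoretical lower and upper bounds
```

The first two printed values should agree, and the observed range should sit inside the theoretical bounds.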
So what should be considered a high leverage point? As with many such questions in statistics, there is no single, definitive answer. However, a common rule is to flag any observation whose leverage score is more than three times the mean leverage value as a high leverage point.
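With the mean leverage of 2/348134 for this model, that rule works out to a cutoff of roughly:

$$3 \times \frac{2}{348134} = \frac{6}{348134} \approx 1.72 \times 10^{-5}$$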
We can filter such data points into a new dataframe with the following command:
high_lev <- filter(model_data,.hat>3*mean(model_data$.hat))
The 5,332 observations that belong to high_lev represent just over 1.5% of the total number of records used to build the model. A summary of the records in high_lev shows that it consists of observations with high values for odometer. While high leverage points can theoretically come from observations considerably above or below the mean, the distribution of the data plays a role in determining which observations depart most from the average. For this dataset, the mean odometer value among the model inputs was 94,658.76. With a lower bound of 0 and an upper bound of 250,000, inputs have more room to deviate on the high side than on the low side, and the summary of the odometer values in high_lev confirms that the major deviations here come from higher odometer numbers.
High Leverage Points: Potentially, but not Necessarily, Influential
To reiterate something written in this article’s first paragraph, high leverage points are noteworthy for their potential to influence the regression model. That potential is not always realized: some high leverage points have observed outcome values that fall close to the regression line, and therefore do not exert a major influence on the model.
To illustrate this principle, let’s take a look at two histograms. The first shows the original model’s residuals, generated from all 348,134 observations. The second shows only the residuals generated by the 5,332 high leverage observations from the original model.
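One way to draw such a pair of plots is sketched below. The use of base R's hist() is an assumption, since the original figures may have been produced with a different plotting function, and the data is simulated (invented values) so that the snippet runs on its own.

```r
set.seed(3)

# Simulated stand-in for the article's data (all values are invented)
x <- rexp(2000, rate = 1 / 60000)   # right-skewed, loosely like odometer readings
y <- 25000 - 0.08 * x + rnorm(2000, sd = 3000)
fit <- lm(y ~ x)

res <- residuals(fit)
lev <- hatvalues(fit)
is_high <- lev > 3 * mean(lev)      # flag leverage above three times the mean

# One histogram for all residuals, one for the high leverage subset only
par(mfrow = c(1, 2))
hist(res, main = "All observations", xlab = "Residual")
hist(res[is_high], main = "High leverage only", xlab = "Residual")
```

With the article's objects, the equivalent calls would be hist(model_data$.resid) and hist(high_lev$.resid). Because the simulated errors here are unrelated to x, these two toy histograms will look broadly alike; the sketch shows only the mechanics, not the pattern found in the real data.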
When we view the residuals from the entire set of records, we observe a mostly symmetric distribution. The histogram depicting the residuals from the high leverage subset is far more skewed, with an average value much further from zero. However, the outsize height of the second histogram’s tallest bar tells us that most of the high leverage points do not exert a tremendous degree of influence upon the model (if this were a much smaller dataset, this conclusion would be harder to make, as we would have to also consider whether a high leverage observation with a low residual was closely fitting the line because it influenced the line so much).
As for measuring the influence of particular points, there are other metrics that serve such a purpose. Among these is Cook’s Distance, a metric based on an observation’s leverage and residual value. Since the scope of this article was to define and illustrate leverage for SLR models, we will not delve further into the influence of specific observations here.
The author is a Master’s student in Applied Business Analytics at Boston University, Metropolitan College. He will graduate in Spring 2021. His co-author is a Senior Lecturer in Applied Business Analytics at Boston University.