Lecture Notes on Linear Regression
Created by Vinay Kanth Rao Kodipelly
Introduction — Real-World Applications
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable (outcome) and one or more independent variables (predictors) by fitting a linear equation to the observed data. It's widely used for prediction and understanding relationships. Common uses include:
- Housing Prices: Estimate market value based on features like square footage, number of bedrooms, and age.
- Stock Analysis: Model stock returns based on economic indicators or market trends.
- Health Metrics: Empirically investigate relationships, such as the common rule of thumb that maximum heart rate is approximately \(220\) minus age (\(\text{MaxRate} \approx 220 - \text{Age}\)).
- Marketing Spend: Project sales revenue based on advertising expenditure in different channels.
- Educational Research: Predict student performance based on study hours or previous grades.
Interpretation of Intercept and Slope
In the simple linear regression equation:
$$y = b_0 + b_1 x$$
- \(b_0\): The y-intercept — represents the predicted value of the dependent variable \(y\) when the independent variable \(x\) is equal to zero. Geometrically, it's the point where the regression line crosses the y-axis. Caution is needed when interpreting the intercept if \(x=0\) is outside the range of observed data or doesn't have a meaningful interpretation (e.g., height=0).
- \(b_1\): The slope — represents the estimated average change in the dependent variable \(y\) for a one-unit increase in the independent variable \(x\). It indicates the direction (positive or negative) and steepness of the linear relationship (rise over run).
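To make these interpretations concrete, here is a minimal R sketch using a small made-up dataset of study hours and exam scores (the numbers are purely illustrative, not from the lecture):
# Hypothetical data: hours studied (x) and exam score (y)
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(55, 61, 64, 70, 74, 79)
fit <- lm(score ~ hours)
coef(fit)
# (Intercept) is b0: the predicted score for a student who studies 0 hours.
# The 'hours' coefficient is b1: the estimated change in score per additional hour.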
Review — Lines and Linear Equations
Slope-Intercept Form Explained
Any non-vertical straight line can be represented by the equation:
$$y = m x + b$$ where:
- \(m\) = slope: measures the steepness of the line (change in y / change in x, or rise/run).
- \(b\) = y-intercept: the value of y where the line crosses the y-axis (i.e., the value of y when \(x=0\)).
Examples of different lines: \(y = 2x + 1\) (slope 2, intercept 1), \(y = -0.5x + 4\) (slope \(-0.5\), intercept 4), and \(y = 3\) (slope 0, a horizontal line).
Line Classification Based on Slope
- If the slope \(m > 0\), the line slopes upward from left to right (positive relationship).
- If the slope \(m < 0\), the line slopes downward from left to right (negative relationship).
- If the slope \(m = 0\), the line is horizontal (no linear relationship between x and y).
- A vertical line has an undefined slope and takes the form \(x = c\) (constant).
Comparing Slope vs. Intercept:
- Slope (\(m\) or \(b_1\)): Describes the rate of change. How much does \(y\) change for a one-unit change in \(x\)?
- Intercept (\(b\) or \(b_0\)): Provides a baseline or starting value for \(y\) when \(x=0\).
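As a quick numerical illustration of rise over run, this short R sketch (points chosen arbitrarily) recovers the slope and intercept of the line through two points:
# Two arbitrary points on a line
x1 <- 1; y1 <- 3
x2 <- 3; y2 <- 7
m <- (y2 - y1) / (x2 - x1)  # rise / run = (7 - 3) / (3 - 1) = 2
b <- y1 - m * x1            # solve y = m*x + b at (x1, y1): 3 - 2*1 = 1
cat("y =", m, "* x +", b, "\n")  # the line y = 2x + 1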
Key Terms in Regression
Residual: The difference between the observed value (\(y_i\)) and the predicted value (\(\hat y_i\)) for a data point \(i\). Residual = \(y_i - \hat y_i\). Residuals represent the errors of the model's predictions.
Outlier: A data point that lies unusually far from the general trend of the rest of the data, often having a large residual. Outliers can potentially distort the regression line.
Influential Observation: A data point whose removal would cause a significant change in the regression equation (slope and/or intercept). Influential points often have high leverage (extreme x-values) and may or may not be outliers.
Time Series Data: Data points collected sequentially over time (e.g., daily stock prices, monthly sales). Regression can be used to model trends in time series data, although specialized methods often account for temporal dependencies.
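As a rough illustration of residuals and influence, the R sketch below uses made-up data in which the last point sits far from the trend of the others; residuals() and cooks.distance() are standard functions for fitted lm objects:
# Made-up data; the point (10, 2) lies far from the trend of the rest
x <- c(1, 2, 3, 4, 5, 10)
y <- c(2, 4, 5, 7, 9, 2)
fit <- lm(y ~ x)
residuals(fit)       # observed minus predicted values for each point
cooks.distance(fit)  # large values flag influential observations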
Why Finding the "Best Fit" Line Is Hard
When dealing with real-world data (scatter plots), the points rarely fall perfectly on a single straight line. There are infinitely many lines that could be drawn through the data cloud. The challenge is to define and find the line that best represents the overall linear trend in the data according to some objective criterion. This leads to methods like Least Squares.
Least Squares Method — Theory and Visualization
The Ordinary Least Squares (OLS) method is the most common technique for finding the best-fitting line through a set of data points \((x_i, y_i)\). It determines the values of the intercept (\(b_0\)) and slope (\(b_1\)) that minimize the sum of squared errors (SSE), i.e., the sum of the squared vertical distances between the observed values and the fitted line.
The objective is to minimize:
$$SSE = \sum_i (y_i - \hat y_i)^2 = \sum_i [y_i - (b_0 + b_1 x_i)]^2$$
This criterion penalizes larger errors more heavily (due to squaring) and ensures that the sum of positive and negative errors doesn't simply cancel out.
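To see the criterion in action, this minimal R sketch (data made up for illustration) computes the SSE for two arbitrary candidate lines and for the least squares line; the OLS line gives the smallest value:
# Made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
# SSE for a candidate line y = b0 + b1*x
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)
sse(0, 1.5)                       # an arbitrary candidate line
sse(1, 1.8)                       # another candidate
fit <- lm(y ~ x)
sse(coef(fit)[1], coef(fit)[2])   # the OLS line: smallest SSE of the three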
Key points about Least Squares:
- It provides a unique, optimal solution for \(b_0\) and \(b_1\) under certain assumptions.
- It forms the basis for estimating trends, making predictions, and understanding relationships.
- Advantages: Widely understood, computationally straightforward, statistically well-grounded (provides unbiased estimates under assumptions), forms the basis for many other statistical techniques.
- Disadvantages: Sensitive to outliers (squaring large errors makes them very influential), assumes a linear relationship, assumes constant variance of errors (homoscedasticity), requires errors to be independent. Simple linear regression only models the relationship between two variables at a time (though multiple regression extends this).
Understanding the fitted-line plot:
- The vertical distances between the observed points and the line are the residuals (errors)
- The fitted line is the "best fit" line that minimizes the sum of squared errors
- The smaller the sum of squared errors, the better the fit
Mathematical Formulas for Slope and Intercept
The least squares estimates for the slope (\(b_1\)) and intercept (\(b_0\)) are calculated using the following formulas, derived by minimizing the SSE using calculus:
Slope:
$$b_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}$$
Intercept:
$$b_0 = \bar y - b_1 \bar x$$
Where:
- \(n\) is the number of data points.
- \(x_i, y_i\) are the individual data points.
- \(\bar x\) is the mean of the x-values (\(\bar x = \frac{\sum x_i}{n}\)).
- \(\bar y\) is the mean of the y-values (\(\bar y = \frac{\sum y_i}{n}\)).
The Sum of Squared Errors (SSE) for the fitted line is:
$$SSE = \sum_{i=1}^n [y_i - (b_0 + b_1 x_i)]^2$$
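These formulas can be verified directly in R. A minimal sketch (same made-up data as above) computes \(b_1\) and \(b_0\) from the deviation sums and compares them with lm():
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
# Slope from the deviation form of the formula
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))  # should agree with b0 and b1 above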
Regression via \(S_{xx}\), \(S_{yy}\), & \(S_{xy}\) Notation
It is often convenient to use summary statistics notation:
$$S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$
$$S_{yy} = \sum_{i=1}^n (y_i - \bar y)^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$
$$S_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$
Using this notation, the formulas become:
Slope: $$b_1 = \frac{S_{xy}}{S_{xx}}$$
Intercept: $$b_0 = \bar y - b_1 \bar x$$
This notation simplifies calculations, especially when done by hand or with a basic calculator.
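The same calculation written with the computational formulas for \(S_{xx}\), \(S_{yy}\), and \(S_{xy}\) looks like this in R (again with the illustrative data above):
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n   <- length(x)
Sxx <- sum(x^2) - sum(x)^2 / n
Syy <- sum(y^2) - sum(y)^2 / n
Sxy <- sum(x * y) - sum(x) * sum(y) / n
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)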
Hand-Solved Example: Days vs Weight
Let's find the regression line for the following data relating Days (\(x\)) to Weight (\(y\)):
Data:
x (Days): 112, 123, 127, 129, 140, 142, 150 (n=7)
y (Weight): 92, 94, 96, 89, 90, 93, 95
- Calculate Means:
\(\sum x_i = 923 \implies \bar x = 923 / 7 \approx 131.857\)
\(\sum y_i = 649 \implies \bar y = 649 / 7 \approx 92.714\)
- Calculate \(S_{xx}\) and \(S_{xy}\) (using the computational formulas):
\(\sum x_i^2 = 112^2 + ... + 150^2 = 122707\)
\(\sum y_i^2 = 92^2 + ... + 95^2 = 60211\) (needed only if \(S_{yy}\) or \(r\) is computed later)
\(\sum x_i y_i = 112 \times 92 + ... + 150 \times 95 = 85595\)
\(S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 122707 - \frac{(923)^2}{7} \approx 122707 - 121704.143 = 1002.857\)
\(S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 85595 - \frac{(923)(649)}{7} \approx 85595 - 85575.286 = 19.714\)
- Calculate Slope (\(b_1\)):
\(b_1 = \frac{S_{xy}}{S_{xx}} \approx \frac{19.714}{1002.857} \approx 0.01966\)
- Calculate Intercept (\(b_0\)):
\(b_0 = \bar y - b_1 \bar x \approx 92.714 - (0.01966 \times 131.857) \approx 92.714 - 2.592 \approx 90.122\)
- Regression Equation:
\(\hat y = 90.122 + 0.01966 x\)
R Code for this Example:
# Save the Days vs Weight data
x_days <- c(112, 123, 127, 129, 140, 142, 150)
y_weight <- c(92, 94, 96, 89, 90, 93, 95)
# Fit the linear model
model_days_weight <- lm(y_weight ~ x_days)
# View the summary (includes coefficients, R-squared, etc.)
summary(model_days_weight)
# Coefficients:
# (Intercept) x_days
# 90.1222 0.01966 # Matches hand calculation
# Plot the data and the regression line
plot(x_days, y_weight, main="Days vs Weight", xlab="Days", ylab="Weight", pch=16, col="blue")
abline(model_days_weight, col="red", lwd=2)
grid()
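An optional follow-up: the SSE and \(R^2\) for this model can be pulled straight from the fitted object (for these data the linear relationship is weak, so expect a small \(R^2\)):
# Sum of squared errors (residual sum of squares)
sum(residuals(model_days_weight)^2)
# Coefficient of determination
summary(model_days_weight)$r.squared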
Detailed Worked Example
Consider the simple data set: (1, 4), (2, 1), (3, 7). Let's find the regression line \(\hat y = b_0 + b_1 x\).
- Compute Means:
\(n = 3\)
\(\sum x_i = 1 + 2 + 3 = 6 \implies \bar x = 6 / 3 = 2\)
\(\sum y_i = 4 + 1 + 7 = 12 \implies \bar y = 12 / 3 = 4\)
- Compute Deviations and Sums for \(S_{xx}\) and \(S_{xy}\):
We need \(\sum (x_i - \bar x)^2\) and \(\sum (x_i - \bar x)(y_i - \bar y)\). Let's use a table:

| \(x_i\) | \(y_i\) | \(x_i - \bar x\) | \(y_i - \bar y\) | \((x_i - \bar x)^2\) | \((x_i - \bar x)(y_i - \bar y)\) |
|---|---|---|---|---|---|
| 1 | 4 | 1 - 2 = -1 | 4 - 4 = 0 | (-1)² = 1 | (-1)(0) = 0 |
| 2 | 1 | 2 - 2 = 0 | 1 - 4 = -3 | (0)² = 0 | (0)(-3) = 0 |
| 3 | 7 | 3 - 2 = 1 | 7 - 4 = 3 | (1)² = 1 | (1)(3) = 3 |

Sums: \(S_{xx} = 2\), \(S_{xy} = 3\)
- Compute Slope (\(b_1\)):
\(b_1 = \frac{S_{xy}}{S_{xx}} = \frac{3}{2} = 1.5\)
- Compute Intercept (\(b_0\)):
\(b_0 = \bar y - b_1 \bar x = 4 - (1.5)(2) = 4 - 3 = 1\)
- Regression Equation:
The least squares regression line is \(\hat y = 1 + 1.5x\).
- Compute SSE and \(R^2\) (optional, for model fit):
\(S_{yy} = \sum (y_i - \bar y)^2 = 0^2 + (-3)^2 + 3^2 = 0 + 9 + 9 = 18\).
\(SSE = S_{yy} - b_1 S_{xy} = 18 - (1.5)(3) = 13.5\).
\(R^2 = \frac{S_{xy}^2}{S_{xx}S_{yy}} = \frac{3^2}{2 \times 18} = \frac{9}{36} = 0.25\).
R Code for this Example:
# Save the simple data
x_simple <- c(1, 2, 3)
y_simple <- c(4, 1, 7)
# Fit the linear model
model_simple <- lm(y_simple ~ x_simple)
# View the summary
summary(model_simple)
# Coefficients:
# (Intercept) x_simple
# 1.0 1.5 # Matches hand calculation
# Plot data and regression line
plot(x_simple, y_simple, main="Simple Example Fit", xlim=c(0,4), ylim=c(0,8), pch=16, col="blue")
abline(model_simple, col="red", lwd=2)
grid()
Correlation Coefficient (r)
The Pearson correlation coefficient (r) measures the strength and direction of the linear association between two continuous variables (\(x\) and \(y\)). It ranges from -1 to +1.
- \(r = +1\): Perfect positive linear relationship.
- \(r = -1\): Perfect negative linear relationship.
- \(r = 0\): No linear relationship (though a non-linear relationship might exist).
- Values closer to +1 or -1 indicate stronger linear relationships.
Formula:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum (x_i - \bar x)^2 \sum (y_i - \bar y)^2}}$$
Note that the square of the correlation coefficient, \(r^2\) (or \(R^2\)), represents the proportion of the variance in the dependent variable (\(y\)) that is predictable from the independent variable (\(x\)) using the linear model. This is known as the coefficient of determination.
Compute Correlation in R
# Using the simple example data:
x_simple <- c(1, 2, 3)
y_simple <- c(4, 1, 7)
# Calculate Pearson correlation
correlation_simple <- cor(x_simple, y_simple)
print(paste("Correlation (r):", round(correlation_simple, 3)))
# Output: [1] "Correlation (r): 0.5"
# Calculate R-squared (coefficient of determination)
r_squared_simple <- correlation_simple^2
print(paste("Coefficient of Determination (R-squared):", round(r_squared_simple, 3)))
# Output: [1] "Coefficient of Determination (R-squared): 0.25"
# (Matches the value calculated manually in the detailed example)
Additional Examples from Lecture Notes
Regression can be applied to various datasets. Here's an example setup in R for a "Size vs Weight" scenario:
Example: Size vs Weight
# Sample Data
x_size <- c(1, 2, 3, 4, 4, 5, 7, 7)
y_weight2 <- c(4, 5, 7, 7, 10, 12, 10, 12)
# Fit model
model_size_weight <- lm(y_weight2 ~ x_size)
# Get results
print("Size vs Weight Model Summary:")
summary(model_size_weight)
# Plot
plot(x_size, y_weight2, main="Size vs Weight", xlab="Size", ylab="Weight", pch=16, col="darkgreen")
abline(model_size_weight, col="orange", lwd=2)
grid()
# Calculate correlation
correlation_size_weight <- cor(x_size, y_weight2)
print(paste("Size vs Weight Correlation (r):", round(correlation_size_weight, 3)))
# Output: [1] "Size vs Weight Correlation (r): 0.836"
Other common examples include analyzing marketing spend vs sales, temperature vs ice cream sales, study hours vs exam scores, etc.
Quick Check ✓
What happens to the regression line if we remove an outlier?
Removing an outlier can noticeably change the regression line, especially when the outlier also has high leverage (an extreme x-value): both the slope and the intercept can shift.
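A minimal R sketch of this effect, with made-up data in which the last point is the outlier:
# Made-up data; the point (10, 1) lies far below the trend of the others
x <- c(1, 2, 3, 4, 5, 10)
y <- c(2, 4, 6, 8, 10, 1)
coef(lm(y ~ x))          # with the outlier: the slope is even negative here
coef(lm(y[-6] ~ x[-6]))  # without the outlier: the points lie exactly on y = 2x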
Why do we square the errors instead of taking absolute values?
We square the errors for several reasons:
- It makes the math tractable: the squared loss is differentiable, so calculus yields the closed-form formulas for \(b_0\) and \(b_1\)
- It penalizes large errors more heavily (see the short sketch below)
- It treats positive and negative errors equally, so they cannot cancel out
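A minimal R sketch of the second point, using a made-up set of errors in which one is much larger than the rest:
# Made-up errors: three small ones and one large one
e <- c(1, -1, 1, 8)
sum(abs(e))  # absolute-error total: 11, of which the large error contributes 8
sum(e^2)     # squared-error total: 67, of which the large error contributes 64
# Squaring makes the single large error dominate the total penalty.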