Regression

STAT 120

Bastola

Conducting a Linear Regression Test

Step 1: Stating the Hypotheses

  • Null Hypothesis \((H_0)\): There is no linear relationship between the predictor and response variables. Mathematically, \(H_0: \beta_1 = 0\).
  • Alternative Hypothesis \((H_1)\): There is a linear relationship between the predictor and response variables. Mathematically, \(H_1: \beta_1 \neq 0\).

Step 2: Checking Conditions

  1. Linearity: The relationship between the predictor and response variable should be linear.
  2. Independence: Observations are independent of each other.
  3. Homoscedasticity: Constant variance of error terms.
  4. Normality: The residuals of the model should be approximately normally distributed.

Step 3: Calculating the t-Statistic

The t-statistic is calculated using the formula:

\[ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \]

  • \(\hat{\beta}_1\) is the estimated slope coefficient.
  • \(SE(\hat{\beta}_1)\) is the standard error of the estimated slope.

This statistic measures how many standard errors the estimated slope lies from 0, the value stated in the null hypothesis. Large values of \(|t|\) are evidence against \(H_0\).
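
As a minimal sketch of this calculation, the R code below fits a simple regression to the built-in `cars` data set (an illustrative stand-in, not the course data) and rebuilds the t-statistic from the slope estimate and its standard error. The object names `fit`, `coefs`, and `t_manual` are assumptions made for this example.

```r
# Fit a simple linear regression on R's built-in cars data
# (stopping distance as response, speed as predictor)
fit <- lm(dist ~ speed, data = cars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

slope_hat <- coefs["speed", "Estimate"]    # estimated slope
se_slope  <- coefs["speed", "Std. Error"]  # standard error of the slope

# t = (estimate - null value) / SE, with a null slope of 0
t_manual <- (slope_hat - 0) / se_slope

# The hand-computed value should agree with the one lm() reports
all.equal(t_manual, coefs["speed", "t value"])
```

Dividing by the standard error puts the slope on a unit-free scale, which is why the same t-distribution can be used whatever the units of the predictor and response.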

Step 4: Calculating the p-Value

  • The p-value is obtained from the t-distribution with \(n - 2\) degrees of freedom, where \(n\) is the number of observations.
  • It indicates the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from the sample data, under the assumption that the null hypothesis is true.
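
The two-sided p-value can be sketched in R with `pt()`, the cumulative distribution function of the t-distribution. The helper name `slope_p_value` and the inputs `t_stat = 2.5` and `n = 30` are hypothetical, chosen only for illustration.

```r
# Two-sided p-value for a slope test: P(|T| >= |t_stat|) with n - 2 df
slope_p_value <- function(t_stat, n) {
  df <- n - 2
  2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
}

# Hypothetical example: t = 2.5 from a sample of 30 observations
p_val <- slope_p_value(t_stat = 2.5, n = 30)
p_val

# The sign of t does not matter for a two-sided test
all.equal(p_val, slope_p_value(t_stat = -2.5, n = 30))
```

Taking `abs(t_stat)` and doubling the upper-tail area counts extreme values in both tails, which matches the two-sided alternative \(H_1: \beta_1 \neq 0\).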

Step 5: Drawing Conclusions

  • If p-value < \(\alpha\): Reject \(H_0\). There is sufficient evidence to conclude there is a significant linear relationship between the predictor and response variable.
  • If p-value \(\geq \alpha\): Fail to reject \(H_0\). There is not enough evidence to conclude there is a significant linear relationship.

Confidence Interval for the Slope

  • A 95% confidence interval for the slope \((\beta_1)\) can be calculated as:

\[ CI: \hat{\beta}_1 \pm t^* \cdot SE(\hat{\beta}_1) \]

  • \(t^*\) is the critical value from the t-distribution with \(n - 2\) degrees of freedom for the chosen confidence level.
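
A minimal sketch of this interval in R, using the built-in `cars` data as a stand-in for real course data: `qt()` supplies \(t^*\), and the result is compared with `confint()`. All variable names here are assumptions for the example.

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))

slope_hat <- coefs["speed", "Estimate"]
se_slope  <- coefs["speed", "Std. Error"]

# Critical value t* for 95% confidence: the 97.5th percentile of the
# t-distribution with n - 2 (residual) degrees of freedom
t_star <- qt(0.975, df = df.residual(fit))

ci_manual <- c(slope_hat - t_star * se_slope,
               slope_hat + t_star * se_slope)
ci_manual

# confint() should return the same two endpoints
ci_builtin <- unname(confint(fit, "speed", level = 0.95)[1, ])
all.equal(ci_manual, ci_builtin)
```

For a different confidence level, only the percentile passed to `qt()` changes, for example `qt(0.995, ...)` for 99%.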

Interpreting the Confidence Interval

  • If the confidence interval for \(\beta_1\) does not include 0, it supports the conclusion that there is a significant linear relationship.
  • The confidence interval provides a range of plausible values for the slope, giving insight into the direction and likely size of the relationship.

Case Study

Objective: Learn to conduct linear regression analysis, focusing on GPA as a response to studying mathematics.

Dataset: School score data comprising GPA and average years spent studying mathematics.

Tools:

  • R for data analysis
  • ggplot2 for data visualization
  • Base R plots for diagnostic checks

Group Activity 1

  • Please download the Class-Activity-27 template from Moodle and go to the class helper web page
  • Let’s do a case study together as a class

Step 1: Load and Inspect Data

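# Load the packages used in later steps (the pipe %>% comes with dplyr)
library(dplyr)
library(ggplot2)
# Load school_scores.csv from your data directory
school_scores <- readr::read_csv("data/school_scores.csv")
# View the structure of the dataset
dplyr::glimpse(school_scores)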
Rows: 577
Columns: 99
$ Year                                                      <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005…
$ State.Code                                                <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC"…
$ State.Name                                                <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
$ Total.Math                                                <dbl> 559, 519, 530, 552, 522, 560, 517, 502, 478, 498, 49…
$ `Total.Test-takers`                                       <dbl> 3985, 3996, 18184, 1600, 186552, 11990, 34313, 6257,…
$ Total.Verbal                                              <dbl> 567, 523, 526, 563, 504, 560, 517, 503, 490, 498, 49…
$ `Academic Subjects.Arts/Music.Average GPA`                <dbl> 3.92, 3.76, 3.85, 3.90, 3.76, 3.88, 3.66, 3.71, 3.54…
$ `Academic Subjects.Arts/Music.Average Years`              <dbl> 2.2, 1.9, 2.1, 2.2, 1.8, 2.2, 2.1, 1.8, 1.8, 1.8, 1.…
$ `Academic Subjects.English.Average GPA`                   <dbl> 3.53, 3.35, 3.45, 3.61, 3.32, 3.49, 3.13, 3.21, 3.03…
$ `Academic Subjects.English.Average Years`                 <dbl> 3.9, 3.9, 3.9, 4.0, 3.8, 4.0, 3.9, 3.9, 3.8, 3.8, 3.…
$ `Academic Subjects.Foreign Languages.Average GPA`         <dbl> 3.54, 3.34, 3.41, 3.64, 3.29, 3.41, 3.03, 3.18, 3.04…
$ `Academic Subjects.Foreign Languages.Average Years`       <dbl> 2.6, 2.1, 2.6, 2.6, 2.8, 3.1, 3.1, 2.7, 2.7, 2.4, 2.…
$ `Academic Subjects.Mathematics.Average GPA`               <dbl> 3.41, 3.06, 3.25, 3.46, 3.05, 3.33, 3.00, 3.07, 2.91…
$ `Academic Subjects.Mathematics.Average Years`             <dbl> 4.0, 3.5, 3.9, 4.1, 3.7, 3.9, 3.8, 3.8, 3.7, 3.8, 3.…
$ `Academic Subjects.Natural Sciences.Average GPA`          <dbl> 3.52, 3.25, 3.43, 3.55, 3.20, 3.43, 3.07, 3.19, 2.99…
$ `Academic Subjects.Natural Sciences.Average Years`        <dbl> 3.9, 3.2, 3.4, 3.7, 3.2, 3.7, 3.5, 3.6, 3.3, 3.5, 3.…
$ `Academic Subjects.Social Sciences/History.Average GPA`   <dbl> 3.59, 3.39, 3.55, 3.67, 3.38, 3.56, 3.18, 3.30, 3.11…
$ `Academic Subjects.Social Sciences/History.Average Years` <dbl> 3.9, 3.4, 3.3, 3.6, 3.3, 3.7, 3.6, 3.6, 3.4, 3.5, 3.…
$ `Family Income.Between 20-40k.Math`                       <dbl> 513, 492, 498, 513, 477, 533, 463, 449, 391, 471, 45…
$ `Family Income.Between 20-40k.Test-takers`                <dbl> 324, 401, 2121, 180, 26161, 948, 2958, 762, 487, 147…
$ `Family Income.Between 20-40k.Verbal`                     <dbl> 527, 500, 495, 526, 458, 535, 467, 454, 404, 473, 46…
$ `Family Income.Between 40-60k.Math`                       <dbl> 539, 517, 520, 543, 506, 543, 493, 481, 433, 492, 48…
$ `Family Income.Between 40-60k.Test-takers`                <dbl> 442, 539, 2270, 245, 18347, 1287, 3186, 802, 246, 12…
$ `Family Income.Between 40-60k.Verbal`                     <dbl> 551, 522, 518, 555, 494, 548, 499, 487, 454, 496, 48…
$ `Family Income.Between 60-80k.Math`                       <dbl> 550, 513, 524, 553, 521, 553, 507, 497, 470, 504, 49…
$ `Family Income.Between 60-80k.Test-takers`                <dbl> 473, 603, 2372, 227, 17937, 1550, 3772, 833, 199, 10…
$ `Family Income.Between 60-80k.Verbal`                     <dbl> 564, 519, 523, 570, 511, 552, 511, 501, 482, 505, 50…
$ `Family Income.Between 80-100k.Math`                      <dbl> 566, 528, 534, 570, 535, 562, 523, 512, 539, 516, 51…
$ `Family Income.Between 80-100k.Test-takers`               <dbl> 475, 444, 1866, 147, 14120, 1427, 3018, 592, 161, 71…
$ `Family Income.Between 80-100k.Verbal`                    <dbl> 577, 534, 533, 580, 525, 560, 523, 510, 549, 517, 51…
$ `Family Income.Less than 20k.Math`                        <dbl> 462, 464, 485, 489, 451, 514, 434, 411, 374, 433, 42…
$ `Family Income.Less than 20k.Test-takers`                 <dbl> 175, 191, 891, 107, 19323, 324, 1612, 373, 535, 8728…
$ `Family Income.Less than 20k.Verbal`                      <dbl> 474, 467, 474, 486, 421, 505, 426, 410, 377, 431, 42…
$ `Family Income.More than 100k.Math`                       <dbl> 588, 541, 554, 572, 566, 574, 565, 554, 608, 544, 54…
$ `Family Income.More than 100k.Test-takers`                <dbl> 980, 540, 3083, 314, 27984, 2662, 5952, 939, 546, 10…
$ `Family Income.More than 100k.Verbal`                     <dbl> 590, 544, 546, 589, 551, 568, 559, 550, 622, 538, 54…
$ `GPA.A minus.Math`                                        <dbl> 569, 544, 541, 559, 562, 573, 585, 534, 566, 530, 53…
$ `GPA.A minus.Test-takers`                                 <dbl> 724, 673, 3334, 298, 30545, 2323, 4742, 1000, 437, 1…
$ `GPA.A minus.Verbal`                                      <dbl> 575, 546, 535, 572, 538, 570, 577, 532, 574, 526, 53…
$ `GPA.A plus.Math`                                         <dbl> 622, 600, 605, 629, 625, 627, 652, 593, 584, 597, 61…
$ `GPA.A plus.Test-takers`                                  <dbl> 563, 173, 1684, 273, 7502, 1098, 497, 311, 60, 5958,…
$ `GPA.A plus.Verbal`                                       <dbl> 623, 604, 593, 639, 603, 614, 643, 585, 578, 589, 60…
$ GPA.A.Math                                                <dbl> 600, 580, 571, 579, 592, 602, 616, 558, 559, 554, 56…
$ `GPA.A.Test-takers`                                       <dbl> 1032, 671, 3854, 457, 25546, 2736, 2646, 956, 316, 1…
$ GPA.A.Verbal                                              <dbl> 608, 578, 563, 583, 565, 598, 606, 556, 570, 550, 55…
$ GPA.B.Math                                                <dbl> 514, 492, 498, 492, 494, 526, 506, 481, 466, 474, 47…
$ `GPA.B.Test-takers`                                       <dbl> 1253, 1622, 7193, 437, 84659, 4312, 17108, 2718, 149…
$ GPA.B.Verbal                                              <dbl> 525, 499, 499, 511, 480, 529, 506, 482, 479, 476, 47…
$ GPA.C.Math                                                <dbl> 436, 466, 458, 419, 434, 484, 431, 422, 370, 420, 41…
$ `GPA.C.Test-takers`                                       <dbl> 188, 418, 1184, 57, 18839, 732, 4338, 731, 643, 9420…
$ GPA.C.Verbal                                              <dbl> 451, 472, 464, 436, 427, 489, 442, 430, 386, 426, 42…
$ `GPA.D or lower.Math`                                     <dbl> 0, 424, 439, 0, 419, 457, 395, 392, 323, 395, 376, 3…
$ `GPA.D or lower.Test-takers`                              <dbl> 0, 12, 16, 0, 240, 12, 105, 19, 13, 111, 74, 25, 2, …
$ `GPA.D or lower.Verbal`                                   <dbl> 0, 466, 435, 0, 408, 462, 407, 399, 377, 408, 403, 3…
$ `GPA.No response.Math`                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `GPA.No response.Test-takers`                             <dbl> 225, 427, 919, 78, 19221, 777, 4877, 522, 660, 7033,…
$ `GPA.No response.Verbal`                                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Gender.Female.Math                                        <dbl> 538, 505, 513, 536, 504, 546, 502, 486, 451, 484, 48…
$ `Gender.Female.Test-takers`                               <dbl> 2072, 2161, 9806, 859, 102944, 6407, 17857, 3428, 19…
$ Gender.Female.Verbal                                      <dbl> 561, 521, 522, 558, 499, 558, 513, 498, 475, 496, 49…
$ Gender.Male.Math                                          <dbl> 582, 535, 549, 570, 543, 577, 534, 521, 509, 516, 51…
$ `Gender.Male.Test-takers`                                 <dbl> 1913, 1835, 8378, 741, 83608, 5583, 16456, 2829, 167…
$ Gender.Male.Verbal                                        <dbl> 574, 526, 531, 570, 510, 561, 520, 508, 508, 502, 50…
$ `Score Ranges.Between 200 to 300.Math.Females`            <dbl> 22, 30, 119, 12, 2978, 40, 669, 104, 257, 1436, 1065…
$ `Score Ranges.Between 200 to 300.Math.Males`              <dbl> 10, 20, 72, 7, 1453, 24, 368, 61, 141, 866, 665, 72,…
$ `Score Ranges.Between 200 to 300.Math.Total`              <dbl> 32, 50, 191, 19, 4431, 64, 1037, 165, 398, 2302, 173…
$ `Score Ranges.Between 200 to 300.Verbal.Females`          <dbl> 14, 26, 115, 9, 3382, 39, 460, 110, 199, 1256, 828, …
$ `Score Ranges.Between 200 to 300.Verbal.Males`            <dbl> 17, 26, 86, 3, 2433, 22, 435, 83, 133, 1173, 760, 11…
$ `Score Ranges.Between 200 to 300.Verbal.Total`            <dbl> 31, 52, 201, 12, 5815, 61, 895, 193, 332, 2429, 1588…
$ `Score Ranges.Between 300 to 400.Math.Females`            <dbl> 173, 233, 881, 68, 14595, 313, 2540, 593, 571, 8463,…
$ `Score Ranges.Between 300 to 400.Math.Males`              <dbl> 93, 153, 450, 31, 7159, 202, 1583, 329, 347, 4955, 3…
$ `Score Ranges.Between 300 to 400.Math.Total`              <dbl> 266, 386, 1331, 99, 21754, 515, 4123, 922, 918, 1341…
$ `Score Ranges.Between 300 to 400.Verbal.Females`          <dbl> 123, 218, 739, 46, 15386, 257, 2128, 466, 530, 6947,…
$ `Score Ranges.Between 300 to 400.Verbal.Males`            <dbl> 84, 171, 613, 42, 10784, 212, 1698, 376, 320, 5555, …
$ `Score Ranges.Between 300 to 400.Verbal.Total`            <dbl> 207, 389, 1352, 88, 26170, 469, 3826, 842, 850, 1250…
$ `Score Ranges.Between 400 to 500.Math.Females`            <dbl> 514, 696, 3215, 210, 31530, 1529, 5181, 1157, 423, 1…
$ `Score Ranges.Between 400 to 500.Math.Males`              <dbl> 293, 485, 1948, 137, 20172, 927, 4108, 786, 308, 121…
$ `Score Ranges.Between 400 to 500.Math.Total`              <dbl> 807, 1181, 5163, 347, 51702, 2456, 9289, 1943, 731, …
$ `Score Ranges.Between 400 to 500.Verbal.Females`          <dbl> 430, 656, 3048, 183, 32897, 1343, 5288, 1163, 439, 1…
$ `Score Ranges.Between 400 to 500.Verbal.Males`            <dbl> 332, 552, 2398, 141, 25260, 1140, 4769, 870, 376, 13…
$ `Score Ranges.Between 400 to 500.Verbal.Total`            <dbl> 762, 1208, 5446, 324, 58157, 2483, 10057, 2033, 815,…
$ `Score Ranges.Between 500 to 600.Math.Females`            <dbl> 722, 813, 3576, 316, 30765, 2524, 5533, 1005, 284, 1…
$ `Score Ranges.Between 500 to 600.Math.Males`              <dbl> 614, 616, 3152, 244, 26052, 1889, 5208, 909, 303, 14…
$ `Score Ranges.Between 500 to 600.Math.Total`              <dbl> 1336, 1429, 6728, 560, 56817, 4413, 10741, 1914, 587…
$ `Score Ranges.Between 500 to 600.Verbal.Females`          <dbl> 690, 729, 3661, 302, 30190, 2529, 5729, 1020, 283, 1…
$ `Score Ranges.Between 500 to 600.Verbal.Males`            <dbl> 617, 596, 3101, 236, 25399, 2125, 5276, 866, 279, 13…
$ `Score Ranges.Between 500 to 600.Verbal.Total`            <dbl> 1307, 1325, 6762, 538, 55589, 4654, 11005, 1886, 562…
$ `Score Ranges.Between 600 to 700.Math.Females`            <dbl> 485, 342, 1688, 204, 17625, 1619, 3108, 460, 277, 62…
$ `Score Ranges.Between 600 to 700.Math.Males`              <dbl> 611, 445, 2126, 239, 19980, 1864, 3714, 550, 332, 79…
$ `Score Ranges.Between 600 to 700.Math.Total`              <dbl> 1096, 787, 3814, 443, 37605, 3483, 6822, 1010, 609, …
$ `Score Ranges.Between 600 to 700.Verbal.Females`          <dbl> 596, 423, 1831, 242, 16078, 1708, 3306, 523, 280, 71…
$ `Score Ranges.Between 600 to 700.Verbal.Males`            <dbl> 613, 375, 1679, 226, 14966, 1610, 3215, 473, 349, 66…
$ `Score Ranges.Between 600 to 700.Verbal.Total`            <dbl> 1209, 798, 3510, 468, 31044, 3318, 6521, 996, 629, 1…
$ `Score Ranges.Between 700 to 800.Math.Females`            <dbl> 156, 47, 327, 49, 5451, 382, 826, 109, 137, 1147, 79…
$ `Score Ranges.Between 700 to 800.Math.Males`              <dbl> 292, 116, 630, 83, 8792, 677, 1475, 194, 242, 2328, …
$ `Score Ranges.Between 700 to 800.Math.Total`              <dbl> 448, 163, 957, 132, 14243, 1059, 2301, 303, 379, 347…
$ `Score Ranges.Between 700 to 800.Verbal.Females`          <dbl> 219, 109, 412, 77, 5011, 531, 946, 146, 218, 1571, 1…
$ `Score Ranges.Between 700 to 800.Verbal.Males`            <dbl> 250, 115, 501, 93, 4766, 474, 1063, 161, 216, 1694, …
$ `Score Ranges.Between 700 to 800.Verbal.Total`            <dbl> 469, 224, 913, 170, 9777, 1005, 2009, 307, 434, 3265…

Step 2: Data Cleaning

Cleaning Tasks:

  • Standardize column names
  • Remove rows with missing values

school_scores_clean <- school_scores %>%
  janitor::clean_names() %>% # standardize the column names
  tidyr::drop_na() # drop rows that contain any missing values
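
To see what these two cleaning calls do, here is a small sketch on a made-up three-row data frame. Only the column names mirror the real file; the values are loosely based on the first rows shown above, with one replaced by `NA` for illustration. It assumes the janitor and tidyr packages are installed, and the names `toy` and `toy_clean` are hypothetical.

```r
# Toy data frame with original-style column names and one missing value
toy <- data.frame(
  "State.Name" = c("Alabama", "Alaska", "Arizona"),
  "Total.Math" = c(559, NA, 530),
  "Academic Subjects.Mathematics.Average GPA" = c(3.41, 3.06, 3.25),
  check.names = FALSE  # keep the spaces and dots in the names as written
)

# Standardize the names, then drop any row that contains a missing value
toy_clean <- tidyr::drop_na(janitor::clean_names(toy))

names(toy_clean)
# "state_name" "total_math" "academic_subjects_mathematics_average_gpa"

nrow(toy_clean)  # 2: the row with the missing Total.Math value is dropped
```

Note that `drop_na()` with no column arguments removes a row if any column is missing, so on a wide table it can discard more rows than expected.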

Step 3: Data Manipulation

# Keep the four columns needed and bin math GPA into three categories
school_final <- school_scores_clean %>%
  dplyr::select(academic_subjects_mathematics_average_gpa, state_name,
                academic_subjects_mathematics_average_years, total_math) %>%
  dplyr::mutate(GPA_category = cut(academic_subjects_mathematics_average_gpa,
                                   breaks = c(-Inf, 3.25, 3.5, Inf),
                                   labels = c("Low", "Good", "Excellent")))
# Alternative: the same categories with case_when()
# (this yields a character column rather than a factor)
school_final <- school_scores_clean %>%
  dplyr::select(academic_subjects_mathematics_average_gpa, state_name,
                academic_subjects_mathematics_average_years, total_math) %>%
  dplyr::mutate(GPA_category = dplyr::case_when(
    academic_subjects_mathematics_average_gpa <= 3.25 ~ "Low",
    academic_subjects_mathematics_average_gpa > 3.25 &
      academic_subjects_mathematics_average_gpa <= 3.5 ~ "Good",
    academic_subjects_mathematics_average_gpa > 3.5 ~ "Excellent"))
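
The two versions are meant to assign the same labels. A quick sketch on a few hypothetical GPA values, including the cut points 3.25 and 3.5, checks that `cut()` (whose intervals are closed on the right by default) and `case_when()` agree.

```r
# Hypothetical GPA values, chosen to include both cut points
gpa <- c(3.00, 3.25, 3.26, 3.50, 3.51)

# cut(): intervals are (-Inf, 3.25], (3.25, 3.5], (3.5, Inf]
by_cut <- cut(gpa,
              breaks = c(-Inf, 3.25, 3.5, Inf),
              labels = c("Low", "Good", "Excellent"))

# case_when(): the first condition that is TRUE determines the label
by_case <- dplyr::case_when(
  gpa <= 3.25 ~ "Low",
  gpa <= 3.5  ~ "Good",
  gpa >  3.5  ~ "Excellent"
)

as.character(by_cut)
# "Low" "Low" "Good" "Good" "Excellent"

identical(as.character(by_cut), by_case)  # TRUE
```

Because `cut()` returns a factor with levels in the order Low, Good, Excellent, plots keep that order; with the character output of `case_when()`, categories are sorted alphabetically unless converted to a factor.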

Step 4: Data Visualization

ggplot(data = school_final, aes(x = academic_subjects_mathematics_average_years, 
                                y = academic_subjects_mathematics_average_gpa)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Study Years", y = "GPA")

Step 5: Linear Regression Model

GPA.lm <- lm(academic_subjects_mathematics_average_gpa ~ 
               academic_subjects_mathematics_average_years, 
             data = school_final)
summary(GPA.lm)

Call:
lm(formula = academic_subjects_mathematics_average_gpa ~ academic_subjects_mathematics_average_years, 
    data = school_final)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33812 -0.10344  0.00478  0.11188  0.38478 

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                  -0.5592     0.1348  -4.147 3.87e-05 ***
academic_subjects_mathematics_average_years   0.9823     0.0342  28.725  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.138 on 575 degrees of freedom
Multiple R-squared:  0.5893,    Adjusted R-squared:  0.5886 
F-statistic: 825.1 on 1 and 575 DF,  p-value: < 2.2e-16

Step 6: Diagnostic Plots: Residual Plot

plot(GPA.lm, which = 1, cex.main=0.8, cex.lab=0.8) # Residual Plot

Step 6: Diagnostic Plots: QQ Plot

plot(GPA.lm, which = 2, cex.main=0.8, cex.lab=0.8) # QQ Plot

Step 7: Hypothesis Testing for Slope

\(H_0: \beta_1 = 0\) (No relationship)

\(H_1: \beta_1 \neq 0\) (A relationship exists)

Hypothesis tests for the slope follow the formula:

\[ t = \frac{b_1 - \text{null slope}}{SE(b_1)} \]

library(broom)
tidy(summary(GPA.lm))
# A tibble: 2 × 5
  term                                        estimate std.error statistic   p.value
  <chr>                                          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)                                   -0.559    0.135      -4.15 3.87e-  5
2 academic_subjects_mathematics_average_years    0.982    0.0342     28.7  3.29e-113

\[ \widehat{\text{GPA}} = -0.559 + 0.982 \times \text{(average years of mathematics)} \]
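
As a worked check using the rounded values printed in the summary output (slope 0.9823, standard error 0.0342), the test statistic is approximately

\[ t = \frac{0.9823 - 0}{0.0342} \approx 28.72, \]

which matches the reported 28.725 up to rounding. The reported p-value of about \(3.29 \times 10^{-113}\) is far below any conventional \(\alpha\), so we reject \(H_0\). The fitted line can also be used for prediction. For a hypothetical state averaging 4 years of mathematics,

\[ \widehat{\text{GPA}} = -0.559 + 0.982 \times 4 = 3.369. \]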

Step 8: CI for Slope

\(H_0: \beta_1 = 0\) (No relationship)

\(H_1: \beta_1 \neq 0\) (A relationship exists)

Confidence intervals for the slope follow the formula:

\[ b_1 \pm t^* \cdot SE(b_1) \]

where \(b_1\) is the slope estimate, \(SE(b_1)\) is the standard error of the slope, and \(t^*\) is the critical value from the t-distribution with \(n - 2\) degrees of freedom.

conf_interval <- confint(GPA.lm, "academic_subjects_mathematics_average_years", level = 0.95)
conf_interval
                                               2.5 %   97.5 %
academic_subjects_mathematics_average_years 0.915102 1.049427
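
The interval can also be rebuilt by hand from the formula. The sketch below plugs in the rounded slope and standard error printed in the regression summary, together with its 575 residual degrees of freedom; the variable names are assumptions for this example, and small differences from `confint()` come from rounding the inputs.

```r
# Values read off the regression summary (rounded as printed)
b1 <- 0.9823   # slope estimate
se <- 0.0342   # standard error of the slope
df <- 575      # residual degrees of freedom (577 rows - 2)

t_star <- qt(0.975, df = df)  # critical value for 95% confidence

ci_by_hand <- c(lower = b1 - t_star * se,
                upper = b1 + t_star * se)
round(ci_by_hand, 3)  # should land close to the confint() endpoints above
```

Since the whole interval lies above 0, it agrees with the hypothesis test in rejecting \(H_0: \beta_1 = 0\).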

Step 9: Handling Outliers

Boxplots of total math score within each GPA category (Low, Good, Excellent) make it easy to spot outliers in each group.
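
One common convention, and the default that boxplots themselves use, flags points lying more than 1.5 times the interquartile range beyond the box. The sketch below applies that rule within each category using base R's `boxplot.stats()`. The data are invented for illustration, and this is not necessarily the rule behind the cutoffs chosen in this case study.

```r
# Toy data (hypothetical values, not taken from school_scores.csv)
toy <- data.frame(
  GPA_category = c(rep("Low", 6), rep("Good", 6)),
  total_math   = c(450, 460, 470, 480, 490, 620,  # 620 sits far above the rest
                   530, 540, 545, 550, 560, 565)
)

# For each category, return the values that base R's boxplot() would
# draw as individual outlier points (beyond 1.5 x IQR from the hinges)
outliers_by_group <- lapply(
  split(toy$total_math, toy$GPA_category),
  function(x) boxplot.stats(x)$out
)
outliers_by_group
```

Here only the value 620 in the Low group is flagged. `geom_boxplot()` computes its quartiles slightly differently from `boxplot.stats()`, so the flagged points can differ a little near the boundary.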

Step 10: Removing Outliers

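library(dplyr)
# Keep only scores inside a hand-picked range for each GPA category.
# school_final (built in Step 3) has the GPA_category and total_math columns.
school_selected_no_outlier <- school_final %>%
  filter((GPA_category == "Excellent" & total_math >= 575 & total_math <= 800) |
         (GPA_category == "Good" & total_math >= 525 & total_math <= 575) |
         (GPA_category == "Low" & total_math >= 425 & total_math <= 525))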

school_selected_no_outlier %>%
  ggplot(aes(x=GPA_category, y=total_math, fill=GPA_category)) +
  theme_bw() +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  labs(title="Boxplot of Math SAT Across GPA Categories",
       y="Average Math Score", x="GPA") +
  stat_summary(fun=mean, geom="point", shape=10, size=2, color="red", fill="black")