STAT 120
Data: each case \(i\) has two measurements
A scatterplot is a plot of the points \((x_i, y_i)\)
positive association: as \(x\) increases, \(y\) increases
negative association: as \(x\) increases, \(y\) decreases
Correlation coefficient: denoted \(r\) (sample) or \(\rho\) (population)
Correlation can be heavily affected by outliers. Plot your data!
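As a minimal R sketch (using the built-in mtcars data, not a dataset from this course), a scatterplot plus the sample correlation:
# Scatterplot of two quantitative measurements per case
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
# Sample correlation r; negative here, since heavier cars get lower mpg
cor(mtcars$wt, mtcars$mpg)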
Goal: To find a straight line that best fits the data in a scatterplot
The estimated regression line is \(\hat{y} = a + bx\)
Slope: change in predicted \(y\) for every one-unit increase in \(x\) \[ b = \frac{\text{change in }\hat{y}}{\text{change in } x} \]
Intercept: predicted \(y\) value when \(x = 0\) \[ \hat{y} = a + b(0) = a \]
Geometrically, the residual \(y_i - \hat{y}_i\) is the vertical distance from each point to the fitted line
Outliers can be very influential on the regression line. Remove the suspect points and see whether the fitted line changes substantially
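A minimal sketch of these pieces in R, again with the built-in mtcars data (purely illustrative, not the lecture example): fit the line with lm(), read off \(a\) and \(b\) with coef(), and look at the residuals with resid().
fit <- lm(mpg ~ wt, data = mtcars)  # least-squares line: predicted mpg = a + b * wt
coef(fit)                           # intercept a and slope b
head(resid(fit))                    # residuals: observed y minus predicted y-hat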
Regression of BAC on number of beers
Call:
lm(formula = BAC ~ Beers, data = bac)
Residuals:
Min 1Q Median 3Q Max
-0.027118 -0.017350 0.001773 0.008623 0.041027
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.012701 0.012638 -1.005 0.332
Beers 0.017964 0.002402 7.480 2.97e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02044 on 14 degrees of freedom
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7855
F-statistic: 55.94 on 1 and 14 DF, p-value: 2.969e-06
Slope, \(b = 0.0180\): Estimate column and Beers row
Intercept, \(a = -0.0127\): Estimate column and Intercept row
\[ \widehat{BAC} = -0.0127 + 0.0180(\text{Beers}) \]
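The same estimates can be pulled from the fitted model object in R; bac.lm is an assumed name, since the output above only shows the lm() call.
bac.lm <- lm(BAC ~ Beers, data = bac)  # model from the summary above (object name assumed)
coef(bac.lm)                           # (Intercept) = -0.0127, Beers = 0.0180
summary(bac.lm)$coefficients           # full Estimate / Std. Error / t / p table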
Slope Interpretation?
y-intercept Interpretation?
If your friend drank 2 beers, what is your best guess at their BAC after 30 minutes? \[\widehat{BAC} = -0.0127 + 0.0180(2) = 0.023\]
Find the residual for the student in the dataset who drank 2 beers and had a BAC of 0.03. The residual is about \(y - \hat{y} = 0.03 - 0.023 = 0.007\)
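Both hand calculations can be checked with predict(), again assuming the bac.lm object defined above.
y.hat <- predict(bac.lm, newdata = data.frame(Beers = 2))  # predicted BAC, about 0.023
y.hat
0.03 - y.hat                                               # residual, about 0.007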
R-squared is the proportion (or percentage) of the variability observed in the response \(y\) that can be explained by the explanatory variable \(x\).
\(R^2 = r^2\) in the simple linear regression model (one explanatory variable)
BAC : \(R^2 = 0.7998\)
Called Multiple R-squared in the summary output
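A quick numerical check in R (bac.lm as assumed above):
summary(bac.lm)$r.squared   # 0.7998, the Multiple R-squared in the output
cor(bac$BAC, bac$Beers)^2   # r^2 matches R-squared for one explanatory variable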
Visually split the data by Sex. Potentially find different trends.
Visually assess differences between the sexes in terms of correlation or intercepts
Can also use the filter() function from the dplyr package to divide responses into the groups of interest
library(dplyr)
sat <- read.csv("https://math.carleton.edu/Stats215/RLabManual/sat.csv")
sat.MW <- filter(sat, region == "Midwest") # just MW states
cor(sat.MW$math, sat.MW$verbal)
[1] 0.9731605
Call:
lm(formula = math ~ verbal, data = sat.MW)
Coefficients:
(Intercept) verbal
-23.584 1.047
Correlation = 0.9732, Regression Slope = 1.0469, R-squared = 94.7%
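A sketch of where those three numbers come from in R; sat.lm.MW is an assumed name for the fitted model printed above.
sat.lm.MW <- lm(math ~ verbal, data = sat.MW)  # Midwest-only fit (object name assumed)
cor(sat.MW$math, sat.MW$verbal)                # correlation = 0.9732
coef(sat.lm.MW)["verbal"]                      # slope = 1.0469
summary(sat.lm.MW)$r.squared                   # R-squared = 0.947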
library(dplyr)
# Keep only the rows where 'verbal' is less than 550
filtered_data <- filter(sat.MW, verbal < 550)
# Alternatively, drop specific rows by index (here rows 2 and 10) with slice()
filtered_sat_MW <- sat.MW %>% slice(-c(2, 10))
# Correlation after removing those two rows
cor_result <- cor(filtered_sat_MW$math, filtered_sat_MW$verbal)
sat.lm.noIO <- lm(math ~ verbal, data = sat.MW, subset = -c(2, 10))
sat.lm.noIO
Call:
lm(formula = math ~ verbal, data = sat.MW, subset = -c(2, 10))
Coefficients:
(Intercept) verbal
6.1453 0.9956
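To see how influential the two removed states are, both fitted lines can be drawn on one scatterplot (sat.lm.MW is the assumed full-data fit from above).
plot(sat.MW$verbal, sat.MW$math, xlab = "Verbal SAT", ylab = "Math SAT")
abline(sat.lm.MW, lty = 2)  # line from all Midwest states (dashed)
abline(sat.lm.noIO)         # line with rows 2 and 10 removed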