STAT 120
The standard error for \(\bar{x}\) is \[S E_{\bar{x}}=\frac{\sigma}{\sqrt{n}}\] where \(\sigma\) is the population SD of your response
The standard error for \(\bar{x}_{1}-\bar{x}_{2}\) is \[S E_{\bar{x}_{1}-\bar{x}_{2}}=\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}\]
But we usually do not know \(\sigma\) !
The sampling distribution for a sample mean, \(\bar{x}\), is approximately \(N\left(\mu, S E_{\bar{x}}\right)\)
When is this approximately “good”?
if \(X \sim N(\mu, \sigma)\) then \(\bar{X} \sim N(\mu, \sigma / \sqrt{n})\)
if \(X \nsim N(\mu, \sigma)\) then \(\bar{X} \sim N(\mu, \sigma / \sqrt{n})\) if \(n \geqslant 30\)
The sampling distribution for a difference of two independent sample means is approximately \(N\left(\mu_{1}-\mu_{2}, S E_{\bar{x}_{1}-\bar{x}_{2}}\right)\)
When is this approximately “good”?
Academic Performance Index (API) is a number reflecting a school’s performance on a statewide standardized test
growth measures the growth in API from 1999 to 2000 (API 2000 - API 1999).Can we use t-inference methods to compare mean growths?
Both samples sizes (98 and 102) can be deemed large
No severe skewness (but two extreme outliers)
Estimated Standard Error : \(SE_{\bar{x}_h - \bar{x}_l} = \sqrt{\dfrac{28.75380^2}{102} + \dfrac{29.95048^2}{98}} = 4.1544\)
Test statistics: \(t = \dfrac{(25.24510 - 38.82653) - 0}{4.154404} = -3.2692\)
The observed mean difference is 3.3 SEs below the hypothesized mean difference of 0
    Welch Two Sample t-test
data:  growth by wealth
t = -3.2692, df = 196.71, p-value = 0.001273
alternative hypothesis: true difference in means between group high and group low is not equal to 0
95 percent confidence interval:
 -21.774321  -5.388544
sample estimates:
mean in group high  mean in group low 
          25.24510           38.82653 The p-value is 0.001273. If there is no difference between mean growth in the two populations, then there is just a 0.13% chance of seeing a sample mean difference that is 3.27 standard errors or more away from 0.
           cds stype            name                 sname snum
1 5.471911e+13     E Lincoln Element    Lincoln Elementary 5873
2 1.975342e+13     E Washington Elem Washington Elementary 2543
                    dname dnum       cname cnum flag pcttest api00 api99 target
1 Exeter Union Elementary  226      Tulare   53   NA      98   693   504     15
2   Redondo Beach Unified  585 Los Angeles   18   NA     100   745   615      9
  growth sch.wide comp.imp both awards meals ell yr.rnd mobility acs.k3 acs.46
1    189      Yes      Yes  Yes    Yes    50  18   <NA>        9     18     NA
2    130      Yes      Yes  Yes    Yes    41  20   <NA>       16     19     30
  acs.core pct.resp not.hsg hsg some.col col.grad grad.sch avg.ed full emer
1       NA       93      28  23       27       14        8   2.51   91    9
2       NA       81      11  26       32       16       16   2.99  100    3
  enroll api.stu    pw  fpc wealth
1    196     177 30.97 6194   high
2    391     313 30.97 6194   hight.test with outliers removed
    Welch Two Sample t-test
data:  growth by wealth
t = -4.395, df = 174.97, p-value = 1.916e-05
alternative hypothesis: true difference in means between group high and group low is not equal to 0
95 percent confidence interval:
 -23.571116  -8.961945
sample estimates:
mean in group high  mean in group low 
          22.56000           38.82653 How does removing outliers influence t-test stat and p-value?
95% Confidence Interval from the output:
Removing Outliers:
Interpretation: We are 95% confident that the mean API growth between 1999 and 2000 for all low wealth schools is anywhere from 8.96 points to 23.57 points higher than the mean API growth for all high wealth schools in California.
Data are paired if the data being compared consists of paired data values. Common paired data examples:
Use paired data to reduce natural variation in the response when comparing the two groups/treatments
Look at the difference between responses for each unit (pair) \[d_{i} = x_{1,i} - x_{2,i}\]
Analyze the mean of these differences rather than the average difference between two groups \[\textrm{sample mean difference: } \bar{d}\] \[\textrm{sample SD of difference: } s_d\] \[\textrm{population mean difference: } \mu_d\]
Use one sample inference methods for these differences
How much higher is non-resident tuition, on average, compared to resident tuition? Use the Tuition2006.csv lab manual data
Diff computes the difference Res - NonRestuition <- read.csv("http://math.carleton.edu/Stats215/RLabManual/Tuition2006.csv")
head(tuition)
##   X        Institution  Res NonRes  Diff
## 1 1 Univ of Akron (OH) 4200   8800 -4600
## 2 2  Athens State (AL) 1900   3600 -1700
## 3 3    Ball State (IN) 3400   8600 -5200
## 4 4  Bloomsburg U (PA) 3200   7000 -3800
## 5 5     UC Irvine (CA) 3400  12700 -9300
## 6 6 Central State (OH) 2600   5700 -3100t-test# alternate method
t.test(tuition$Res, tuition$NonRes, paired = TRUE)
## 
##  Paired t-test
## 
## data:  tuition$Res and tuition$NonRes
## t = -7.5349, df = 18, p-value = 5.69e-07
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -4583.580 -2584.841
## sample estimates:
## mean difference 
##       -3584.211We are 95% confident that the mean tuition for non-residents is $2585 to $4584 higher than mean tuition for residents

30:00