<- read.csv("http://people.carleton.edu/~kstclair/data/api.csv") api
Practice Problems 21
Problem 1: API
The Academic Performance Index (API) is computed for all California schools. It is a number, ranging from a low of 200 to a high of 1000, that reflects a school’s performance on a statewide standardized test (http://api.cde.ca.gov). We have a SRS of 200 schools and are interested in how a school’s performance is related to the wealth of its students. The variable growth
measures the growth in API from 1999 to 2000 (API 2000 - API 1999).
(a) Categorizing wealth
Let’s define a school as “low wealth” if over 50% of its students are eligible for subsidized meals
and “high wealth” otherwise. We can use an ifelse
command to create a variable wealth
that measures this:
$wealth <- ifelse(api$meals > 50, "low","high")
apitable(api$wealth)
high low
102 98
library(dplyr)
%>% group_by(wealth) %>% summarize(mean(growth), sd(growth)) api
# A tibble: 2 × 3
wealth `mean(growth)` `sd(growth)`
<chr> <dbl> <dbl>
1 high 25.2 28.8
2 low 38.8 30.0
boxplot(growth ~ wealth, data=api, xlab="API growth (2000 - 1999)" , horizontal=T)
- How many schools are “low” and “high” wealth.
- Are wealth and API growth related?
- What is the observed difference in mean API growth between high and low wealth schools. Use correct notation.
- Can we use t-inference methods to compare mean growths?
Click for answer
Answer: There are \(n_h = 102\) “high” wealth and \(n_l = 98\) “low” wealth schools. The low wealth schools tend to have higher (and more variable) growth than high wealth schools. The difference in observed mean API growth between high and low growth schools is \(\bar{x}_h - \bar{x}_l = 25.24510 - 38.82653 = -13.58\). We can use t-methods since both samples sizes (98 and 102) can be deemed large and there isn’t severe skewness, but there are two extreme outliers that will be addressed below.
(b) SE for the sample mean difference
What is the estimated SE for the sample mean difference?
Click for answer
Answer: The SE for the mean difference is 4.1544:
\[ SD_{\bar{x}_h - \bar{x}_l} = \sqrt{\dfrac{28.75380^2}{102} + \dfrac{29.95048^2}{98}} = 4.1544 \]
sqrt(28.75380^2/102 + 29.95048^2/98)
[1] 4.154404
(c) t-test statistic
Using your SE from (b) to compute the t-test statistic that can be used to determine if mean API growth differs for low and high wealth schools. Write down your hypotheses then show how the t test statistic is calculated. Interpret this value in context.
Click for answer
Answer: The hypotheses are \(H_0: \mu_h - \mu_l = 0\) vs \(H_A: \mu_h - \mu_l \neq 0\). The test stat is
\[t = \dfrac{(25.24510 - 38.82653) - 0}{4.154404} = -3.2692\]
The observed mean difference is 3.3 SEs below the hypothesized mean difference of 0.
25.24510 - 38.82653) - 0)/4.154404 ((
[1] -3.269164
(d) Two-sample t-test
Is there evidence that mean API growth differs for low and high wealth schools? Give the hypotheses for this test, then run the t.test(y ~ x, data=)
command below to conduct a t-test to give a p-value and conclusion.
t.test(growth ~ wealth, data=api)
Welch Two Sample t-test
data: growth by wealth
t = -3.2692, df = 196.71, p-value = 0.001273
alternative hypothesis: true difference in means between group high and group low is not equal to 0
95 percent confidence interval:
-21.774321 -5.388544
sample estimates:
mean in group high mean in group low
25.24510 38.82653
- What is the t test stat given in the output? Verify that it matches your answer to (c), within reasonable rounding error.
Click for answer
Answer: The test stat matches, \(t = -3.2692\).
- What is the p-value for the test? Interpret this value.
Click for answer
- What is your test conclusion?
Click for answer
(e) Consider outliers
The boxplot in (a) shows a number of outliers for the high
wealth group, but two cases in particular were very high. Suppose we omitted these two (most) extreme cases when running the test in (d). Will the p-value for this test be smaller or larger than the p-value computed in part (d)? Explain.
Click for answer
Answer: Removing the two large outliers which will both reduce the mean in the high group and reduce the SD in the high group. Both actions will magnify the difference in mean growth between the high and low groups (increasing the difference and decreasing the SE), so the test stat will increase in magnitude and the p-value will decrease.
(f) Check outlier influence
To omit these cases we have to find their row numbers, then subset
them out of the data:
which(api$growth > 120 )
[1] 74 119
%>% slice(74,119) # another dplyr package command api
cds stype name sname snum
1 5.471911e+13 E Lincoln Element Lincoln Elementary 5873
2 1.975342e+13 E Washington Elem Washington Elementary 2543
dname dnum cname cnum flag pcttest api00 api99 target
1 Exeter Union Elementary 226 Tulare 53 NA 98 693 504 15
2 Redondo Beach Unified 585 Los Angeles 18 NA 100 745 615 9
growth sch.wide comp.imp both awards meals ell yr.rnd mobility acs.k3 acs.46
1 189 Yes Yes Yes Yes 50 18 <NA> 9 18 NA
2 130 Yes Yes Yes Yes 41 20 <NA> 16 19 30
acs.core pct.resp not.hsg hsg some.col col.grad grad.sch avg.ed full emer
1 NA 93 28 23 27 14 8 2.51 91 9
2 NA 81 11 26 32 16 16 2.99 100 3
enroll api.stu pw fpc wealth
1 196 177 30.97 6194 high
2 391 313 30.97 6194 high
t.test(growth ~ wealth, data = api, subset = -c(74,119))
Welch Two Sample t-test
data: growth by wealth
t = -4.395, df = 174.97, p-value = 1.916e-05
alternative hypothesis: true difference in means between group high and group low is not equal to 0
95 percent confidence interval:
-23.571116 -8.961945
sample estimates:
mean in group high mean in group low
22.56000 38.82653
- How does the t-test stat change when omitting these two changes? Why does it change in this direction?
- Check your answer here with your anwer in part (e)!
Click for answer
Answer: Without these outliers, the p-value decreases to 0.00001916 and we have even stronger evidence for a difference in mean API growth. Why does the p-value decrease? Omitting the two outliers will decrease the sample SD for the high group, which in turn will (slightly) decrease the SE for the difference in means. Omitting the two outliers will also decrease the sample mean for the high group (from 25.24510 to 22.56000), which will make the observed difference in means larger in magnitude (from -13.58 to -16.27). The test stat gets even further from 0 (drops from -3.2692 to -4.395), meaning the observed difference with outliers omitted is further away from 0 (in terms of SEs) than it was when all data points were included. This means that the p-value will decrease (from 0.0013 to 0.00002) since the data is deemed more ``extreme” under the null hypothesis.
(g) 95% confidence interval
Compare the two 95% CI given in the output (with and without outliers). Explain how and why the CIs change after omitting these two outliers.
Click for answer
Answer: Without outliers: -23.57 to -8.96 and with outliers: -21.77 to -5.39. An mentioned above, omitting the two points makes the difference in means further away from 0. This shifts the CI further from a difference of 0. Removing the outliers also decrease the SE of our sample difference, so the margin of error for the interval without outliers is, roughly, 7 while the margin of error with outliers is, roughly, 8.
(h) Interpret two-sample CI
Using the results without the two outliers, interpret the 95% CI given in this output. Do not use the word ``difference’’ in your answer.
Click for answer
Answer: We are 95% confident that the mean API growth between 1999 and 2000 for all low wealth schools is anywhere from 8.96 points to 23.57 points higher than the mean API growth for all high wealth schools in California.
Problem 2: Matched Pairs
A study is conducted to determine the effect of a home meter for helping diabetics control their blood glucose levels. Researchers would like to determine if the home meter is effective in helping patients reduce their blood glucose levels. A random sample of 36 diabetics had their blood glucose levels measured before they were taught to use the meter and again after they had utilized the meter for 2 weeks. Researchers observed an average decrease (before - after) of blood glucose level of 2.78 mmol/liter with a standard deviation of 6.05 mmol/liter. Analysis results are shown below:
Sample mean: 2.78 ; sample standard deviation: 6.05 ; sample size: 36
Standard error: 1.0083
95 percent confidence interval for true mean: 1.0763 , Infinity
Hypothesis test H0: mu = 0 Alternative is greater
t statistic = 2.757 ; degrees of freedom = 35 ; p-value= 0.0046
(a) What conditions need to be met by this data to use \(t\) inference procedures?
Click for answer
Answer: There is a moderate sample size of \(n=36\) so we need to assume that the observed differences (before-after) are not strongly skewed and that there are no outliers. If these assumptions are not met, then the t-inference procedures above may not be appropriate.
(b) Define the unknown parameter of interest (be very specific), then state the null and alternative hypotheses for this test. Make sure your hypotheses agree with the output!
Click for answer
Answer: Let the \(\mu\) represent the population mean decrease in glucose levels measured before and after the treatment (before - after). A positive value of \(\mu\) implies that the home meter is effective in reducing blood glucose levels. The alternative hypothesis (the research statement) will be that \(\mu\) is greater than 0 and the null statement will be that \(\mu\) is equal to 0, meaning there is no benefit to using the treatment. \[ H_0: \mu = 0 \textrm{ vs. } H_A: \mu > 0 \]
(c) What is the test statistic value for this test? What does this value indicate?
Click for answer
Answer: The test stat value is 2.757. The mean glucose level decrease in the sample was 2.757 SE’s above the hypothesized mean decrease of 0.
(d) Is there sufficient evidence to claim that the monitor is effective in helping patients reduce their blood glucose levels?
Click for answer
Answer: You reject \(H_0\) when the \(p\)-value is small. Since the P-value of 0.3% is quite small, we can conclude that there is strong evidence that the use of home meters lowers blood glucose levels, on average (\(H_A\)).
(e) What type of error (1 or 2) could you have made in part (d)? If you did make this error, what are its implications for people with diabetes?
Click for answer
Answer: Since we rejected, we may have made a type 1 error of rejecting the null when it is actually true. This means we would have claimed that the home meter was useful in reducing blood glucose levels, on average, when in fact it doesn’t reduce levels. People with diabetes would be encouraged to use these meters (at a cost to themselves or their insurance company) to help control their glucose levels and not see any real benefit.
(f) Compute and interpret a 95% confidence interval for the true average decrease in blood glucose levels. (Note that this CI is not given above, the CI given in the output is a ``one-sided” CI.)
Click for answer
Answer: The 95% CI for the population mean decrease in glucose level is \[
\bar{x} \pm t^*_{n-1} \dfrac{s}{\sqrt{n}} = 2.78\pm 2.042 \dfrac{6.05}{\sqrt{36}} = 2.78 \pm 2.017 = (0.72, 4.84)
\] where \(t^*\) is based on 36-1=35 degrees of freedom. Using the green table, we round df down to 30 so we get \(t^*_{30} = 2.042\). Or using R command qt(.975,df=35)
we get the exact value \(t^*_{35}=2.0301\). We are 95% confident that, after learning to use a home meter, the average decrease in blood glucose in this population is between 0.72 and 4.84 mmol/liter.