Practice Problems 25
Problem 1: Cuckoo Eggs
The common cuckoo does not build its own nest: it prefers to lay its eggs in another bird’s nest. It has been known since 1892 that cuckoo eggs differ in type between locations. A 1940 study showed that cuckoos return to the same nesting area each year and always pick the same bird species to be a “foster parent” for their eggs. Over the years, this has led to the development of geographically determined subspecies of cuckoos. These subspecies have evolved so that their eggs look as similar as possible to those of their foster parents.
The cuckoo dataset contains information on 120 cuckoo eggs, obtained from randomly selected “foster” nests. For these eggs, researchers have measured the length (in mm) and established the type (species) of foster parent. The type column is coded as follows:
type=1: Hedge Sparrow
type=2: Meadow Pipit
type=3: Pied Wagtail
type=4: European robin
type=5: Tree Pipit
type=6: Eurasian wren
The researchers want to test if the type of foster parent has an effect on the average length of the cuckoo eggs.
(1a) Boxplots of egg length for each species are shown below. Based on these boxplots, do the assumptions of normality and similar variability appear to be met?
(1b) Formally verify that the assumptions are valid by using the outputs given.
Answer: Based on the normal Q-Q plots, the data points in each group are close to the line and there are no major deviations towards the center. So, the normality assumption seems to be satisfied.
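library(dplyr)    # provides the %>% pipe
library(ggplot2)  # provides ggplot(), geom_qq(), geom_qq_line()
# Normal Q-Q plot of egg length, one panel per species
Cuckoo %>% ggplot(aes(sample = length)) + geom_qq() + geom_qq_line() + facet_grid(~species) + theme_bw()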
Similarly, based on the statistics below, the ratio of the largest \(s\) to the smallest \(s\) is \(1.57\), which is below the common rule-of-thumb cutoff of 2. So, the equal variance assumption is satisfied.
Caution: If the equal variance assumption or the normality assumption is not met in ANOVA, then the results of the one-way ANOVA may not be reliable. This is especially true if the sample sizes between the groups are unequal and the variances between the groups are also unequal.
1.0722917/0.6821229
[1] 1.571992
library(dplyr)
# Mean, standard deviation, and sample size of egg length for each species
stat <- Cuckoo %>% group_by(species) %>% summarize(mean(length), sd(length), length(length))
stat <- as.data.frame(stat)
stat
species mean(length) sd(length) length(length)
1 hedge.sparrow 23.11429 1.0494373 14
2 meadow.pipit 22.29333 0.9195849 45
3 pied.wagtail 22.88667 1.0722917 15
4 robin 22.55625 0.6821229 16
5 tree.pipit 23.08000 0.8800974 15
6 wren 21.12000 0.7542262 15
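As a quick check of the ratio quoted above, the standard deviations can be re-entered by hand from the summary table. This is only a minimal sketch: the names `sds` and `sd_ratio` are chosen here for illustration, and the cutoff of 2 is the usual rule of thumb rather than something read off the output.

```r
# Sample standard deviations copied from the summary table above
sds <- c(hedge.sparrow = 1.0494373, meadow.pipit = 0.9195849,
         pied.wagtail  = 1.0722917, robin        = 0.6821229,
         tree.pipit    = 0.8800974, wren         = 0.7542262)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(sds) / min(sds)
sd_ratio

# Rule-of-thumb check (an assumption of this sketch): ratio below 2
sd_ratio < 2
```

Working from the full vector makes it clear which two groups drive the ratio: pied wagtail has the largest spread and robin the smallest.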
(1c) Fit an ANOVA model to do a formal hypothesis test. Report the test statistics and conclude your hypothesis test.
fit_anova <- aov(length ~ species, Cuckoo)
summary(fit_anova)
             Df Sum Sq Mean Sq F value   Pr(>F)
species       5  42.81   8.562   10.45 2.85e-08 ***
Residuals   114  93.41   0.819

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
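The entries of the ANOVA table above are linked by simple arithmetic, which the following sketch reproduces from the rounded sums of squares. The variable names are illustrative only, and because the inputs are rounded, the results agree with the table only to about two decimal places.

```r
# Values copied from the ANOVA table above
ss_species <- 42.81
df_species <- 5
ss_resid   <- 93.41
df_resid   <- 114

ms_species <- ss_species / df_species  # mean square for species
ms_resid   <- ss_resid / df_resid      # mean square error (MSE)
f_stat     <- ms_species / ms_resid    # F statistic

# The p-value is the upper-tail area of the F distribution
p_value <- pf(f_stat, df1 = df_species, df2 = df_resid, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```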
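Answer: The hypotheses can be stated as:
\[H_0:\mu_1 = \mu_2 = \cdots = \mu_6\] \[H_a: \text{at least one } \mu_i \text{ is different}\]
Let’s assume the conditions for the test are approximately met. The test statistic is \(F = 10.45\) on 5 and 114 degrees of freedom, with a p-value of \(2.85 \times 10^{-8}\). Since the p-value is far below 0.05, we reject \(H_0\) and conclude that the mean egg length differs for at least one foster-parent species. To find which of the species differ from the rest, we need to construct confidence intervals for the mean length differences between each pair of species.
(1d) First, find a 95% confidence interval for the mean cuckoo egg length in European robin nests (type = 4).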
Answer:
The 95% confidence interval is:
MSE <- 0.8193847
# df = 114 is the residual degrees of freedom from the ANOVA table
stat[4,2] + c(-1,1)*qt(1-0.05/2, df=114)*sqrt(MSE)/sqrt(stat[4,4])
[1] 22.10795 23.00455
\[22.556 \pm 1.981 \cdot \frac{\sqrt{0.8194}}{\sqrt{16}}\] \[= (22.108, 23.005)\]
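The same interval can be computed from the summary numbers alone, without the `stat` data frame. The helper `ci_mean()` below is a hypothetical name introduced for this sketch; it assumes the pooled MSE from the ANOVA and takes the error degrees of freedom to be the 114 residual degrees of freedom reported in the ANOVA table.

```r
# Confidence interval for one group mean, using the pooled MSE from ANOVA
# (ci_mean is an illustrative helper, not a function from any package)
ci_mean <- function(xbar, n, mse, df_error, level = 0.95) {
  t_star <- qt(1 - (1 - level) / 2, df = df_error)  # t critical value
  xbar + c(-1, 1) * t_star * sqrt(mse / n)
}

# European robin: mean 22.55625, n = 16; MSE and df from the ANOVA table
ci_robin <- ci_mean(xbar = 22.55625, n = 16, mse = 0.8193847, df_error = 114)
round(ci_robin, 3)
```

Using the pooled MSE rather than the robin group's own standard deviation is what ties this interval to the ANOVA model.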
(1e) Find a 95% CI for the difference in mean egg length between European robin (type = 4) and Eurasian wren (type = 6) nests.
Answer:
# df = 114 is the residual degrees of freedom from the ANOVA table
(stat[4,2] - stat[6,2]) + c(-1,1) * qt(1-0.05/2, df=114) * sqrt(MSE*(1/stat[4,4] + 1/stat[6,4]))
[1] 0.7917811 2.0807189
\[(22.556 - 21.120) \pm 1.981 \cdot \sqrt{0.8194\left(\frac{1}{16} + \frac{1}{15} \right)}\]
\(=(0.792, 2.081)\)
(1f) Find a 95% CI for the difference in mean egg length between Pied Wagtail (type = 3) and European robin (type = 4) nests.
Answer:
# df = 114 is the residual degrees of freedom from the ANOVA table
(stat[3,2] - stat[4,2]) + c(-1,1) * qt(1-0.05/2, df=114) * sqrt(MSE*(1/stat[3,4] + 1/stat[4,4]))
[1] -0.3140522  0.9748855
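Both difference intervals follow the same recipe, so they can be wrapped in one small function. This is a sketch under stated assumptions: `ci_diff()` and `excludes_zero()` are hypothetical helpers, the means and sample sizes are typed in from the summary table, and the error degrees of freedom are taken as the 114 residual degrees of freedom from the ANOVA table.

```r
# Pooled-SD confidence interval for a difference in two group means
# (ci_diff is an illustrative helper, not a function from any package)
ci_diff <- function(xbar1, xbar2, n1, n2, mse, df_error, level = 0.95) {
  t_star <- qt(1 - (1 - level) / 2, df = df_error)  # t critical value
  (xbar1 - xbar2) + c(-1, 1) * t_star * sqrt(mse * (1 / n1 + 1 / n2))
}

mse      <- 0.8193847  # mean square error from the ANOVA
df_error <- 114        # residual degrees of freedom from the ANOVA table

robin_wren    <- ci_diff(22.55625, 21.12000, 16, 15, mse, df_error)
wagtail_robin <- ci_diff(22.88667, 22.55625, 15, 16, mse, df_error)

# An interval that excludes 0 points to a difference in means
excludes_zero <- function(ci) ci[1] > 0 || ci[2] < 0
excludes_zero(robin_wren)
excludes_zero(wagtail_robin)
```

The robin-wren interval lies entirely above 0, while the wagtail-robin interval contains 0, so only the first pair shows a difference at this unadjusted 95% level.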
(1g) We can use the R function pairwise.t.test to analyze which pairs of means are significantly different from one another. Using p.adjust.method = "bonferroni", we will see the p-values adjusted for multiple comparisons. These adjusted p-values should still be compared with \(\alpha = 0.05\) to find any significant differences.
Based on the R output, which of the pairs are different?
pairwise.t.test(Cuckoo$length, Cuckoo$species, p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD

data: Cuckoo$length and Cuckoo$species

             hedge.sparrow meadow.pipit pied.wagtail robin   tree.pipit
meadow.pipit 0.05554       -            -            -       -
pied.wagtail 1.00000       0.44898      -            -       -
robin        1.00000       1.00000      1.00000      -       -
tree.pipit   1.00000       0.06426      1.00000      1.00000 -
wren         5e-07         0.00045      7e-06        0.00035 5e-07

P value adjustment method: bonferroni
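The Bonferroni adjustment used in this output multiplies each raw p-value by the number of comparisons and caps the result at 1. The sketch below illustrates that rule with base R's `p.adjust()`; the raw p-values in `raw_p` are hypothetical and are not taken from the cuckoo data.

```r
k <- 6             # number of species
m <- choose(k, 2)  # number of pairwise comparisons among 6 groups
m

# Hypothetical raw (unadjusted) p-values, for illustration only
raw_p <- c(0.001, 0.004, 0.02, 0.5)

# Bonferroni by hand: multiply by the number of comparisons, cap at 1
adj_manual <- pmin(1, raw_p * m)

# Same adjustment via p.adjust(); n is the total number of comparisons
adj_builtin <- p.adjust(raw_p, method = "bonferroni", n = m)

all.equal(adj_manual, adj_builtin)
```

With 15 comparisons, a raw p-value has to be below \(0.05/15 \approx 0.0033\) for its adjusted value to fall under 0.05, which is why many entries in the table are capped at 1.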
Answer:
Based on the adjusted p-values, the five pairs of species 6-1, 6-2, 6-3, 6-4, and 6-5 (wren versus each of the other species) are different at the 5% significance level. Here, each pairwise test is testing:
\[H_0: \mu_i = \mu_j \quad \text{vs.} \quad H_a: \mu_i \neq \mu_j\]
Problem 2: Metal Contamination
An environmental studies student working on an independent research project was investigating metal contamination in a local river. The metals can accumulate in organisms that live in the river (known as bioaccumulation). He collected samples of Quagga mussels at three sites in the river and measured the concentration of copper (in micrograms per gram, or mcg/g) in the mussels. His data are summarized in the provided table and plot. He wants to know if there are any significant differences in mean copper concentration among the three sites.
| Site | Mean (\(\bar{x}\)) | SD (\(s\)) | \(n\) |
|------|------|------|------|
| 1 | 21.34 | 3.092 | 5 |
| 2 | 16.60 | 2.687 | 4 |
| 3 | 13.16 | 4.274 | 5 |
(a) Assumptions
What do we need to assume about copper concentrations to use one-way ANOVA to compare means at the three sites?
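Answer: With such small sample sizes in each group it is hard to get a good sense of how the measurements are distributed, so we will need to assume that they are approximately normally distributed at each site. We also need to assume that the sites have similar variability; the ratio of the largest to the smallest sample standard deviation, \(4.274/2.687 \approx 1.59\), is consistent with that.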
(b) One-way ANOVA hypotheses
State the hypotheses for this test.
Answer: Let \(\mu_i\) be the true mean copper concentration at location \(i\). Then
\[ H_0: \mu_{1} = \mu_{2} = \mu_{3} \]
vs. \(H_A:\) at least one mean is different.
(c) ANOVA table
Fill in the missing values A through E from the ANOVA table:
| Source | df | SS | MS | F |
|------|------|------|------|------|
| Groups | A = 2 | 169.05 | C = 84.525 | E = 6.99 |
| Error | 11 | B = 132.97 | D = 12.088 | |
| Total | 13 | 302.02 | | |
Answer:
A: The group degrees of freedom is always the number of groups minus 1. Here we have 3 groups so \(A = 3-1=2\).
B: The group and error sums of squares add up to the total sum of squares. So we have \(B = 302.02 - 169.05 = 132.97\).
C: Mean square values are always sum of squares divided by degrees of freedom. For groups MS: \(C = 169.05/2 = 84.525\)
D: Mean square values are always sum of squares divided by degrees of freedom. For error MS: \(D = 132.97/11 = 12.088\)
E: The F test statistic is the ratio of the group MS and error MS: \(F = 84.525/12.088 = 6.992\).
302.02 - 169.05
[1] 132.97
169.05/2
[1] 84.525
132.97/11
[1] 12.08818
84.525/12.088
[1] 6.992472
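The error sum of squares B can also be cross-checked against the summary table, since the error sum of squares is \(\sum (n_i - 1)s_i^2\). The sketch below does this and then rebuilds the remaining entries; the variable names are illustrative, and the total sum of squares 302.02 is taken as given from the ANOVA table.

```r
# Sample sizes and standard deviations from the site summary table
n <- c(5, 4, 5)
s <- c(3.092, 2.687, 4.274)

ss_total <- 302.02                 # given in the ANOVA table
ss_error <- sum((n - 1) * s^2)     # B: pooled within-site variation
ss_group <- ss_total - ss_error    # group sum of squares

df_group <- length(n) - 1          # A: number of groups minus 1
df_error <- sum(n) - length(n)     # total sample size minus number of groups

ms_group <- ss_group / df_group    # C
ms_error <- ss_error / df_error    # D
f_stat   <- ms_group / ms_error    # E

round(c(B = ss_error, C = ms_group, D = ms_error, E = f_stat), 3)
```

The group sum of squares is obtained by subtraction here rather than from the rounded site means, because rounding in the means would not reproduce 169.05 exactly.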
(d) p-value
The command pf(x, df1=, df2=) gives the area under the F-distribution below the value x. Use this command to get the p-value from this one-way ANOVA test. Interpret this value.
Answer: The p-value is about 1.1%. If the means are the same at the three sites, we would see sample means this different, or even more different, about 1.1% of the time.
1 - pf(6.992, df1=2, df2=11)
[1] 0.01097789
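Because the p-value is an upper-tail area, `pf()` can also return it directly through its `lower.tail` argument. The short sketch below shows that the two forms agree; the variable names are illustrative.

```r
# Upper-tail area requested directly
p_upper <- pf(6.992, df1 = 2, df2 = 11, lower.tail = FALSE)

# Complement of the lower-tail area, as in the answer above
p_complement <- 1 - pf(6.992, df1 = 2, df2 = 11)

all.equal(p_upper, p_complement)
p_upper
```

Asking for the upper tail directly tends to be numerically safer when the p-value is extremely small, since it avoids subtracting a number very close to 1.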
(e) Conclusion
What is your conclusion for this test?
Answer: We have some evidence that at least one of the true mean copper concentrations at the three sites is different from the others.
(f) Confidence interval
Compute a 95% confidence interval for the difference in means between site 1 and 3. Interpret this interval.
Answer: Since we don’t have the raw data, we will have to compute the CI by hand from the summary statistics. Using the pooled error variance from the ANOVA table (error MS \(= 12.088\)), the degrees of freedom are the error degrees of freedom, \(11\). The 95% CI for the difference in true means at sites 1 and 3 is:
\[ (21.34 - 13.16) \pm (2.201) \sqrt{12.088\left(\dfrac{1}{5} + \dfrac{1}{5}\right)} = (3.34, 13.02) \]
(21.34 - 13.16) + c(-1,1) * qt(1-0.05/2, df = 11) * sqrt(12.088*(1/5+1/5))
[1] 3.340234 13.019766
We are 95% confident that the true mean copper concentration at site 1 is 3.34 to 13.02 mcg/g higher than the true mean concentration at site 3.