Let’s take a simple random sample (SRS) of Gettysburg words. The “population” is contained in the spreadsheet GettysburgPopulationCounts.csv. Carefully load this data into R:
pop <- read.csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/GettysbergPopulationCounts.csv")
head(pop)
position size word
1 1 4 Four
2 2 5 score
3 3 3 and
4 4 5 seven
5 5 5 years
6 6 3 ago,
The position variable enumerates the list of words in the population (address).
(a). Sample
Run the following command to obtain an SRS of 10 words from the 268 that are in the population:
samp <- sample(1:268, size = 10)
samp
[1] 162 189 9 157 188 120 242 179 50 59
This tells you the positions (row numbers) of your sampled words. What are your sampled positions? Why are your sampled positions different from those of other folks in class?
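Each call to sample() draws fresh random positions, which is why results differ from person to person. As a minimal sketch (the seed value 2024 and the names samp_a and samp_b are arbitrary choices for illustration, not part of the lab), fixing the random seed with set.seed() makes a draw repeatable:

```r
# Assumption: the seed 2024 is arbitrary; any integer works.
set.seed(2024)
samp_a <- sample(1:268, size = 10)   # SRS of 10 positions, drawn without replacement

set.seed(2024)                       # reset to the same seed
samp_b <- sample(1:268, size = 10)   # repeats the same draw

identical(samp_a, samp_b)            # TRUE: same seed, same sample
```

Without a shared seed, two people running sample() should expect different positions almost every time.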
(b). Get words and lengths
We will subset the data set pop to obtain only the sampled rows listed in samp. We do this using square bracket notation: `dataset[row number, column number/name]`. Leaving the column position empty keeps every column. Run the following command to find your sampled words and sizes:
pop[samp,]
position size word
162 162 2 is
189 189 6 rather
9 9 7 brought
157 157 4 what
188 188 2 is
120 120 5 brave
242 242 5 under
179 179 6 fought
50 50 2 so
59 59 1 a
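To see how the square bracket notation behaves, here is a small sketch on a toy data frame built from the six rows printed by head(pop). The names toy and rows, and the particular rows chosen, are hypothetical and only for illustration:

```r
# Toy data frame: the first six rows shown by head(pop)
toy <- data.frame(
  position = 1:6,
  size     = c(4, 5, 3, 5, 5, 3),
  word     = c("Four", "score", "and", "seven", "years", "ago,"),
  stringsAsFactors = FALSE
)

rows <- c(5, 2)                # hypothetical row choices, for illustration only

toy[rows, ]                    # rows 5 and 2, all columns (note the trailing comma)
toy[rows, "size"]              # only the size column for those rows: 5 5
toy[rows, c("word", "size")]   # two named columns, in the order requested
```

Rows come back in the order listed in the row index, which is why pop[samp, ] prints the sampled words in the order they were drawn rather than in their order in the address.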
(c). Compute your sample mean
The word lengths in part (b) are the data for your sample. You can compute your sample mean using a calculator, or using R. Let’s try R (you will find it faster!). First save the quantitative variable size in a new variable called mysize:
mysize <- pop[samp, "size"]
mysize
[1] 2 6 7 4 2 5 5 6 2 1
Then find the mean of these values:
mean(mysize)
[1] 4
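As a quick check on what mean() is doing, the sample mean is the sum of the sampled word lengths divided by the number of words sampled. A minimal sketch, using the ten values printed above (your own values will differ):

```r
# Values copied from the mysize output shown above
mysize <- c(2, 6, 7, 4, 2, 5, 5, 6, 2, 1)

sum(mysize)                    # 40
length(mysize)                 # 10
sum(mysize) / length(mysize)   # 4, the same value mean(mysize) returns
```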
How does this sample mean (from a truly random sample) compare to your sample mean from the non-random sample?
Answer: The true (population) mean word length is 4.29. Your two sample means will likely differ from each other. Since the many non-random samples generally overestimated the population mean length, it is possible (but not guaranteed) that your one non-random sample gave a mean length that is greater than your random sample’s mean length.
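One way to see why a random sample does not systematically overestimate is to list every possible sample from a very small population. The sketch below is an illustration only: it uses the sizes of the six words printed by head(pop) as a toy population (not the full 268 words), with samples of 2 words, and the names toy_sizes and all_means are hypothetical.

```r
# Toy "population": sizes of the first six words shown by head(pop)
toy_sizes <- c(4, 5, 3, 5, 5, 3)
mean(toy_sizes)                               # population mean of the toy set

# Mean word length for every possible sample of 2 words
all_means <- combn(toy_sizes, 2, FUN = mean)
length(all_means)                             # 15, which is choose(6, 2)

# Averaging over all equally likely samples recovers the population mean
all.equal(mean(all_means), mean(toy_sizes))   # TRUE
```

Individual samples still land above or below the population mean, but under simple random sampling the misses balance out on average.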