Practice Problems 2

Problem 1

Gettysburg random sample

Let’s take a simple random sample (SRS) of Gettysburg words. The “population” is contained in the spreadsheet GettysburgPopulationCounts.csv. Carefully load this data into R:

pop <- read.csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/GettysbergPopulationCounts.csv")
head(pop)

  position size  word
1        1    4  Four
2        2    5 score
3        3    3   and
4        4    5 seven
5        5    5 years
6        6    3  ago,

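The position variable numbers the words in the order they appear in the population (the Gettysburg Address). The size variable is the length of each word, and word is the word itself.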

(a). Sample

Run the following command to obtain an SRS of 10 words from the 268 that are in the population:

samp <- sample(1:268, size=10)
samp

 [1] 162 189   9 157 188 120 242 179  50  59

This tells you the positions (row numbers) of your sampled words. What are your sampled positions? Why are your sampled positions different from those of other folks in class?
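The behavior of sample() can be explored with a short sketch. It assumes you want a draw you can reproduce later: set.seed() fixes the state of R's random number generator, and the seed value 2024 is an arbitrary choice made here for illustration, not part of the exercise.

```r
# Fix the random number generator state (2024 is an arbitrary, illustrative seed)
set.seed(2024)
draw1 <- sample(1:268, size = 10)   # SRS of 10 positions, drawn without replacement
draw2 <- sample(1:268, size = 10)   # a second draw continues the random stream

# Resetting the same seed reproduces the first draw exactly
set.seed(2024)
draw1_again <- sample(1:268, size = 10)

identical(draw1, draw1_again)   # TRUE: same seed, same sample
identical(draw1, draw2)         # FALSE except by extraordinary coincidence
```

Without a call to set.seed(), each R session starts from a different generator state, so the positions you get will generally not match the ones printed above.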

(b). Get words and lengths

We will subset the data set pop to obtain only the sampled rows listed in samp. We do this using square bracket notation `dataset[row number, column number/name]`. Run the following command to find your sampled words and sizes:

pop[samp,]

    position size    word
162      162    2      is
189      189    6  rather
9          9    7 brought
157      157    4    what
188      188    2      is
120      120    5   brave
242      242    5   under
179      179    6  fought
50        50    2      so
59        59    1       a
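To see how the square bracket notation behaves, here is a minimal sketch on a toy data frame. The name toy and the row vector rows are hypothetical; the six rows are copied from the head(pop) output shown earlier, so the sketch runs without downloading the full data.

```r
# Toy data frame built from the first six rows printed by head(pop)
toy <- data.frame(
  position = 1:6,
  size     = c(4, 5, 3, 5, 5, 3),
  word     = c("Four", "score", "and", "seven", "years", "ago,"),
  stringsAsFactors = FALSE
)

rows <- c(5, 2)        # hypothetical sampled positions

toy[rows, ]            # selected rows, all columns (blank after the comma)
toy[rows, "word"]      # selected rows, one column chosen by name
toy[rows, 2]           # selected rows, one column chosen by number (size)
```

Leaving the column slot empty, as in pop[samp,], keeps every column. Rows are returned in the order given in the row vector, which is why the output above is not sorted by position.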

(c). Compute your sample mean

The word lengths in part (b) are the data for your sample. You can compute your sample mean using a calculator, or using R. Let’s try R (you will find it faster!). First save the quantitative variable size in a new variable called mysize:

mysize <- pop[samp, "size"]
mysize

 [1] 2 6 7 4 2 5 5 6 2 1

Then find the mean of these values:

mean(mysize)

[1] 4
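As a check on what mean() is doing, the same value can be computed from the definition: the sum of the values divided by how many there are. This sketch retypes the ten sizes printed above so it runs on its own; total and n are names introduced here for illustration.

```r
mysize <- c(2, 6, 7, 4, 2, 5, 5, 6, 2, 1)   # the sampled word lengths shown above

total <- sum(mysize)       # adds the ten lengths: 40
n     <- length(mysize)    # number of sampled words: 10
total / n                  # 4, the same value that mean(mysize) returns
```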

How does this sample mean (from a truly random sample) compare to your sample mean from the non-random sample?

Answer: The true population mean word length is 4.29. Your two sample means will likely differ. Since the many non-random samples generally overestimated the population mean length, it is possible (but not guaranteed) that your one non-random sample gave a mean length that is greater than your random sample’s mean length.
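The idea that random samples do not systematically overestimate the mean can be illustrated on a small scale. The sketch below is a toy, not the full exercise: it uses only the six word lengths printed by head(pop), so its population mean is not the 4.29 quoted above, and the names toy_sizes, pop_mean, and all_means are introduced here for illustration. It lists every possible SRS of size 2 and averages their means.

```r
# Toy population: lengths of the first six words printed by head(pop)
toy_sizes <- c(4, 5, 3, 5, 5, 3)
pop_mean  <- mean(toy_sizes)

# Mean of every possible SRS of size 2 (choose(6, 2) = 15 samples in all)
all_means <- combn(toy_sizes, 2, FUN = mean)

range(all_means)    # individual sample means run from 3 to 5
mean(all_means)     # averaged over all samples, they match pop_mean
pop_mean
```

Any single random sample can land above or below the population mean, but across all equally likely samples the sample means balance out around it.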