Sampling Distribution and Bootstrap

STAT 120

Bastola

Statistical Inference

Statistical inference is the process of drawing conclusions about the entire population based on information in a sample.

Motivating Example 1

Regression line of blood alcohol content (BAC) vs. number of beers

Can you drink 5 beers and stay under the 0.08 limit?

Motivating Example 2

Striking rates by race

Do the observed differences in strike rates between black and white eligible jurors indicate a potential bias, or are the differences just due to chance?

Statistic and Parameter

A parameter is a number that describes some aspect of a population.
A statistic is a number that is computed from data in a sample.

	Parameter	Statistic
Mean	\(\mu\)	\(\bar{x}\)
Proportion	\(p\)	\(\hat{p}\)
Std. Dev.	\(\sigma\)	\(s\)
Correlation	\(\rho\)	\(r\)
Slope	\(\beta\)	\(b\)

Parameter Vs. Statistic

State whether the quantity described is a parameter or a statistic, and give the correct notation.

Average household income for all households in the US, using data from the US census.

The proportion of all residents in a county who voted in the last presidential election.

The difference in proportion who have ever smoked cigarettes, between a sample of 500 people who are 60 years old and a sample of 200 people who are 25 years old.

Point Estimate (PE)

A point estimate is a single value computed from the sample data.
A sample statistic can serve as a point estimate for an unknown parameter.

Sampling Distribution

A sampling distribution is the distribution of sample statistics computed for different samples of the same size from the same population.

Sample statistics vary from sample to sample.
The sampling distribution gives us an idea of how large that variation is.
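
To make this concrete, here is a minimal R sketch that simulates a sampling distribution of the sample mean. The population is made up for illustration (it is not course data), and the names `population`, `n`, `N`, and `sample_means` are assumptions introduced here.

```r
# Hypothetical population, made up for illustration only
set.seed(120)
population <- rexp(10000, rate = 1 / 8)  # right-skewed, mean near 8

n <- 30    # sample size: individuals in each sample
N <- 1000  # number of random samples drawn from the population

# Draw N random samples of size n and record the mean of each one
sample_means <- replicate(N, mean(sample(population, size = n)))

# The histogram of these means approximates the sampling distribution
hist(sample_means,
     main = "Simulated sampling distribution",
     xlab = "Sample mean")
abline(v = mean(population), col = "red")  # population parameter
```

Each value in `sample_means` is one sample statistic; the histogram shows how much the statistic varies from sample to sample.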

Center and Shape

Center: If samples are randomly selected, the sampling distribution will be centered around the population parameter.

Shape: For most of the statistics we consider, if the sample size is large enough the sampling distribution will be symmetric and bell-shaped.

Standard Error

Uncertainty in a point estimate is measured by the standard error (SE).

The standard error of a statistic is the standard deviation of the sampling distribution
The standard error measures how much the statistic varies from sample to sample
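
As a rough sketch of this definition, the SE can be approximated in R by simulating many samples and taking the standard deviation of the resulting statistics. The population below (30% "yes") and the names `p_hats` and `se_hat` are hypothetical, chosen only for illustration.

```r
set.seed(1)

# Hypothetical population: 3000 "yes" (1) and 7000 "no" (0), so p = 0.3
population <- rep(c(1, 0), times = c(3000, 7000))

n <- 50    # sample size
N <- 2000  # number of simulated samples

# Sample proportion (p-hat) for each of the N random samples
p_hats <- replicate(N, mean(sample(population, size = n)))

# SE = standard deviation of the simulated sampling distribution
se_hat <- sd(p_hats)
se_hat
```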

A Short Demo on Sampling Distributions

Recall: Gettysburg Address

The standard error for the average word size in a random sample of 10 words is closest to

Sample Size Matters!

As the sample size increases, the variability (SE) of the sample statistics tends to decrease.
Smaller SE means the sample statistics tend to be closer to the true population parameter value!
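
One way to see this effect is to simulate the SE for several sample sizes. The sketch below uses a made-up normal population; the helper `sim_se()` and the chosen sizes are assumptions for illustration, not part of the course materials.

```r
set.seed(2)

# Hypothetical population with mean 8 and standard deviation 4
population <- rnorm(10000, mean = 8, sd = 4)

# Hypothetical helper: simulated SE of the sample mean for sample size n
sim_se <- function(n, N = 2000) {
  sd(replicate(N, mean(sample(population, size = n))))
}

sizes <- c(10, 40, 160)
se_by_n <- sapply(sizes, sim_se)

# The SE column shrinks as n grows
data.frame(n = sizes, SE = se_by_n)
```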

Sample Size vs. Simulation size

Sample size (n) = how many individuals are in the sample used to compute our stat?

Simulation size (N) = how many random samples did we take from the population to simulate the sampling distribution of our stat?

The SE of your stat gets smaller as \(n\) gets bigger.
Once you’ve simulated a couple hundred samples, the shape, center, and spread of the sampling distribution should remain about the same as you increase the simulation size \(N\).
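
The following sketch holds the sample size fixed and changes only the simulation size. The population and the helper `summarize_sim()` are hypothetical; the point is only that the center and SE barely move as the number of simulated samples grows.

```r
set.seed(3)

# Hypothetical population with mean 8 and standard deviation 4
population <- rnorm(10000, mean = 8, sd = 4)

n <- 25  # sample size stays fixed

# Hypothetical helper: center and spread of N simulated sample means
summarize_sim <- function(N) {
  means <- replicate(N, mean(sample(population, size = n)))
  c(N = N, center = mean(means), SE = sd(means))
}

# One row per simulation size
sim_summary <- t(sapply(c(200, 2000, 20000), summarize_sim))
sim_summary
```

A larger simulation size gives a smoother picture of the same sampling distribution; it does not make the statistic itself less variable.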

Random Vs. Non-random

Samples of size 5 are taken from a large population with population mean 8, and the sampling distributions for the sample means are shown. Dataset A (top) and Dataset B (bottom) were collected using different sampling methods. Which dataset (A or B) used random sampling?

Random Vs. non-random data distribution

Bootstrap

Bootstrap: Sample with replacement from the original sample, using the same sample size.

Original sample (left) to bootstrap sample (right)

Bootstrap

Original sample (left) to population (right)

Creating a bootstrap sample is the same as using the data to simulate a “population” that contains an infinite number of copies of the data.
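
A small sketch of this idea in R, using a made-up sample `x`. Since R cannot hold infinitely many copies, the hypothetical `pseudo_population` below uses 10,000 copies as a stand-in, which is only an approximation of the idea.

```r
set.seed(4)

x <- c(2, 5, 7, 8, 11)  # made-up original sample

# Option 1: sample with replacement directly from the original sample
boot_direct <- sample(x, size = length(x), replace = TRUE)

# Option 2: build a large "population" of copies of the data,
# then sample from it without replacement
pseudo_population <- rep(x, times = 10000)
boot_from_pop <- sample(pseudo_population, size = length(x))

boot_direct
boot_from_pop
```

Both approaches return five values drawn from the original data, and either one may contain repeated values.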

Bootstrap Sampling in R

Resample a set of observations with replacement.
The same data point can appear multiple times in a resample.

	Data
Original sample	\(x_1, x_2, ..., x_n\)
Resample	\(x_1^*, x_2^*, ..., x_n^*\)

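# R code: draw one bootstrap sample from the data vector x,
# using the same size as the original sample
boot <- sample(x, size = length(x), replace = TRUE)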

Bootstrap Steps

1. Generate a bootstrap sample.
2. Compute the statistic of interest for your bootstrap sample.
3. Repeat steps (1) and (2) many times, then plot the distribution of all your bootstrap statistics.

This is the bootstrap distribution!
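
The steps above can be sketched in R as follows. The data vector `x` is made up for illustration, the statistic is taken to be the sample mean, and the names `N` and `boot_means` are assumptions introduced here.

```r
set.seed(5)

x <- c(3, 4, 4, 5, 5, 6, 7, 8, 9, 11)  # made-up original sample
N <- 5000                              # number of bootstrap samples

boot_means <- replicate(N, {
  boot <- sample(x, size = length(x), replace = TRUE)  # step 1
  mean(boot)                                           # step 2
})                                                     # step 3: repeat

# Plot the bootstrap distribution
hist(boot_means,
     main = "Bootstrap distribution",
     xlab = "Bootstrap sample mean")

# Its standard deviation is commonly used as an estimate of the SE
sd(boot_means)
```

To bootstrap a different statistic, replace `mean(boot)` with the statistic of interest, for example `median(boot)`.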

Statkey Demo Page for Bootstrap

Please download the Class-Activity-7 template from Moodle and go to the class helper web page.
