jurors <- read.csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/APM_DougEvansCases.csv")
Practice Problems 4
Problem 1: Flowers v. Mississippi
The data set APM_DougEvansCases.csv contains data from 1517 potential black and white jurors for 66 cases in which Doug Evans was the primary prosecutor between 1992 and 2017. These jurors were available for Doug Evans to strike using his “peremptory strikes” during the jury selection phase.
(a). Inspect data
Read in the data
# dimension of dataset
dim(jurors)
[1] 1517 6
Look at the first three rows of the data set
jurors[c(1,2,3), ]
trial__id race struck_state defendant_race same_race
1 4 Black Not struck by State White different race
2 4 Black Struck by State White different race
3 4 White Not struck by State White same race
struck_by
1 Juror chosen to serve on jury
2 Struck by the state
3 Juror chosen to serve on jury
To get the data from one variable, we use the command dataset$variable. For example, jurors$struck_state gives us the data values from the struck_state variable, which tells us if a juror was struck by the state from the jury pool. Here we can see the first 10 entries in this variable:
jurors$struck_state[1:10]
[1] "Not struck by State" "Struck by State" "Not struck by State"
[4] "Not struck by State" "Struck by State" "Not struck by State"
[7] "Struck by State" "Not struck by State" "Not struck by State"
[10] "Not struck by State"
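As a small illustration of the $ operator, the sketch below rebuilds the first three rows printed earlier as a toy data frame and pulls out one column in two equivalent ways. The name toy is hypothetical and is not part of the course data.

```r
# Toy data frame copied from the first three rows printed above
toy <- data.frame(
  race = c("Black", "Black", "White"),
  struck_state = c("Not struck by State", "Struck by State",
                   "Not struck by State")
)

# dataset$variable returns the column as a plain vector
by_dollar <- toy$struck_state

# [["name"]] is an equivalent way to pull out the same column
by_brackets <- toy[["struck_state"]]

identical(by_dollar, by_brackets)
```

Either form works on the full jurors data frame in the same way.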
(b). Table of counts and proportions
The summary command used with a data frame gives summaries of each variable:
summary(jurors)
trial__id race struck_state defendant_race
Min. : 4.0 Length:1517 Length:1517 Length:1517
1st Qu.: 52.0 Class :character Class :character Class :character
Median : 82.0 Mode :character Mode :character Mode :character
Mean :112.6
3rd Qu.:170.0
Max. :301.0
same_race struck_by
Length:1517 Length:1517
Class :character Class :character
Mode :character Mode :character
The table command gives the distribution of counts for a single categorical variable. To obtain the count table for struck_state, pass that variable to table:
counts <- table(jurors$struck_state)
counts
Not struck by State Struck by State
1084 433
We can add the prop.table command to turn these counts into proportions:
prop.table(counts)
Not struck by State Struck by State
0.7145682 0.2854318
- What proportion of eligible jurors were struck by the state from the jury pool?
Click for answer
Answer: about 28.5% of eligible jurors were struck by the state.
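To see where these proportions come from, here is a minimal sketch that starts from the two counts printed above instead of downloading the data. Each proportion is a count divided by the total, which is the same division prop.table performs.

```r
# Counts copied from the table(jurors$struck_state) output above
counts <- c("Not struck by State" = 1084, "Struck by State" = 433)

# A proportion is a count divided by the total number of jurors
props <- counts / sum(counts)
round(props, 3)

# prop.table() performs the same division
all.equal(props, prop.table(counts))
```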
(c). Bar graph for one variable
We can create a data frame count_data containing the counts and their corresponding categories.
count_data <- data.frame(counts)
count_data
Var1 Freq
1 Not struck by State 1084
2 Struck by State 433
Then, we use ggplot2 to create a bar plot with the categories on the x-axis and the counts on the y-axis. The column names of count_data are automatically assigned to be Var1 and Freq. We can change the column names to category and count, for example, as:
colnames(count_data) = c("category", "count")
count_data
category count
1 Not struck by State 1084
2 Struck by State 433
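One caveat about the automatic names, sketched below with a stand-in vector (struck is a hypothetical name) that has the same counts as struck_state: when table is called on a bare variable name, the first column is labelled with that name rather than Var1. Assigning the names explicitly therefore keeps the later plotting code predictable. Here names() is used, which is equivalent to colnames() for a data frame.

```r
# Stand-in vector with the same counts as struck_state (1084 and 433)
struck <- rep(c("Not struck by State", "Struck by State"),
              times = c(1084, 433))

# table() on a bare variable name labels the first column "struck",
# not "Var1", so we set both column names explicitly
count_data <- data.frame(table(struck))
names(count_data) <- c("category", "count")
count_data
```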
The geom_bar(stat = "identity") function is used to create the bars, and we set the y-axis label using labs().
# Create a bar plot using ggplot2
library(ggplot2) # load the package
ggplot(count_data, aes(x = category, y = count)) +
geom_bar(stat = "identity") +
labs(y = "count")
If you have the raw data rather than a table of counts, you may simply use geom_bar, which counts the rows in each category for you:
ggplot(jurors, aes(x = struck_state)) + geom_bar() + ylab("Count")
(d). Two-way tables
The first 10 entries of the race and struck_state variables are:
jurors[(1:10), (2:3)]
race struck_state
1 Black Not struck by State
2 Black Struck by State
3 White Not struck by State
4 White Not struck by State
5 Black Struck by State
6 White Not struck by State
7 Black Struck by State
8 White Not struck by State
9 White Not struck by State
10 White Not struck by State
The table command also gives two-way tables when two variables are included. Here is the two-way table for juror race and state struck status:
mytable <- table(jurors$race, jurors$struck_state)
mytable
Not struck by State Struck by State
Black 225 310
White 859 123
- How many jurors were white and were not struck by the state?
Click for answer
answer: 859
(e). Conditional proportions: state strike status by juror race
The prop.table command gives conditional proportions for a two-way table. We plug our two-way table into prop.table with a margin=1 to get proportions grouped by the row variable:
prop.table(mytable, margin = 1)
Not struck by State Struck by State
Black 0.4205607 0.5794393
White 0.8747454 0.1252546
Of all eligible black jurors, about 57.9% were struck by the state.
- What proportion of eligible white jurors were struck by the state?
Click for answer
answer: about 12.5%
- Is there evidence of an association between juror race and state strikes?
Click for answer
answer: Yes, there is an association because the rate of state strikes varies greatly by juror race: about 58% of black jurors were struck compared to only about 13% of white jurors.
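The row proportions can also be worked out by hand, which makes clear what margin = 1 means. The sketch below rebuilds the two-way table from the counts shown in part (d), so it runs without downloading the data, and divides each row by its own total.

```r
# Two-way counts copied from part (d) (rows: race, columns: strike status)
mytable <- as.table(matrix(
  c(225, 310,
    859, 123),
  nrow = 2, byrow = TRUE,
  dimnames = list(race = c("Black", "White"),
                  struck_state = c("Not struck by State", "Struck by State"))
))

# Row totals: the number of eligible jurors of each race
row_totals <- rowSums(mytable)
row_totals

# Divide every entry by the total of its own row
by_hand <- sweep(mytable, 1, row_totals, "/")
by_hand

# Largest difference from prop.table(mytable, margin = 1)
max(abs(by_hand - prop.table(mytable, margin = 1)))
```

Because each row is divided by its own total, the proportions in each row add to one.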
(f). Stacked bar graph for two variables
We can visualize the conditional distribution from part (e) with a stacked bar graph created using the ggplot2 graphing package. First, load this package’s functions with the library command:
library(ggplot2)
Now we can use the geom_bar command in this package. Here we get the conditional distribution of struck_state given race:
ggplot(jurors, aes(x = race, fill = struck_state)) +
geom_bar(position = "fill") +
labs(title = "State strikes by juror race", y = "proportion",
x = "eligible juror race", fill = "struck by state?")
The basic syntax for this function is to let ggplot know your data set name (jurors), then specify the grouping or conditional variable on the x-axis (race) in the aes (aesthetic) argument. The fill variable is the response variable (struck_state). We add (+) the geom_bar geometry to get a bar graph with the fill position specified. Adding informative labels and a title completes the graph.
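If only the two-way counts are available rather than one row per juror, a similar graph can be sketched with geom_col, which takes bar heights from a y variable instead of counting rows. This is an alternative to the approach above, not the handout's method. The data frame juror_counts and its column n are hypothetical names, and the counts are copied from the two-way table in part (d).

```r
library(ggplot2)

# One row per race/strike combination, counts copied from part (d)
juror_counts <- data.frame(
  race = c("Black", "Black", "White", "White"),
  struck_state = rep(c("Not struck by State", "Struck by State"), times = 2),
  n = c(225, 310, 859, 123)
)

# position = "fill" rescales each bar to height 1, so the segments show
# the proportion struck within each race
p <- ggplot(juror_counts, aes(x = race, y = n, fill = struck_state)) +
  geom_col(position = "fill") +
  labs(title = "State strikes by juror race", y = "proportion",
       x = "eligible juror race", fill = "struck by state?")
p
```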
(g). Conditional distribution of race grouped by strike status
We can “flip” our response and grouping variables easily (if we think it makes sense to do so). Here we specify margin=2 to get proportions grouped by the column variable:
prop.table(mytable, margin = 2)
Not struck by State Struck by State
Black 0.2075646 0.7159353
White 0.7924354 0.2840647
Notice that the proportions add to one down each column. Of all eligible jurors struck by the state, about 71.6% were black.
The stacked bar graph for this distribution is
ggplot(jurors, aes(x = struck_state, fill = race)) +
geom_bar(position = "fill") +
labs(title = "Juror race by state strikes", y = "proportion",
fill = "eligible juror race", x = "struck by state?")
- What proportion of eligible jurors who were not struck by the state were black? were white?
Click for answer
Answer: Of all jurors not struck by the state, about 20.8% were black and about 79.2% were white.
Problem 2: Graduate programs acceptance and sex
How are grad school program acceptance rates associated with sex? We will look at a classic data set from Berkeley grad school applications from 1973 (Science, 1975). The data cases are applicants to four graduate programs at Berkeley during 1973. The variable result tells us if the applicant was accepted to the graduate program, sex tells us the sex of the applicant (male or female), and program tells us program type (programs 1, 2, 3 or 4).
grad <- read.csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/BerkeleyGrad.csv")
(a). Table of counts and proportions
prop.table(table(grad$result))
accept reject
0.4260119 0.5739881
- What proportion of applicants were accepted?
Click for answer
Answer: About 43% (1284/3014) of applicants were accepted.
(b). Two-way tables
The table
command also gives two-way tables when two variables are included. Here is the two-way table for result and sex:
table(grad$sex, grad$result)
accept reject
female 262 587
male 1022 1143
- How many applicants were females who were accepted?
Click for answer
Answer: 262 applicants were females who were accepted.
(c). Conditional proportions: acceptance given sex
The prop.table command gives conditional proportions for a two-way table. First let’s save the two-way table in an object named mytable:
mytable <- table(grad$sex, grad$result)
Then use prop.table to get the distribution of result conditioned (grouped) on applicant’s sex:
prop.table(mytable, 1)
accept reject
female 0.3085984 0.6914016
male 0.4720554 0.5279446
The value of 1 in this command tells R that you want row proportions (the denominator of each proportion is its row total).
- What proportion of females were accepted?
Click for answer
Answer: about 31% (262/(262+587))
- What proportion of males were accepted?
Click for answer
Answer: about 47% (1022/(1022+1143))
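The answers above are stated as percentages, so it can help to have R do the conversion. A minimal sketch, rebuilding the two-way table from the counts in part (b) so it runs on its own (row_pct is a hypothetical name):

```r
# Two-way counts copied from part (b) (rows: sex, columns: result)
mytable <- as.table(matrix(
  c(262, 587,
    1022, 1143),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"),
                  result = c("accept", "reject"))
))

# Row proportions times 100, rounded to one decimal place
row_pct <- round(100 * prop.table(mytable, margin = 1), 1)
row_pct
```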
(d). Bar graph for one variable
We can create a data frame count_data1 containing the counts and their corresponding categories.
counts1 <- table(grad$result)
count_data1 <- data.frame(counts1)
Then, we use ggplot2 to create a bar plot with the categories on the x-axis and the counts on the y-axis. The column names of count_data1 are automatically assigned to be Var1 and Freq. We can change the column names to category and count, for example, as:
colnames(count_data1) = c("category", "count")
count_data1
category count
1 accept 1284
2 reject 1730
The geom_bar(stat = "identity") function is used to create the bars, and we set the y-axis label using labs().
# Create a bar plot using ggplot2
library(ggplot2) # load the package
ggplot(count_data1, aes(x = category, y = count)) +
geom_bar(stat = "identity") +
labs(y = "count")
(e). Stacked bar graph for two variables
Now we can use the geom_bar command in the ggplot2 package. Here we get the conditional distribution of result given sex:
library(ggplot2) # not needed if you already loaded it for Problem 1
ggplot(grad, aes(x = sex, fill = result)) +
geom_bar(position = "fill") +
labs(y="Proportion", title = "result by sex", fill = "result?", x = "sex")
The basic syntax for this function is to let ggplot know your data set name (grad), then specify the grouping or conditional variable on the x-axis (sex) in the aes (aesthetic) argument. The fill variable is the response variable (result). We add (+) the geom_bar geometry to get a bar graph with the fill position specified. Adding informative labels and a title completes the graph.
(f). Subsetting by program type
Finally, we will repeat the previous analysis of result and sex, but this time we will divide (or subset) the data set by program type. To do this we need to know how the values of program are coded:
table(grad$program)
program1 program2 program3 program4
933 585 782 714
Here we use the filter command available from the dplyr package to get only the applicants to program 1:
library(dplyr)
grad.p1 <- filter(grad, program == "program1") # gets rows where program equals program1
head(grad.p1)
program sex result
1 program1 male accept
2 program1 male accept
3 program1 male accept
4 program1 male accept
5 program1 male accept
6 program1 male accept
dim(grad.p1)
[1] 933 3
- Repeat the filter command to get a data set for program 2 and call the new data set grad.p2. Verify that the number of rows in this data set matches the number of program 2 applicants in the original data set.
# enter R code for (f) here
grad.p2 <- filter(grad, program == "program2") # gets rows where program equals program2
head(grad.p2)
program sex result
1 program2 male accept
2 program2 male accept
3 program2 male accept
4 program2 male accept
5 program2 male accept
6 program2 male accept
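The verification step asked for above can be sketched as follows. To keep the example self-contained, it uses a stand-in data frame grad_toy (a hypothetical name) that holds only the program column, with counts copied from the table(grad$program) output. With the real data, replace grad_toy with grad.

```r
library(dplyr)

# Stand-in for grad with only the program column; the counts are copied
# from the table(grad$program) output above
grad_toy <- data.frame(
  program = rep(c("program1", "program2", "program3", "program4"),
                times = c(933, 585, 782, 714))
)

grad_toy.p2 <- filter(grad_toy, program == "program2")

# The subset should have as many rows as the program2 count in the table
nrow(grad_toy.p2) == table(grad_toy$program)[["program2"]]
```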
(g). Result by sex for program 1.
- The distribution of result conditioned on applicant’s sex for the program 1 data set is shown below.
ggplot(grad.p1, aes(x = sex, fill = result)) +
geom_bar(position = "fill") +
labs(y="Proportion", title = "result by sex for program 1",
fill = "result?", x = "sex")
Get both a table of conditional proportions (or percentages) and a stacked bar graph.
Click for answer
prop.table(table(grad.p1$sex, grad.p1$result),1)
accept reject
female 0.8240741 0.1759259
male 0.6193939 0.3806061
(h). Result by sex for program 2.
- Repeat part (g) but this time use the program 2 data set. Compare the two bar graphs for (g) and (h) and explain how they show that females have a higher acceptance rate after accounting for program type (1 or 2).
Click for answer
# enter R code for (h) here
ggplot(grad.p2, aes(x = sex, fill = result)) +
geom_bar(position = "fill") +
labs(y="Proportion", title = "result by sex for program 2",
fill = "result?", x = "sex")
prop.table(table(grad.p2$sex, grad.p2$result),1)
accept reject
female 0.6800000 0.3200000
male 0.6285714 0.3714286
Answer: For both programs 1 and 2, female applicants have a higher acceptance rate than male applicants. So after accounting for program type, female applicants have a higher acceptance rate than male applicants. Without accounting for program type, the opposite was true (see parts (c) and (e)).
Why? The confounding effect of program type, which is associated with both result and sex:
Click for answer
- females prefer to apply to programs 3 and 4 while males prefer programs 1 and 2 (more than 3 and 4).
- 44% of females applied to program 3 and 40% to program 4
- 38% of males applied to program 1 and 26% to program 2
prop.table(table(grad$sex, grad$program), 1)
program1 program2 program3 program4
female 0.12720848 0.02944641 0.44169611 0.40164900
male 0.38106236 0.25866051 0.18799076 0.17228637
- Programs 3 and 4 were much harder to get into than programs 1 and 2.
- 64% of applicants to program 1 were accepted and 63% of applicants to program 2 were accepted
- 6% of applicants to program 4 were accepted and 34% of applicants to program 3 were accepted
prop.table(table(grad$program, grad$result), 1)
accept reject
program1 0.64308682 0.35691318
program2 0.63076923 0.36923077
program3 0.34398977 0.65601023
program4 0.06442577 0.93557423
So since the majority of females applied to the toughest programs (as measured by acceptance rates), their overall rate of acceptance was lower than that of males. But when we break down these rates by program type, we see that females have higher acceptance rates than males (see the visual in part (i)).
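A rough arithmetic check of this explanation, using only the proportions printed above: suppose, hypothetically, that within each program both sexes were accepted at that program's combined rate. The overall rate for each sex would then be a weighted average of the program rates, weighted by where that sex applied. The variable names below are illustrative, and this is a thought experiment rather than a model of the actual admissions.

```r
# Share of each sex applying to programs 1 to 4 (row proportions above)
female_mix <- c(0.12720848, 0.02944641, 0.44169611, 0.40164900)
male_mix   <- c(0.38106236, 0.25866051, 0.18799076, 0.17228637)

# Acceptance rate of each program, both sexes combined (shown above)
program_rate <- c(0.64308682, 0.63076923, 0.34398977, 0.06442577)

# Hypothetical overall rates if sex had no effect within any program:
# a weighted average of program rates, weighted by where each sex applied
female_expected <- sum(female_mix * program_rate)
male_expected   <- sum(male_mix * program_rate)

round(c(female = female_expected, male = male_expected), 3)
```

Under that assumption the application mix alone gives roughly 28% for females and 48% for males, close to the observed 31% and 47% from part (c). The overall gap is therefore largely accounted for by which programs each sex applied to.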
(i). A bar graph with three variables
If we simply want to graph the relationship between result and sex for each type of program, we can avoid subsetting the data by using the facet_wrap command in ggplot2. It is one simple addition to the stacked bar graph in part (e):
ggplot(grad, aes(x = sex, fill = result)) +
geom_bar(position = "fill") +
labs(y="Proportion",
title = "result by sex for each program",
fill = "result?",
x = "sex") +
facet_wrap(~program)
- Verify that this command creates side-by-side stacked bar graphs that match your graphs in parts (g) and (h) for programs 1 and 2.
Click for answer
Answer: The graphs match.