Practice Problems 5

Problem 1: Sleep

This histogram shows the distribution of hours or sleep per night for a large sample of students.

library(ggplot2)
sleep <- read.csv("http://math.carleton.edu/Stats215/Textbook/SleepStudy.csv")
ggplot(sleep, aes(x=AverageSleep)) +
  geom_histogram(fill="steelblue", bins = 30) +
  labs(title = "Distribution of Sleep Hours", x = "Hours of Sleep")

(a) Estimate the average hours of sleep per night.

Click for answer

Answer: The mean is around 8 hours

(b) Use the 95% rule to estimate the standard deviation for this data.

Click for answer

Answer: Most of the data is between about 6 and 10, with a mean around 8 (due to the roughly symmetric distribution). So two standard deviations is about 2 hours of sleep, making one standard deviation about 1 hours of sleep.

Let’s check the rule! Here are the actual mean and SD:

mean(sleep$AverageSleep)
[1] 7.965929
sd(sleep$AverageSleep)
[1] 0.9648396

Problem 2: Z-scores for Test Scores

The ACT test has a population mean of 21 and standard deviation of 5. The SAT has a population mean of 1500 and a standard deviation of 325. You earned 28 on the ACT and 2100 on the SAT.

(a) Which test did you do better on?

Click for answer

Answer:

  • ACT: The z-score for the score of 28 is z = (28 - 21)/5 = 1.4.
  • SAT: The z-score for the score of 2100 is z = (2100 - 1500)/325 = 1.85.
  • The SAT score is 1.85 standard deviations above average while the ACT score is only 1.4 standard deviations above. You did better on the SAT.
z_ACT <- (28 - 21) / 5
z_SAT <- (2100 - 1500) / 325
z_ACT
[1] 1.4
z_SAT
[1] 1.846154

(b) For each test, find the interval that is likely to contain about 95% of all test scores.

Click for answer

Answer:

  • ACT: Two standard deviations is 2(5) = 10. About 95% of ACT scores are between 21 - 10 = 11 and 21 + 10 = 31. This claim assumes that ACT scores follow a bell-shaped distribution.
  • SAT: Two standard deviations is 2(325) = 650. About 95% of SAT scores are between 1500 - 650 = 850 and 1500 + 650 = 2150. This claim assumes that SAT scores follow a bell-shaped distribution.
ACT_lower <- 21 - 2 * 5
ACT_upper <- 21 + 2 * 5
SAT_lower <- 1500 - 2 * 325
SAT_upper <- 1500 + 2 * 325
c(ACT_lower, ACT_upper)
[1] 11 31
c(SAT_lower, SAT_upper)
[1]  850 2150

Problem 3: 5 number summaries

For the given vector of observations indicate whether the resulting data appear to be symmetric, skewed to the right, or skewed to the left.

(2, 10, 15, 20, 69, 34, 23, 2, 45)

my_vector <- c(2, 10, 15, 20, 69, 34, 23, 2, 45)
summary(my_vector)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00   10.00   20.00   24.44   34.00   69.00 
Click for answer

Answer: Skewed right. It has a longer right tail than left since max -Q3 >> Q1 - min

ggplot(data.frame(x=my_vector), aes(x)) + geom_boxplot()

Problem 4: Hot dog

This boxplot shows the number of hot dogs eaten by the winners of Nathan’s Famous hot dog eating contests from 2002-2011.

hotdogs <- read.csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/HotDogs.csv")
ggplot(hotdogs, aes(x = "", y = HotDogs)) +
  geom_boxplot() +
  labs(title = "Number of Hot Dogs Consumed", y = "Number of Hot Dogs") 

(a) Use the boxplot to estimate the 5 number summary and IQR for this data. Verify that there are no outliers in this data.

Click for answer

Answer:

hotdog_q1 <- quantile(hotdogs$HotDogs, 0.25); hotdog_q1
25% 
 54 
hotdog_q3 <- quantile(hotdogs$HotDogs, 0.75); hotdog_q3
75% 
 65 
hotdog_iqr <- IQR(hotdogs$HotDogs); hotdog_iqr
[1] 11
lower_fence <- hotdog_q1 - 1.5 * hotdog_iqr; lower_fence
 25% 
37.5 
upper_fence <- hotdog_q3 + 1.5 * hotdog_iqr; upper_fence
 75% 
81.5 
library(dplyr)
outliers <- filter(hotdogs, HotDogs < lower_fence | HotDogs > upper_fence)
outliers
[1] Year    HotDogs
<0 rows> (or 0-length row.names)

Problem 5: Hollywood Movies World Gross

Let’s visit the WorldGross analysis from the Hollywood movies data set:

movies <- read.csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/HollywoodMovies2011.csv")

(a) Draw a boxplot of WorldGross.

Click for answer

Answer:

ggplot(movies, aes(x = WorldGross, y = "")) +
  geom_boxplot() +
  labs(title = "World Gross of Hollywood Movies", x = "World Gross (in millions)", y ="") 

How many movies are identified as outliers for world gross?

(b) Calculating boxplot values

Use the boxplot outlier rule to find the “fence” (cutoff) between an outlier and non-outlier for WorldGross. Then determine the value (of WorldGross) that the upper “whisker” (non-outlier) extends to.

Click for answer

Answer:

library(tidyr)
movies_no_na <- drop_na(movies)   # drop missing values
q1_world_gross <- quantile(movies_no_na$WorldGross, 0.25)
q3_world_gross <- quantile(movies_no_na$WorldGross, 0.75)
iqr_world_gross <- IQR(movies_no_na$WorldGross)
lower_fence_world_gross <- q1_world_gross - 1.5 * iqr_world_gross
upper_fence_world_gross <- q3_world_gross + 1.5 * iqr_world_gross

outliers <- filter(movies_no_na, WorldGross < lower_fence_world_gross | WorldGross > upper_fence_world_gross)
outliers
                                          Movie              LeadStudio
1   Harry Potter and the Deathly Hallows Part 2             Warner Bros
2                          The Hangover Part II      Legendary Pictures
3                       Twilight: Breaking Dawn             Independent
4                Transformers: Dark of the Moon     DreamWorks Pictures
5                                           Rio        20th Century Fox
6                Rise of the Planet of the Apes        20th Century Fox
7                                    The Smurfs Sony Pictures Animation
8                               Kung Fu Panda 2    DreamWorks Animation
9  Pirates of the Caribbean:\nOn Stranger Tides                  Disney
10                           Mission Impossible               Paramount
11                            Sherlock Holmes 2             Warner Bros
12                                         Thor                  Disney
13                                       Cars 2                   Pixar
   RottenTomatoes AudienceScore             Story     Genre TheatersOpenWeek
1              96            92           Rivalry   Fantasy             4375
2              35            58            Comedy    Comedy             3615
3              26            68              Love   Romance             4061
4              35            67             Quest    Action             4088
5              71            73             Quest Animation             3826
6              83            87           Revenge    Action             3648
7              23            50 Fish Out Of Water Animation             3395
8              82            80           Rivalry Animation             3925
9              34            61             Quest    Action             4155
10             93            86           Pursuit    Action             3448
11             60            79           Pursuit    Action             3703
12             77            80     Monster Force    Action             3955
13             38            56 Fish Out Of Water Animation             4115
   BOAverageOpenWeek DomesticGross ForeignGross WorldGross Budget Profitability
1              38672        381.01       947.10   1328.111    125     10.624888
2              23775        254.46       327.00    581.464     80      7.268300
3              34012        260.80       374.00    634.800    110      5.770909
4              23937        352.39       770.81   1123.195    195      5.759974
5              10252        143.62       341.02    484.634     90      5.384822
6              15024        176.70       304.52    481.226     93      5.174473
7              10489        142.61       419.54    562.158    110      5.110527
8              12142        165.25       497.78    663.024    150      4.420160
9              21697        241.07       802.80   1043.871    250      4.175484
10              8672        197.80       336.70    534.500    145      3.686207
11             10704        179.04       261.00    440.040    125      3.520320
12             16618        181.03       267.48    448.512    150      2.990080
13             16072        191.45       360.40    551.850    200      2.759250
   OpeningWeekend
1          169.19
2           85.95
3          138.12
4           97.85
5           39.23
6           54.81
7           35.61
8           47.66
9           90.15
10          29.55
11          39.63
12          65.72
13          66.14
  1. Create a new dataset called movies_no_outliers that contains only the rows from movies_no_na where the WorldGross values are within the range defined by the lower and upper fences.
Click for answer

Answer:

library(dplyr)
movies_no_outliers <- filter(movies_no_na, WorldGross >= lower_fence_world_gross & WorldGross <= upper_fence_world_gross)

(d) Side-by-side boxplot

We can compare boxplots of WorldGross across Genre categories:

ggplot(movies, aes(x = Genre, y = WorldGross)) +
  geom_boxplot() +
  labs(title = "World Gross by Genre", x = "Genre", y = "World Gross (in millions)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • What does this type of graph illustrate well about the relationship between WorldGross and Genre?
Click for answer Answer: Does a good job comparing median values and extremes
  • What does this type of graph not illustrate well about the relationship between WorldGross and Genre?
Click for answer Answer: It doesn’t illustrate sample sizes well, e.g. the fantasy genre only has 2 movies in it