library(ggplot2)
sleep <- read.csv("http://math.carleton.edu/Stats215/Textbook/SleepStudy.csv")
ggplot(sleep, aes(x=AverageSleep)) +
  geom_histogram(fill="steelblue", bins = 30) +
  labs(title = "Distribution of Sleep Hours", x = "Hours of Sleep")
This histogram shows the distribution of hours or sleep per night for a large sample of students.
library(ggplot2)
sleep <- read.csv("http://math.carleton.edu/Stats215/Textbook/SleepStudy.csv")
ggplot(sleep, aes(x=AverageSleep)) +
  geom_histogram(fill="steelblue", bins = 30) +
  labs(title = "Distribution of Sleep Hours", x = "Hours of Sleep")
Answer: The mean is around 8 hours
Answer: Most of the data is between about 6 and 10, with a mean around 8 (due to the roughly symmetric distribution). So two standard deviations is about 2 hours of sleep, making one standard deviation about 1 hours of sleep.
Let’s check the rule! Here are the actual mean and SD:
mean(sleep$AverageSleep)[1] 7.965929
sd(sleep$AverageSleep)[1] 0.9648396
The ACT test has a population mean of 21 and standard deviation of 5. The SAT has a population mean of 1500 and a standard deviation of 325. You earned 28 on the ACT and 2100 on the SAT.
Answer:
z_ACT <- (28 - 21) / 5
z_SAT <- (2100 - 1500) / 325
z_ACT[1] 1.4
z_SAT[1] 1.846154
Answer:
ACT_lower <- 21 - 2 * 5
ACT_upper <- 21 + 2 * 5
SAT_lower <- 1500 - 2 * 325
SAT_upper <- 1500 + 2 * 325
c(ACT_lower, ACT_upper)[1] 11 31
c(SAT_lower, SAT_upper)[1]  850 2150
For the given vector of observations indicate whether the resulting data appear to be symmetric, skewed to the right, or skewed to the left.
(2, 10, 15, 20, 69, 34, 23, 2, 45)
my_vector <- c(2, 10, 15, 20, 69, 34, 23, 2, 45)
summary(my_vector)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00   10.00   20.00   24.44   34.00   69.00 
Answer: Skewed right. It has a longer right tail than left since max -Q3 >> Q1 - min
ggplot(data.frame(x=my_vector), aes(x)) + geom_boxplot()
This boxplot shows the number of hot dogs eaten by the winners of Nathan’s Famous hot dog eating contests from 2002-2011.
hotdogs <- read.csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/HotDogs.csv")
ggplot(hotdogs, aes(x = "", y = HotDogs)) +
  geom_boxplot() +
  labs(title = "Number of Hot Dogs Consumed", y = "Number of Hot Dogs") 
Answer:
hotdog_q1 <- quantile(hotdogs$HotDogs, 0.25); hotdog_q125% 
 54 
hotdog_q3 <- quantile(hotdogs$HotDogs, 0.75); hotdog_q375% 
 65 
hotdog_iqr <- IQR(hotdogs$HotDogs); hotdog_iqr[1] 11
lower_fence <- hotdog_q1 - 1.5 * hotdog_iqr; lower_fence 25% 
37.5 
upper_fence <- hotdog_q3 + 1.5 * hotdog_iqr; upper_fence 75% 
81.5 
library(dplyr)
outliers <- filter(hotdogs, HotDogs < lower_fence | HotDogs > upper_fence)
outliers[1] Year    HotDogs
<0 rows> (or 0-length row.names)
Let’s visit the WorldGross analysis from the Hollywood movies data set:
movies <- read.csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/HollywoodMovies2011.csv")WorldGross.Answer:
ggplot(movies, aes(x = WorldGross, y = "")) +
  geom_boxplot() +
  labs(title = "World Gross of Hollywood Movies", x = "World Gross (in millions)", y ="") 
How many movies are identified as outliers for world gross?
Use the boxplot outlier rule to find the “fence” (cutoff) between an outlier and non-outlier for WorldGross. Then determine the value (of WorldGross) that the upper “whisker” (non-outlier) extends to.
Answer:
library(tidyr)
movies_no_na <- drop_na(movies)   # drop missing values
q1_world_gross <- quantile(movies_no_na$WorldGross, 0.25)
q3_world_gross <- quantile(movies_no_na$WorldGross, 0.75)
iqr_world_gross <- IQR(movies_no_na$WorldGross)
lower_fence_world_gross <- q1_world_gross - 1.5 * iqr_world_gross
upper_fence_world_gross <- q3_world_gross + 1.5 * iqr_world_gross
outliers <- filter(movies_no_na, WorldGross < lower_fence_world_gross | WorldGross > upper_fence_world_gross)
outliers                                          Movie              LeadStudio
1   Harry Potter and the Deathly Hallows Part 2             Warner Bros
2                          The Hangover Part II      Legendary Pictures
3                       Twilight: Breaking Dawn             Independent
4                Transformers: Dark of the Moon     DreamWorks Pictures
5                                           Rio        20th Century Fox
6                Rise of the Planet of the Apes        20th Century Fox
7                                    The Smurfs Sony Pictures Animation
8                               Kung Fu Panda 2    DreamWorks Animation
9  Pirates of the Caribbean:\nOn Stranger Tides                  Disney
10                           Mission Impossible               Paramount
11                            Sherlock Holmes 2             Warner Bros
12                                         Thor                  Disney
13                                       Cars 2                   Pixar
   RottenTomatoes AudienceScore             Story     Genre TheatersOpenWeek
1              96            92           Rivalry   Fantasy             4375
2              35            58            Comedy    Comedy             3615
3              26            68              Love   Romance             4061
4              35            67             Quest    Action             4088
5              71            73             Quest Animation             3826
6              83            87           Revenge    Action             3648
7              23            50 Fish Out Of Water Animation             3395
8              82            80           Rivalry Animation             3925
9              34            61             Quest    Action             4155
10             93            86           Pursuit    Action             3448
11             60            79           Pursuit    Action             3703
12             77            80     Monster Force    Action             3955
13             38            56 Fish Out Of Water Animation             4115
   BOAverageOpenWeek DomesticGross ForeignGross WorldGross Budget Profitability
1              38672        381.01       947.10   1328.111    125     10.624888
2              23775        254.46       327.00    581.464     80      7.268300
3              34012        260.80       374.00    634.800    110      5.770909
4              23937        352.39       770.81   1123.195    195      5.759974
5              10252        143.62       341.02    484.634     90      5.384822
6              15024        176.70       304.52    481.226     93      5.174473
7              10489        142.61       419.54    562.158    110      5.110527
8              12142        165.25       497.78    663.024    150      4.420160
9              21697        241.07       802.80   1043.871    250      4.175484
10              8672        197.80       336.70    534.500    145      3.686207
11             10704        179.04       261.00    440.040    125      3.520320
12             16618        181.03       267.48    448.512    150      2.990080
13             16072        191.45       360.40    551.850    200      2.759250
   OpeningWeekend
1          169.19
2           85.95
3          138.12
4           97.85
5           39.23
6           54.81
7           35.61
8           47.66
9           90.15
10          29.55
11          39.63
12          65.72
13          66.14
movies_no_outliers that contains only the rows from movies_no_na where the WorldGross values are within the range defined by the lower and upper fences.Answer:
library(dplyr)
movies_no_outliers <- filter(movies_no_na, WorldGross >= lower_fence_world_gross & WorldGross <= upper_fence_world_gross)We can compare boxplots of WorldGross across Genre categories:
ggplot(movies, aes(x = Genre, y = WorldGross)) +
  geom_boxplot() +
  labs(title = "World Gross by Genre", x = "Genre", y = "World Gross (in millions)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
WorldGross and Genre?WorldGross and Genre?