Worst. Analysis. Ever: An Exploration of Simpsons Episodes in R

data analysis
tidy tuesday
r
doh
Author

Liam Cottrell

Published

February 4, 2025

Intro

I have officially undertaken my first data analysis project - looking at some Simpsons data for #TidyTuesday. I’ve been learning R for a grand total of two weeks, so this is not particularly ✨ excellent ✨ (Mr Burns voice), but we all start somewhere!


Overview

The dataset comprises the following attributes:

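# packages used throughout: dplyr for the pipe and verbs, ggplot2 for the plots
library(dplyr)
library(ggplot2)
# find the column titles (`episodes` is assumed to be loaded already)
names(episodes)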
 [1] "id"                     "image_url"              "imdb_rating"           
 [4] "imdb_votes"             "number_in_season"       "number_in_series"      
 [7] "original_air_date"      "original_air_year"      "production_code"       
[10] "season"                 "title"                  "us_viewers_in_millions"
[13] "video_url"              "views"                 

There are 600 episodes in the dataset.

# find the first and last episodes
first_ep <- episodes %>%
  filter(id == min(id)) %>%
  pull(title)

first_ep_year <- episodes %>%
  filter(id == min(id)) %>%
  pull(original_air_year)

last_ep <- episodes %>%
  filter(id == max(id)) %>%
  pull(title)

last_ep_year <- episodes %>%
  filter(id == max(id)) %>%
  pull(original_air_year)

The episodes range from Episode 1 (‘Simpsons Roasting on an Open Fire’) in 1989, to Episode 600 (‘Treehouse of Horror XXVII’) in 2016.

IMDb Ratings

What is the highest rated episode of all time?

# find highest IMDb ratings
highest_rated_one <- episodes %>%
  filter(imdb_rating == max(imdb_rating, na.rm = TRUE)) %>%
  slice(1) %>%
  pull(title)

highest_rated_one_season <- episodes %>%
  filter(imdb_rating == max(imdb_rating, na.rm = TRUE)) %>%
  slice(1) %>%
  pull(season)
  
highest_rated_two <- episodes %>%
  filter(imdb_rating == max(imdb_rating, na.rm = TRUE)) %>%
  slice(2) %>%
  pull(title)

highest_rated_two_season <- episodes %>%
  filter(imdb_rating == max(imdb_rating, na.rm = TRUE)) %>%
  slice(2) %>%
  pull(season)

highest_rating <- episodes %>%
  filter(imdb_rating == max(imdb_rating, na.rm = TRUE)) %>%
  slice(1) %>%
  pull(imdb_rating)

Two episodes share the honour of being highest rated: ‘Homer’s Enemy’ and ‘You Only Move Twice’, both from Season 8 and both scoring 9.2/10.

What is the lowest rated episode of all time?

# find lowest IMDb ratings
lowest_rated <- episodes %>%
  filter(imdb_rating == min(imdb_rating, na.rm = TRUE)) %>%
  pull(title)

lowest_rated_season <- episodes %>%
  filter(imdb_rating == min(imdb_rating, na.rm = TRUE)) %>%
  pull(season)

lowest_rating <- episodes %>%
  filter(imdb_rating == min(imdb_rating, na.rm = TRUE)) %>%
  pull(imdb_rating)

Season 23’s ‘Lisa Goes Gaga’, with just 4.5/10, is the lowest rated episode.
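
The lookups above repeat the same filter several times and rely on there being exactly two episodes tied at the top. Here is a rough sketch of a more compact alternative using dplyr's `slice_max()` and `slice_min()`, which keep ties by default. The `toy_episodes` data frame is a stand-in I made up for illustration: the three rated rows reuse figures quoted in this post, and the "Placeholder episode" row is hypothetical, there only to show that a missing rating does not get picked.

```r
library(dplyr)

# Toy stand-in for `episodes` (an assumption, not the real dataset).
# The NA row is a hypothetical unrated episode.
toy_episodes <- data.frame(
  title = c("Homer's Enemy", "You Only Move Twice",
            "Lisa Goes Gaga", "Placeholder episode"),
  season = c(8, 8, 23, 1),
  imdb_rating = c(9.2, 9.2, 4.5, NA)
)

# slice_max() returns every row tied for the top rating
top_rated <- toy_episodes %>%
  slice_max(imdb_rating, n = 1, with_ties = TRUE) %>%
  select(title, season, imdb_rating)

# slice_min() does the same for the bottom rating
bottom_rated <- toy_episodes %>%
  slice_min(imdb_rating, n = 1, with_ties = TRUE) %>%
  select(title, season, imdb_rating)

print(top_rated)
print(bottom_rated)
```

Each result is a small data frame, so the title, season and rating stay together in one object instead of living in separate variables.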

IMDb ratings over time

# plot ratings over time by season 
ggplot(episodes, aes(x = id, y = imdb_rating, color = factor(season))) +
  geom_line() +
  scale_color_viridis_d()
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_line()`).
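
That warning appears because some episodes have no IMDb rating, so ggplot2 drops them on its own. One way to make this explicit, sketched here on made-up data, is to count the missing ratings and filter them out before plotting. The `toy_episodes` values below are invented purely for illustration and are not taken from the real dataset.

```r
library(dplyr)
library(ggplot2)

# Invented toy data (an assumption): six episodes, two without a rating
toy_episodes <- data.frame(
  id = 1:6,
  season = c(1, 1, 1, 2, 2, 2),
  imdb_rating = c(8.2, NA, 7.5, 8.0, 7.9, NA)
)

# how many ratings are missing?
n_missing <- sum(is.na(toy_episodes$imdb_rating))

# keep only the rated episodes, so the plot has nothing left to drop
rated <- toy_episodes %>%
  filter(!is.na(imdb_rating))

# same style of plot as above, built on the filtered data
ratings_plot <- ggplot(rated, aes(x = id, y = imdb_rating, color = factor(season))) +
  geom_line() +
  scale_color_viridis_d(name = "Season")

print(n_missing)
print(ratings_plot)
```

Reporting `n_missing` next to the plot tells the reader how many episodes were left out, rather than leaving that to a warning message.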

Episode Titles

# find character names in episode titles
main_characters <- c("Homer", "Marge", "Bart", "Lisa", "Maggie")
name_counts <- sapply(main_characters, function(p) sum(grepl(p, episodes$title)))
name_freq <- data.frame(Name = names(name_counts), Count = name_counts)

Whose name appears most in episode titles?

# visualise frequency of character names appearing in episode titles
ggplot(name_freq, aes(x = Name, y = Count)) +
  geom_col() +
  theme_minimal()
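
One thing to keep in mind with the counts above: `grepl(p, episodes$title)` matches the name anywhere in the title, so "Bart" is also counted when it sits inside a longer word. Whether that should count is a judgement call. As a sketch of the alternative, the snippet below compares the substring match with a whole-word match using `\\b` word boundaries. The four titles are a small hand-picked sample for illustration, not the full dataset.

```r
# Small hand-picked sample of titles (an assumption for illustration only)
titles <- c("Bart the Genius", "Barting Over",
            "Lisa the Vegetarian", "Homer's Enemy")
main_characters <- c("Homer", "Marge", "Bart", "Lisa", "Maggie")

# substring match: counts any title containing the name
loose_counts <- sapply(main_characters, function(p) {
  sum(grepl(p, titles, fixed = TRUE))
})

# whole-word match: \\b requires a word boundary on both sides of the name
strict_counts <- sapply(main_characters, function(p) {
  sum(grepl(paste0("\\b", p, "\\b"), titles, perl = TRUE))
})

print(loose_counts)
print(strict_counts)
```

In this sample the substring match counts "Barting Over" for Bart and the whole-word match does not, while "Homer's Enemy" counts for Homer either way because the apostrophe ends the word.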