Tidy Tuesday Exercise 2

Tidy Tuesday Part 2: Week of April 11, 2023

This weeks data is called: US Egg Production Data 2007-2021 and comes from The Humane League Labs US Egg Production Dataset.

Accessing Data

# Get the Data

# Read in directly

eggproduction  <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-11/egg-production.csv')
Rows: 220 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): prod_type, prod_process, source
dbl  (2): n_hens, n_eggs
date (1): observed_month

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cagefreepercentages <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-11/cage-free-percentages.csv')
Rows: 96 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): source
dbl  (2): percent_hens, percent_eggs
date (1): observed_month

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#get a summary of data
summary(eggproduction)
 observed_month        prod_type         prod_process           n_hens         
 Min.   :2016-07-31   Length:220         Length:220         Min.   : 13500000  
 1st Qu.:2017-09-30   Class :character   Class :character   1st Qu.: 17284500  
 Median :2018-11-15   Mode  :character   Mode  :character   Median : 59939500  
 Mean   :2018-11-14                                         Mean   :110839873  
 3rd Qu.:2019-12-31                                         3rd Qu.:125539250  
 Max.   :2021-02-28                                         Max.   :341166000  
     n_eggs             source         
 Min.   :2.981e+08   Length:220        
 1st Qu.:4.240e+08   Class :character  
 Median :1.155e+09   Mode  :character  
 Mean   :2.607e+09                     
 3rd Qu.:2.963e+09                     
 Max.   :8.601e+09                     
summary(cagefreepercentages)
 observed_month        percent_hens    percent_eggs       source         
 Min.   :2007-12-31   Min.   : 3.20   Min.   : 9.557   Length:96         
 1st Qu.:2017-05-23   1st Qu.:13.46   1st Qu.:14.521   Class :character  
 Median :2018-11-15   Median :17.30   Median :16.235   Mode  :character  
 Mean   :2018-05-12   Mean   :17.95   Mean   :17.095                     
 3rd Qu.:2020-02-28   3rd Qu.:23.46   3rd Qu.:19.460                     
 Max.   :2021-02-28   Max.   :29.20   Max.   :24.546                     
                                      NA's   :42                         
#Add some libraries

library(ggplot2)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
✔ purrr   1.0.1      
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Data Dictionary

I am copying and pasting the data dictionaries from the Tidy Tuesday github for reference here:

egg-production.csv

variable class description
observed_month double Month in which report observations are collected,Dates are recorded in ISO 8601 format YYYY-MM-DD
prod_type character type of egg product: hatching, table eggs
prod_process character type of production process and housing: cage-free (organic), cage-free (non-organic), all. The value ‘all’ includes cage-free and conventional housing.
n_hens double number of hens produced by hens for a given month-type-process combo
n_eggs double number of eggs producing eggs for a given month-type-process combo
source character Original USDA report from which data are sourced. Values correspond to titles of PDF reports. Date of report is included in title.

cage-free-percentages.csv

variable class description
observed_month double Month in which report observations are collected,Dates are recorded in ISO 8601 format YYYY-MM-DD
percent_hens double observed or computed percentage of cage-free hens relative to all table-egg-laying hens
percent_eggs double computed percentage of cage-free eggs relative to all table eggs,This variable is not available for data sourced from the Egg Markets Overview report
source character Original USDA report from which data are sourced. Values correspond to titles of PDF reports. Date of report is included in title.


#Some cleaning

#Separating out hatching and table eggs

hatching_eggsnhens <- subset(eggproduction, prod_type=='hatching eggs',c(observed_month, prod_process, n_eggs, n_hens)) 
hatching_eggsnhens$eggsperhen <- hatching_eggsnhens$n_eggs / hatching_eggsnhens$n_hens

hatching_eggsnhens
# A tibble: 55 × 5
   observed_month prod_process     n_eggs   n_hens eggsperhen
   <date>         <chr>             <dbl>    <dbl>      <dbl>
 1 2016-07-31     all          1147000000 57975000       19.8
 2 2016-08-31     all          1142700000 57595000       19.8
 3 2016-09-30     all          1093300000 57161000       19.1
 4 2016-10-31     all          1126700000 56857000       19.8
 5 2016-11-30     all          1096600000 57116000       19.2
 6 2016-12-31     all          1132900000 57750000       19.6
 7 2017-01-31     all          1123400000 57991000       19.4
 8 2017-02-28     all          1014500000 58286000       17.4
 9 2017-03-31     all          1128500000 58735000       19.2
10 2017-04-30     all          1097200000 59072000       18.6
# … with 45 more rows
table_eggsnhens <- subset(eggproduction, prod_type=='table eggs',c(observed_month,prod_process, n_eggs, n_hens)) 
table_eggsnhens$eggsperhen <- table_eggsnhens$n_eggs / table_eggsnhens$n_hens

tablehens1<- subset(eggproduction, prod_type=='table eggs',c(observed_month, prod_process, n_hens))  
tablehens2 <- tablehens1 %>% pivot_wider(names_from = prod_process, values_from = n_hens)%>% rename("all-table-n_hens" = "all", "cage-free-organic-table-n_hens" = "cage-free (organic)",  "cage-free-non-organic-table-n_hens" = "cage-free (non-organic)"       )


tableeggs1 <- subset(eggproduction, prod_type=='table eggs',c(observed_month, prod_process, n_eggs))  
tableeggs2 <- tableeggs1 %>% pivot_wider(names_from = prod_process, values_from = n_eggs)%>% rename("n_eggs_all" = "all", "n_eggs_cfo" = "cage-free (organic)",  "n_eggs_cfno" = "cage-free (non-organic)")



combined_data <-full_join(hatching_eggsnhens, table_eggsnhens, by = "observed_month")



cleandata <-full_join(combined_data, cagefreepercentages, by = "observed_month") %>% rename("percent all table egg laying hens cage free" = "percent_hens", "percent all table eggs organic" = "percent_eggs")                                                               

Data Explorations

Lets look at table hatching eggs and hens over time

ggplot(hatching_eggsnhens, aes(x=observed_month, y=eggsperhen, color="pink")) +geom_path() +geom_point()

It appears certain months have low egg production

#format as month
hatching_eggsnhens$month <- format(as.Date(hatching_eggsnhens$observed_month, format="%d/%m/%Y"),"%m")

ggplot(hatching_eggsnhens, aes(x =month , y = eggsperhen)) + geom_boxplot(fill = "grey92") + geom_point(size = 2, alpha = .15,position = position_jitter(seed = 1, width = .2)) 

Looks like something is weird in February: the median # of eggs per month is closer to 17, compared to >18 in all other months. Maybe this is due to an increased desire for more table eggs in those months.

Maybe we’ll use this in our model. But lets also explore some table egg information.

Lets look at the number of table eggs per hen over time

table_eggsnhensall <-subset(table_eggsnhens,prod_process=='all')

ggplot(table_eggsnhensall, aes(x=observed_month, y=eggsperhen, color="pink")) +geom_path() +geom_point()

#format as month
table_eggsnhensall$month <- format(as.Date(table_eggsnhensall$observed_month, format="%d/%m/%Y"),"%m")

ggplot(table_eggsnhensall, aes(x =month , y = eggsperhen)) + geom_boxplot(fill = "grey92") + geom_point(size = 2, alpha = .15,position = position_jitter(seed = 1, width = .2)) 

THe number of eggs per hen for table eggs is also lower in February! Strange. Its also interesting to note that the number of eggs per hen is much higher for table eggs than hatching eggs!

Lets also look at the number of table eggs in total over time

ggplot(tableeggs1, aes(x=observed_month, y=n_eggs, 
                      group=prod_process, color=prod_process)) +geom_path() +geom_point()

Some key take aways:

  1. Interestingly, looks like all table eggs was on an upward trend… until 2020.

  2. Cage free non-organic eggs seem to be increasing production faster than organic eggs

Some potential analyses:

  1. Potential analysis: ITS model looking at linear trend in egg production until March 2020 (start of COVID-19 pandemic)

  2. Difference in Difference model looking at slope of organic versus non-organic eggs

Lets look at a pre COVID-19 and post-COVID-19

#Add covid indicator for any month in March 2020 forward

tableeggs2$cdate <- as.Date("2020-02-28")
tableeggs2$covid_ind = ifelse(tableeggs2$observed_month>tableeggs2$cdate,1,0)
tableeggs2$n_eggs_all_m <- tableeggs2$n_eggs_all/1000000


tableeggs2$mindate = min(tableeggs2$observed_month)
tableeggs2$daysmodel = as.numeric (tableeggs2$observed_month-tableeggs2$mindate)
tableeggs2$interaction <- ifelse(tableeggs2$observed_month > tableeggs2$cdate,tableeggs2$daysmodel,0)

#Plotting interrupted time series model
ggplot(tableeggs2, aes(x=observed_month, y=n_eggs_all_m, 
                      group=factor(covid_ind),
                      color=factor(covid_ind))) +geom_point() + geom_vline(xintercept = tableeggs2$cdate, linetype="dashed", 
                color = "gray2", linewidth=1)+ geom_smooth(method="lm", se=FALSE) + ggtitle("Plot of table eggs produced over time") + xlab("Date") + ylab("Number of eggs (in millions)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 1 rows containing missing values (`geom_point()`).

Model approach

Here are my steps:

  1. Data cleaning for the model: See above.

  2. Assign 70% of data to training and 30% to test (make assignments by COVID_ind status so we have the same amount of data pre-post intervention)

  3. Calculate linear regression with the following covariates: observed_month(month in time series), covid_ind (a 0/1 indicator variable for if in teh COVID-19 time period), and an interaction between covid_ind and observed_mont

  4. Interpretation will be as follows:

    1. Parameter (observed_month): Slope (Trend) before the pandemic

    2. Parameter estimate (COVID_IND): Immediate change in egg production at start of pandemic

    3. Parameter estimate (interaction term): Change in slope (trend) during the pandemic

  5. Repeat model on test data to see if interpretation of parameter estimates are similar

  6. Summarize findings

#Modeling Approach
library(tidyverse)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.4     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
library(performance)

Attaching package: 'performance'
The following objects are masked from 'package:yardstick':

    mae, rmse
set.seed(777)

data_split <- initial_split(tableeggs2, prop = 2/3, strata=covid_ind)

# Create data frames for the two sets:
train_data <- training(data_split)
#36 obs
test_data  <- testing(data_split)
#20 obs

model_recipe <- recipe(n_eggs_all_m ~ daysmodel + covid_ind + interaction, data= train_data)

#model: linear regression using GLM engine
ln_mod <- linear_reg() %>% set_engine ("lm")
wflow <- workflow() %>% add_model(ln_mod) %>% add_recipe(model_recipe)

#Fitting the model to train dataset
model_fit_train <-  wflow %>% fit(data = train_data)
#Looking at model output;
model_fit_train %>% 
  extract_fit_parsnip() %>% 
  tidy()
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 7373.       85.1      86.6   1.51e-38
2 daysmodel      0.749     0.120     6.22  6.55e- 7
3 covid_ind   -653.     1205.       -0.542 5.92e- 1
4 interaction    0.180     0.831     0.216 8.30e- 1
#Applying the model to unseen test data
predict(model_fit_train, test_data)
# A tibble: 20 × 1
   .pred
   <dbl>
 1 7419.
 2 7464.
 3 7601.
 4 7646.
 5 7738.
 6 7761.
 7 7965.
 8 7988.
 9 8011.
10 8057.
11 8078.
12 8147.
13 8170.
14 8262.
15 8307.
16 7991.
17 8048.
18 8133.
19 8190.
20 8219.
#augment data
aug <- augment(model_fit_train, test_data)

#pull month and egg prediction
observed_daysmodel <- aug %>% pull(daysmodel)
pred <- aug %>% pull(.pred)

ggplot(aug, aes(x=.pred, y= n_eggs_all_m, group=factor(covid_ind), color=factor(covid_ind))) +
  geom_point() +
  geom_abline(intercept=0, slope=1) +
  labs(x='Predicted Values', y='Actual Values', title='Predicted vs. Actual Values')

#run rmse
rmse_vec(observed_daysmodel, pred)
[1] 7053.268

Summary of findings:

From our training dataset we estimate an increase of 0.74 million eggs per day prior to COVID-19. At the start of COVID-19 we saw an immediate decrease of 650 million eggs per day! Then, after COVID-19 we saw an increased trend of 0.18 million additional eggs per day (for a post-COVID-19 trend of 0.92 million more eggs produced each day!!).

Summary of our model fit:

In our plot of observed versus predicted values for the test data we feel confident that our model predicted well. There was potentially less good prediction in teh COVID-19 era than the pre-COVID-19 era but that is likely due to the small sample of data in that period for the model.