---
title: "Analyzing Hass"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Analyzing Hass}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 5,
  fig.width = 8
)
```

The {avocado} package provides weekly summaries of Hass avocado sales from January 2017 through November 2020. The package contains three datasets. Let's start with `hass`, which covers weekly avocado sales in cities/sub-regions across the regions (as defined by the [Hass Avocado Board](https://hassavocadoboard.com)) of the contiguous US.

The `hass` dataset can be merged with the `hass_region` dataset on the `region` variable, which appears in both. Do note that the `location` variable only consists of a sample of the cities/sub-regions found within each region. Therefore, the totals in this dataset will NOT equal the totals found in the `hass_region` dataset.
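
To make the merge and the caveat about totals concrete, here is a minimal sketch on toy data. The two small tibbles are invented stand-ins, not the packaged datasets. The sketch also assumes that `hass_region` carries a `week_ending` column and a `plu4046` volume column, so that each weekly location row matches exactly one regional row; check the actual columns before relying on this.

```{r}
library(dplyr)

# Toy stand-ins with invented values (NOT the packaged data)
hass_toy <- tibble(
  week_ending = as.Date(c("2017-01-01", "2017-01-01", "2017-01-08", "2017-01-08")),
  region      = "Great Lakes",
  location    = c("Chicago", "Detroit", "Chicago", "Detroit"),
  plu4046     = c(100, 40, 110, 45)
)

hass_region_toy <- tibble(
  week_ending = as.Date(c("2017-01-01", "2017-01-08")),
  region      = "Great Lakes",
  plu4046     = c(200, 220)
)

# Join on region (plus week_ending, an assumed key) and keep both volume columns
joined <- hass_toy %>%
  left_join(hass_region_toy,
            by = c("week_ending", "region"),
            suffix = c("_location", "_region"))

# The sampled locations add up to less than the regional figure
totals <- joined %>%
  group_by(week_ending, region) %>%
  summarize(
    sampled_total  = sum(plu4046_location),
    regional_total = first(plu4046_region),
    .groups = "drop"
  )

totals
```

In this toy example the sampled locations cover only part of the regional volume, which mirrors why location totals in `hass` should not be expected to match `hass_region`.
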

Let's start by loading the package, along with {dplyr} (for data wrangling) and {ggplot2} (for data visualization), and exploring its structure.

```{r setup}
library(avocado)
library(dplyr)
library(ggplot2)

data('hass')

dplyr::glimpse(hass)
```

## Exploratory Data Analysis

Let's begin by exploring the following topics:

- Average Sales Price for Chicago over Time
- Total Revenue for Locations within Great Lakes Region, excluding Chicago
- Total Revenue by Month by Year within Great Lakes Region, excluding Chicago

### Average Sales Price - Chicago

```{r fig.height=5, fig.width=8}
hass %>%
  filter(location == 'Chicago') %>%
  ggplot(aes(x = week_ending)) +
  geom_line(aes(y = avg_price_nonorg, color = 'Non Organic')) +
  geom_line(aes(y = avg_price_org, color = 'Organic')) +
  scale_color_manual(name = 'Type', values = c('Non Organic' = 'steelblue', 'Organic' = 'darkgreen')) +
  labs(x = 'Week Ending', y = 'Average Selling Price (US$)',
       title = 'Average Selling Price Over Time - Chicago',
       caption = 'Source: Hass Avocado Board\nNot adjusted for inflation') +
  # linewidth replaces the deprecated size argument (requires ggplot2 >= 3.4.0)
  theme(plot.background = element_rect(fill = 'grey20'),
        plot.title = element_text(color = '#FFFFFF'),
        axis.title = element_text(color = '#FFFFFF'),
        axis.text.x = element_text(color = 'grey50', angle = 45, hjust = 1),
        axis.text.y = element_text(color = 'grey50'),
        plot.caption = element_text(color = 'grey75'),
        panel.background = element_blank(),
        panel.grid.major = element_line(color = 'grey50', linewidth = 0.2),
        panel.grid.minor = element_line(color = 'grey50', linewidth = 0.2),
        legend.background = element_rect(fill = 'grey20'),
        legend.key = element_rect(fill = 'grey20'),
        legend.title = element_text(color = 'grey75'),
        legend.text = element_text(color = 'grey75'),
        legend.position = c(0.815, 0.85)
  )
```

Prices in 2017 do show some peculiarities. However, these can be attributed to a shortage of avocados.

The price trend on its own may seem a bit plain, so let's look at the total revenue for the other locations within the Great Lakes region. We exclude Chicago because it is a far larger city.

### Total Revenue for Great Lakes Region without Chicago

```{r fig.height=5, fig.width=8}
hass %>%
  filter(region == 'Great Lakes' & location != 'Chicago') %>%
  mutate(year = lubridate::year(week_ending)) %>%
  group_by(year, location) %>%
  summarize(
    # price x volume, summed over the year and scaled to millions of US$
    total_revenue = sum(
      (avg_price_nonorg * (plu4046 + plu4225 + plu4770 + small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag)) +
      (avg_price_org * (plu94046 + plu94225 + plu94770 + small_org_bag + large_org_bag + xlarge_org_bag))
    ) / 1000000,
    .groups = 'drop'
  ) %>%
  ggplot(aes(x = year, y = total_revenue, group = location)) +
  geom_bar(aes(fill = location), color = 'white', stat = 'identity', position = 'dodge') +
  labs(x = 'Year', y = 'Total Revenue (US$ millions)', fill = 'Location',
       title = 'Total Revenue within Great Lakes Region',
       caption = 'Source: Hass Avocado Board\nNot adjusted for inflation') +
  # linewidth replaces the deprecated size argument (requires ggplot2 >= 3.4.0)
  theme(plot.background = element_rect(fill = 'grey20'),
        plot.title = element_text(color = '#FFFFFF'),
        axis.title = element_text(color = '#FFFFFF'),
        axis.text.x = element_text(color = 'grey50', angle = 45, hjust = 1),
        axis.text.y = element_text(color = 'grey50'),
        plot.caption = element_text(color = 'grey75'),
        panel.background = element_blank(),
        panel.grid.major = element_line(color = 'grey50', linewidth = 0.2),
        panel.grid.minor = element_line(color = 'grey50', linewidth = 0.2),
        legend.background = element_rect(fill = 'grey20'),
        legend.key = element_rect(fill = 'grey20'),
        legend.title = element_text(color = 'grey75'),
        legend.text = element_text(color = 'grey75')
  )
```

Now this is interesting! Grand Rapids had more total revenue than Detroit in 2017, but in the following years its revenue seems to have taken a big hit!

### Revenue by Month - Great Lakes

It may be interesting to drill down to the monthly level and look at total revenue by month for each year.
Is there a particular month that stands out? Once again, we'll exclude Chicago. Let's find out.

```{r fig.height=5, fig.width=8}
hass %>%
  filter(region == 'Great Lakes' & location != 'Chicago') %>%
  mutate(
    month = lubridate::month(week_ending, label = TRUE, abbr = FALSE),
    year = lubridate::year(week_ending)
  ) %>%
  group_by(year, month, location) %>%
  summarize(
    # price x volume, summed over the month and scaled to millions of US$
    total_revenue = sum(
      (avg_price_nonorg * (plu4046 + plu4225 + plu4770 + small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag)) +
      (avg_price_org * (plu94046 + plu94225 + plu94770 + small_org_bag + large_org_bag + xlarge_org_bag))
    ) / 1000000,
    .groups = 'drop'
  ) %>%
  ggplot(aes(x = month, y = total_revenue, group = location, color = location)) +
  geom_line() +
  facet_wrap(~ year, scales = 'free') +
  # the lines are mapped to color, so the legend title is set via color (not fill)
  labs(color = 'Location', x = 'Month', y = 'Total Revenue (US$ millions)',
       caption = 'Source: Hass Avocado Board\nNot adjusted for inflation',
       title = 'Total Revenue per Year by Month - Great Lakes Region') +
  # linewidth replaces the deprecated size argument (requires ggplot2 >= 3.4.0)
  theme(plot.background = element_rect(fill = 'grey20'),
        plot.title = element_text(color = '#FFFFFF'),
        axis.title = element_text(color = '#FFFFFF'),
        axis.text.x = element_text(color = 'grey50', angle = 45, hjust = 1),
        axis.text.y = element_text(color = 'grey50'),
        plot.caption = element_text(color = 'grey75'),
        panel.background = element_blank(),
        panel.grid.major = element_line(color = 'grey50', linewidth = 0.2),
        panel.grid.minor = element_line(color = 'grey50', linewidth = 0.2),
        legend.background = element_rect(fill = 'grey20'),
        legend.key = element_rect(fill = 'grey20'),
        legend.title = element_text(color = 'grey75'),
        legend.text = element_text(color = 'grey75'),
        strip.background = element_rect(fill = 'grey50'),
        strip.text = element_text(color = 'black')
  )
```

Revenue appears to decline in the winter months. This could be due to reduced supply, reduced demand, or both. During the late spring and summer months, on the other hand, revenue is at its highest. Maybe guacamole is a summer dish? Not for me!
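
Both revenue charts above repeat the same price-times-volume expression. As a minimal sketch, that expression could be factored into a small helper. The name `revenue_millions()` is hypothetical and is not a function exported by {avocado}. The one-row tibble below uses invented numbers purely to check the arithmetic. Like the chunks above, the sketch assumes the volume columns count units sold, so that price times volume approximates revenue.

```{r}
library(dplyr)

# Hypothetical helper (not part of {avocado}): adds a per-row revenue column,
# in millions of US$, using the same formula as the charts above
revenue_millions <- function(data) {
  data %>%
    mutate(
      nonorg_units = plu4046 + plu4225 + plu4770 +
        small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag,
      org_units = plu94046 + plu94225 + plu94770 +
        small_org_bag + large_org_bag + xlarge_org_bag,
      revenue = (avg_price_nonorg * nonorg_units +
                   avg_price_org * org_units) / 1e6
    )
}

# Toy row with invented values, only to verify the calculation
toy <- tibble(
  avg_price_nonorg = 1, avg_price_org = 2,
  plu4046 = 1e5, plu4225 = 1e5, plu4770 = 1e5,
  small_nonorg_bag = 1e5, large_nonorg_bag = 1e5, xlarge_nonorg_bag = 1e5,
  plu94046 = 5e4, plu94225 = 5e4, plu94770 = 5e4,
  small_org_bag = 5e4, large_org_bag = 5e4, xlarge_org_bag = 5e4
)

toy_out <- revenue_millions(toy)
toy_out$revenue
```

With a helper like this, each chart would only need `group_by()` followed by `summarize(total_revenue = sum(revenue))`, which keeps the formula in one place if it ever needs to change.
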