sqltableau

Honey Production in the United States from 1998-2021

Honey is one of the most delicious foods on this planet, but the little bees that make it are in danger thanks to Colony Collapse Disorder. So what's the current state of honey production in the United States?

GithubTableauKaggle
Honey Production in the United States from 1998-2021

Introduction

I love honey, and therefore bees, but the insect is in big trouble. In 2006, concern was raised across the globe regarding the rapid decline in the honeybee population due to Colony Collapse Disorder, a phenomenon of disappearing worker bees resulting in the entire hive dying. This phenomenon is still unexplained and could be due to a variety of factors including climate change, pesticide use, or an unidentified colony disease, but its effect on honey production presumably is great.

Goal

With this in mind, I want to look into the current state of honey production in the United States to determine if Colony Collapse Disorder has affected it.

Data Collection

I found a data set covering honey production in the United States from 1998-2012 that had been cleaned from the original source -- the National Agricultural Statistics Services (NASS), the primary data reporting body of the U.S. Department of Agriculture (USDA).

I decided to supplement this already clean data with more recent data. I sought out these reports on my own at USDA Economics, Statistics and Market Information System and downloaded all data from 2013-2021. While looking through the data, I noticed the USDA intermittently included other interesting figures that weren’t reported in the original data set covering 1998-2012 including additional income for honey harvesters(rates for pollination) and expenditure amounts (for various pest controls, feed for the bees, and on materials to construct additional hives). I downloaded this data as well and created a new dataset of my own.

Before cleaning the data, let's take a look at some of the main data involved:

  • numcol: number of honey producing colonies
  • yieldpercol: honey yield per colony, in pounds
  • total prod: total honey production (numcol * yieldpercol), in pounds
  • stocks: refers to stocks of honey held by producers, in pounds
  • priceperlb: price per pound of honey, in US dollars
  • prodvalue: monetary value of honey production (totalprod * priceperlb), in US dollars

Data Cleaning

I imported the data into MySQL and I began to clean the data so I can start my analysis.

I noticed the figures in my data set covering 2013-2021 and the previously collected data set covering 1998-2012 were in different units. I therefore made the necessary changes to update the data in my table to real pounds.

1UPDATE honey20132021
2SET numcol = numcol * 1000;
1UPDATE honey20132021
2SET totalprod = totalprod * 1000;
1UPDATE honey20132021
2SET prodvalue = prodvalue * 1000;

The price per pound was listed in cents, but it would be more helpful if it was in dollars.

1ALTER TABLE honey20132021
2MODIFY priceperlb FLOAT;
1UPDATE honey20132021
2SET priceperlb = priceperlb * .01;

I also noticed the original data set accidently miscalculated the stocks. This was a simple correction.

1UPDATE honey19982012
2SET stocks = stocks * .1;
1UPDATE honey20132021
2SET stocks = stocks * 1000;

Now that the values/units were set in each table, I also needed to deal with the variance in how the states were labeled. The data covering 1998-2012 had states listed by an abbreviation (Alabama as AL) while the data from 2013-2021 was listed by the state's full name (Alabama).

I decided to alter the former table to get the states listed by their full name.

Now that the data had been thoroughly cleaned and the units matched between both datasets, I merged the two tables to get started with my analysis.

1CREATE TABLE honey1998to2021
2SELECT * FROM honey19982012
3	UNION
4SELECT * FROM honey20132021;

Overview of Honey Production in the United States

Let's start by looking at the total number of bee colonies from 1998-2021 per each state.

1SELECT
2	state,
3	SUM(numcol) AS total_bee_colonies
4FROM
5	honey1998to2021
6GROUP BY stateidnum
7ORDER BY total_bee_colonies DESC;

This data has the total for all the United States. This could be helpful for later, but for this analysis, I’ve chosen to exclude it from my queries.

1SELECT
2	state,
3	SUM(numcol) AS total_bee_colonies_per_state
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY stateidnum
8ORDER BY total_bee_colonies_per_state DESC;
Graph of Total Bee Colonies per State in U.S.

North Dakota and California lap the field. In fact, North Dakota and California are home to 18,985,000 bee colonies -- 21.87% of all colonies in the U.S.

Are some states more efficient with their honey collection, though?

1SELECT
2	state,
3	AVG(yieldpercol) AS avg_yield_per_colony_per_state
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY stateidnum
8ORDER BY avg_yield_per_colony_per_state DESC;
Graph of Average honey yield per state in the U.S.

Turns out Hawai'i gets the most out of their colonies, but North Dakota still rates highly. On the other hand, California is surprisingly poor in this metric. Further research beyond the scope of this project here would be great to figure out why!

Moving on, which state produces the most honey?

1SELECT
2	state,
3	SUM(totalprod) AS total_prod_from_1998_to_2021
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY stateidnum
8ORDER BY total_prod_from_1998_to_2021 DESC;
Total amount of honey produced by state in the U.S.

No surprise that North Dakota and California are at the top with South Dakota close behind.

It should be noted, some states' data was occasionally not recorded by the USDA to avoid disclosing data for individual operations (ex. Connecticut has only one major producer). This means that it can be hard to determine the true total production for states at the bottom of the list; however, we can have good confidence in the data for the states that are big honey producers!

Of course, prices will vary greatly across the country:

1SELECT
2	state,
3	AVG(priceperlb) AS avg_price_per_pound
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY stateidnum
8ORDER BY avg_price_per_pound DESC;
Average price per pound of honey for each state in the U.S.

Honey is more expensive is less productive honey states. In fact, honey in North and South Dakota is some of the cheapest in the country. That's some good supply and demand economics if I've ever seen it!

Let's put it all together and determine which states are the economic centers of honey production from 1998-2021.

1SELECT
2	state,
3	SUM(prodvalue) AS total_economic_value
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY stateidnum
8ORDER BY total_economic_value DESC;
Total economic output of honey producing states in the U.S.

Again, no surprises with North Dakota, South Dakota, and California being the top dogs. But just how much of a strangle hold do these three states control on the U.S. honey economy?

1SELECT
2	SUM(prodvalue) AS total_economic_value
3FROM
4	honey1998to2021
5WHERE stateidnum IN (26, 31, 4);

Total economic output of South Dakota, North Dakota, and California combined = $2,277,405,000

1SELECT 
2	SUM(prodvalue) 
3FROM honey1998to2021;

Total U.S. honey market = $8,906,341,000

The top three states, South Dakota, North Dakota, and California combine for 25.57% of the total honey market in the United States.

That's a good overview of the major players in the honey market from 1998 to 2021. But we also have some great cross-sectional data. Let's see the changes that have happened over time in the United States.

In 2006 global concern was raised over the rapid decline in bee population. Can that be seen from this data?

1SELECT
2	year,
3	SUM(numcol) AS number_of_colonies_per_year
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY year
8ORDER BY year ASC;
Total number of honeybee colonies in the United States

Not really. Interestingly, in this dataset, the number of honey producing colonies refers to the maximum number of colonies from which honey was harvested from that year. This means it is possible to harvest honey from colonies that did not survive the entire year. Additionally, this data does not include the actual number of bees in each colony, so there could simply be less bees per hive. Unfortunately, I won’t be able to look further into this with the data I gathered. We can at least determine if the amount of honey produced has dropped.

1SELECT
2	year,
3	SUM(totalprod) AS total_honey_prod_per_year
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY year
8ORDER BY year ASC;
Total amount of honey produced in the United States

Honey production has tanked! The U.S. produced 57.61% less honey in 2021 than in 1998 with relatively the same number of colonies. So if the number of colonies has remained steady and the total honey production has gone down, the yield per colony also must have decreased.

1SELECT
2	year,
3	AVG(yieldpercol) AS avg_yield_per_year
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY year
8ORDER BY year ASC;
Average honey yield per colony across the United States

Check! 2021 (the last year of available data) marked the lowest average yield per colony while 1988 (the first year of data) was the highest average yield. This lack of honey production must surely have raised honey prices in the US over this time.

1SELECT
2	year,
3	AVG(priceperlb) AS avg_price_per_lb_per_year
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY year
8ORDER BY year ASC;
Average price per pound of honey in the United States

Prices have increased by 400%!

1SELECT
2	year,
3	SUM(prodvalue) AS total_prod_value_per_year
4FROM
5	honey1998to2021
6WHERE state != 'United States'
7GROUP BY year
8ORDER BY year ASC;
Total honey market in the United States

But despite less honey being produced, the total monetary value of the honey market in the United States consistently increased until relatively recently. The sky rocketing prices have more than been making up for the lack of honey...at least for those selling it.

Going Further

I mentioned previously that while collecting data for this project I also noticed additional, supplemental data that is worthwhile to briefly look at.

From 2015-2017 the USDA collected information relating to smaller producers of honey, those with less than 5 colonies. Unfortunately it's very granular data. Rather than a state by state breakdown, it's only national data. And with only three years of data, it will be hard to see any trends.

1SELECT * FROM smallfarms;
StateYearNumber of ColoniesHoney Yield Per ColonyTotal Honey Production
United States201720,00030599,000
United States201624,00031.9766,000
United States201523,00031.3720,000

Scroll to view the full table.

The number of colonies isn't that interesting, other than that it's slowly decreasing. If only there was more data! Maybe we can see a difference in efficiency between small scale producers and those with a more industrialized processes?

1SELECT
2	AVG(yieldpercol)
3FROM
4	smallfarms;
1SELECT 
2	AVG(yieldpercol)
3FROM
4	honey1998to2021;

Small farms averaged 31.06 lbs per colony. As we learned earlier, larger scall operations averaged 58.33 lbs per colony. That's a huge difference, and not in the mom-and-pop operations favor. It would be interesting to learn the methods or technological harvesting practices that contribute to this vast difference.

Since 2016 the USDA has also been collected various expenditures honey operations with more than five colonies have accrued. This data is not state specific but rather shows expenditures nationally.

1SELECT * FROM expenditures;
YearVarroa ControlOther IssuesFeedFoundationWood
202111,5650003,066,00042,582,0007,064,0009,863,000
202011,706,0003,166,00040,930,0006,567,00011,153,000
201916,366,0004,535,00058,010,0007,887,00011,939,000
201817,788,0004,825,00065,220,0009,534,00013,646,000
201717,240,0005,919,00053,075,0009,775,0009,989,000
201616,042,0006,647,00050,307,0007,294,0009,014,000

Scroll to view the full table.

The varroa mite is one of the huge bee killers, and producers are clearly spending heavily on reducing their impact on their hives. It's unfortunate we don't have access to more data on this as their outbreak in the United States occurred in 2006 and it would have been great to see how spending on this changed over time. Interestingly, though, spending on varroa control has decreased recently, hopefully because they're becoming less of a problem.

In fact, all of the expenditures have deceased recently despite the number of colonies remaining stable. Maybe bees are slowly becoming more resilient?

Conclusions

I was relatively surprised by what I found from this data. I expected to see a dramatic change in the number of colonies and honey production starting in 2006, but instead, the number of colonies has remained relatively stable while total honey produced and the yield per colony has been decreasing since 1998. Clearly, bees' problems go further than just Colony Collapse Disorder. Despite this though, honey as a money making industry has been doing relatively well. It may not be a great time to be a bee, but it ain't bad being their human managers.

Learning Take-Aways

Data type and unit consistency. The main hurdle in this project was combining two datasets populated with different data types and mismatched units. Making sure these were identical across the two datasets was hugely important and allowed the rest of the data analysis process to be achievable.

Kaggle is good, but knowing how to collect data independently is better. It's rare to have a perfectly manicured dataset covering what you want to analyze already prepared for you so knowing how and where to find the data you need is an important skill. Exploring the USDA website was a great step in learning this process.

Visualizations make for better data analysis. Stakeholders don't have the time to sift through rows and rows of data to make sense of my conclusions. That's what good, clean visualizations are for. At their best, visualizations are simple and easy to understand representations of the data working in tandem with my analysis to make stakeholders' decision-making process easier.

© 2025 Kyle Zweng. Built by Taylor McPherson.