I am trying to figure out how to return multiple columns alongside a nested aggregate (the max of a sum) in SQL.
Based on data from CDC's Serotypes of concern: Illnesses and Outbreaks, I want to know what food caused the most illnesses for each year, so each row would look like: the year, the food category, and the total number of illnesses, which is basically max(sum(No_of_illnesses)). Sample data (found in CDC link above):
| table_id | Food_category | Year_first_ill | Serotype | No_of_illnesses | No_of_outbreak | Pathogen | Yr | Year_range | Running_total_by_year_range |
|---|---|---|---|---|---|---|---|---|---|
| Pork_Adelaide_2011-2015 | Pork | 2011 | Adelaide | 0 | 0 | Salmonella | 2020 | 2011-2015 | 0 |
| Pork_Adelaide_2011-2015 | Pork | 2012 | Adelaide | 0 | 0 | Salmonella | 2020 | 2011-2015 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Chicken_Anatum_2011-2015 | Chicken | 2011 | Anatum | 0 | 0 | Salmonella | 2020 | 2011-2015 | 0 |
In the end, what I'd like returned is all three columns for the max(Total_Illnesses) grouped by year with its corresponding food category, so the result would look partially like this:
| Year | Food | Total_Illnesses |
|---|---|---|
| 2011 | Chicken | 545 |
| 2012 | Chicken | 544 |
| ... | ... | ... |
| 2022 | Beef | 384 |
| 2023 | Chicken | 113 |
To do that, I wrote a sub-query that sums the number of illnesses by food category for each year, and then tried to find the max of those sums. The two suggestions I've read online are a) grouping by both columns and b) using a window function. My two attempts:
```sql
select Year_first_ill, Food_category, max(total)
from (select Year_first_ill, Food_category, sum(No_of_illnesses) as total
      from salmonella
      group by Food_category, Year_first_ill) s
group by Food_category, Year_first_ill
```
which doesn't return the max, but essentially returns the same sums as the sub-query, only in a different, odd order. I don't know if this is relevant, but the rows are grouped by food for 2011-20 and by year for the last 3 years:
| Year | Food | Total_Illnesses |
|---|---|---|
| 2011 | Pork | 238 |
| 2012 | Pork | 14 |
| ... | ... | ... |
| 2011 | Chicken | 545 |
| 2012 | Chicken | 544 |
| ... | ... | ... |
```sql
select Year_first_ill, Food_category, MAX(total) OVER (PARTITION BY Year_first_ill)
from (select Year_first_ill, Food_category, sum(No_of_illnesses) as total
      from salmonella
      group by Food_category, Year_first_ill) s
```
which returns the correct max for each year, but repeated for each food:
| Year | Food | Total_Illnesses |
|---|---|---|
| 2011 | Turkey | 545 |
| 2011 | Pork | 545 |
| 2011 | Beef | 545 |
| 2011 | Chicken | 545 |
| ... | ... | ... |
| 2023 | Pork | 113 |
| 2023 | Chicken | 113 |
| 2023 | Beef | 113 |
I am unable to correctly return all three columns. So, how can I return multiple columns that correspond to nesting aggregate functions in SQL?
Note: I am using DB Fiddle, MySQL v8. Link to code.
```sql
select year_first_ill, food_category, total
from (select year_first_ill,
             food_category,
             sum(no_of_illnesses) as total,
             max(sum(no_of_illnesses)) over (partition by year_first_ill) as max_total
      from salmonella
      group by food_category, year_first_ill) s
where total = max_total
order by year_first_ill;
```
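Another way to pick the top food per year is to rank the per-year sums and keep the first-ranked row. This is only a sketch, assuming the `salmonella` table and column names from the question and MySQL 8 (for CTEs and window functions). The CTE names `totals` and `ranked` and the alias `rnk` are made up for illustration.

```sql
-- Step 1: total illnesses per (year, food category).
with totals as (
    select Year_first_ill, Food_category, sum(No_of_illnesses) as total
    from salmonella
    group by Year_first_ill, Food_category
),
-- Step 2: rank the food categories within each year, highest total first.
ranked as (
    select Year_first_ill, Food_category, total,
           rank() over (partition by Year_first_ill order by total desc) as rnk
    from totals
)
-- Step 3: keep only the top-ranked food category for each year.
select Year_first_ill, Food_category, total
from ranked
where rnk = 1
order by Year_first_ill;
```

`rank()` keeps every food category that ties for the highest total in a year, which matches the behaviour of comparing `total` to the per-year maximum. If exactly one row per year is wanted, `row_number()` could be swapped in, at the cost of breaking ties arbitrarily unless the `order by` is extended.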