How do I make a correlation matrix for each subset of a column of my pandas dataframe?

Question

Here’s the head of my dataframe:

There are 100 different loggers and 10 different years. I want to subset the table by logger and find the Pearson correlation values for year by avg_max_temp, avg_min_temp, and tot_precipitation for each logger. Because there are 100 loggers, I’d expect the resulting dataframe to have 100 rows of 3 output columns as well as a column for the logger ID.

Here’s how I would do this analysis for all the data combined:

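import pandas as pd

# Create a new dataframe with the correlation values
# (numeric_only=True skips the string 'logger' column; needs pandas 1.5+)
corr_df = df.corr(method='pearson', numeric_only=True)

# Keep only the 'year' row and the three weather columns
corr_df.drop(['year', 'yield'], axis=1, inplace=True)
corr_df.drop(['avg_max_temp', 'avg_min_temp', 'tot_precipitation','yield'], axis=0, inplace=True)
# Print the dataframe
corr_df.head()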

However, I can’t figure out how to do this for each of the 100 dataloggers. Any help would be hugely appreciated. Thanks in advance!

Derek O · Accepted Answer · 2023-03-14 16:46:57Z

1

You can loop through a groupby object to iterate through each portion of the df with a unique logger, and extract the Pearson correlation coefficients for each group, concatenating them together into your final corr_df DataFrame.

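group_corrs = []

for group, df_group in df.groupby('logger'):
    # Correlation matrix of the numeric columns for this logger only
    # (numeric_only=True skips the string 'logger' column; needs pandas 1.5+)
    group_corr_df = df_group.corr(method='pearson', numeric_only=True)

    # Keep only the 'year' row and the three weather columns
    group_corr_df.drop(['year', 'yield'], axis=1, inplace=True)
    group_corr_df.drop(['avg_max_temp', 'avg_min_temp', 'tot_precipitation','yield'], axis=0, inplace=True)
    group_corr_df['logger'] = group
    group_corrs.append(group_corr_df)

# Stack the one-row results into the final dataframe
corr_df = pd.concat(group_corrs)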
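As a possible alternative to the explicit loop, the same per-logger correlations with year can be sketched with groupby plus corrwith. The data below is synthetic and the logger IDs are made up; only the column names come from the question, so treat this as an illustration under those assumptions rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: column names follow the question, values are random.
rng = np.random.default_rng(0)
loggers = ["A.txt", "B.txt", "C.txt"]  # hypothetical logger IDs
years = range(2010, 2020)
df = pd.DataFrame(
    [(lg, yr) for lg in loggers for yr in years],
    columns=["logger", "year"],
)
for col in ["avg_max_temp", "avg_min_temp", "tot_precipitation", "yield"]:
    df[col] = rng.normal(size=len(df))

targets = ["avg_max_temp", "avg_min_temp", "tot_precipitation"]

# For each logger, correlate every target column with year (Pearson is the default).
corr_df = (
    df.groupby("logger")[["year"] + targets]
      .apply(lambda g: g[targets].corrwith(g["year"]))
      .reset_index()
)
print(corr_df)
```

The result has one row per logger, a logger column, and one column per weather variable, which matches the shape described in the question.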

edited Mar 14, 2023 at 16:46

answered Mar 14, 2023 at 1:52


4 Comments

Tom Over a year ago

Thanks so much for the reply here. This is almost there. It gives me a df with the three output columns. Do you know if there is a way to also include the logger's id as a fourth column? Or in other words, the string saying what group each trio of data corresponds with?

Derek O Over a year ago

@Trev we can add a new column called logger to group_corr_df just before we concat

Tom Over a year ago

Thank you @Derek O, that's a huge help. Is there any way you could explain how the loop works? What are group and df_group referring to in the original groupby dataframe?

Derek O Over a year ago

@Trev Sure, I'm happy to explain a bit further. You are looping through the portions of the dataframe where logger equals each unique logger value. So if the first logger value is '011072.txt', then on the first iteration of the loop, group = '011072.txt' and df_group = df[df['logger'] == '011072.txt'].
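To make that explanation concrete, here is a small sketch on a made-up frame. Only the logger column name and the '011072.txt' value come from this thread; '011073.txt' and the year values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame; '011073.txt' is a hypothetical second logger.
df = pd.DataFrame({
    "logger": ["011072.txt", "011072.txt", "011073.txt"],
    "year": [2010, 2011, 2010],
})

seen = []
for group, df_group in df.groupby("logger"):
    # group is the key for this iteration; df_group holds only the rows with that key
    same_rows = df_group.equals(df[df["logger"] == group])
    seen.append((group, len(df_group), same_rows))
    print(group, len(df_group), same_rows)
```

Each iteration yields the logger value and the matching sub-dataframe, so whatever is computed inside the loop body runs once per logger.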
