I have the following pandas dataframe covering more than 10k answers for 150 questions.

[Image: pandas DataFrame]

I am struggling to find a way to see the correlation between fields.

In particular I would like to understand how I can graphically show the correlation between Q015 and Q008, knowing that each respondent might have selected multiple answers (1,2,3).

So I am trying to figure out how to graphically display whether there is any correlation between Q015 and Q008 for each selected option of the survey.

Any ideas?

1 Answer

You can fit a simple linear regression of Q008 on Q015 and plot it to see whether the two are linearly related.

Necessary libraries

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Code

list_variables, list_COEF, list_MSE, list_RMSE, list_R2SCORE = ([] for _ in range(5))

# Split the data, then fit a linear regression of Q008 on Q015
xtrain, xtest, ytrain, ytest = train_test_split(df[["Q015"]], df[["Q008"]], test_size=0.3)
lr = LinearRegression()
lr_baseline = lr.fit(xtrain, ytrain)
pred_baseline = lr_baseline.predict(xtest)

list_variables.append("Q015 & Q008")
list_COEF.append(round(lr_baseline.coef_[0,0], 4))
list_MSE.append(round(mean_squared_error(ytest, pred_baseline), 2))
list_RMSE.append(round(math.sqrt(mean_squared_error(ytest, pred_baseline)), 2))
list_R2SCORE.append(round(r2_score(ytest, pred_baseline), 2))

# Plotting the graph
plt.figure(figsize=(12,8))
ax = plt.gca()

plt.suptitle("Q015 & Q008", fontsize=24, y=0.96)
plt.plot(xtest, ytest, 'bo', markersize = 5)
plt.plot(xtest, pred_baseline, color="red", linewidth = 2)
plt.xlabel("Q015", size=14)
plt.ylabel("Q008", size=14)
plt.tight_layout()
plt.show()

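This produces a scatter plot of the test data with the fitted line in red. The coefficient stored in list_COEF is the slope of that line, that is, how much Q008 changes per unit of Q015. It is not itself the correlation; list_R2SCORE indicates how well the line fits the test data.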
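
To make the difference between the slope and the Pearson correlation concrete, here is a minimal sketch on made-up numeric data (the `demo` frame and its values are assumptions, not the survey data). It uses `np.polyfit` in place of `LinearRegression` to stay short; both fit the same least-squares line.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data standing in for the two survey columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Q015": rng.integers(0, 15, size=200).astype(float)})
demo["Q008"] = 0.5 * demo["Q015"] + rng.normal(0, 2, size=200)

# Pearson correlation coefficient: unitless, always between -1 and 1.
r = demo["Q015"].corr(demo["Q008"])

# Slope of the least-squares line: depends on the scale of both columns.
slope = np.polyfit(demo["Q015"], demo["Q008"], deg=1)[0]

# For a simple (one-predictor) regression: slope = r * std(y) / std(x).
slope_from_r = r * demo["Q008"].std() / demo["Q015"].std()

print(round(r, 3), round(slope, 3), round(slope_from_r, 3))
```

The last two printed values should agree, which shows that the slope carries the correlation only after rescaling by the two standard deviations.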

Another way is to look at the correlation matrix

df_corr = pd.DataFrame(df[["Q015", "Q008"]].corr()).round(2)
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True 

plt.figure(figsize=(10,8))
plt.title("Pearson correlation between features", size=20)

ax = sns.heatmap(df_corr, mask=mask, vmin=-1, cmap="mako_r")

plt.xticks(rotation=25, size=14, horizontalalignment="right")
plt.yticks(rotation=0, size=14)
plt.tight_layout()
plt.show()

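An example with numeric columns

# Pass a flat list of names; a nested list ([[...]]) creates MultiIndex columns and df[["Q015"]] then raises KeyError
df = pd.DataFrame(np.random.randint(0, 15, size=(100, 6)), columns=["Q01", "Q02", "Q03", "Q07", "Q015", "Q008"])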
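
Since the question mentions that respondents can select several options (1, 2, 3) for a question, the numeric approach may not apply directly. One possible sketch, assuming the selections are stored as comma-separated strings (an assumption; the actual storage format is not shown), is to expand each question into 0/1 indicator columns and compare option against option. The `answers` frame below is made-up illustration data.

```python
import pandas as pd

# Hypothetical responses: each cell lists the selected options, comma-separated.
answers = pd.DataFrame({
    "Q015": ["1,2", "2", "1,3", "3", "1,2,3"],
    "Q008": ["1", "2,3", "1", "3", "1,2"],
})

# One 0/1 indicator column per option, e.g. Q015_1, Q015_2, Q015_3.
q015 = answers["Q015"].str.get_dummies(sep=",").add_prefix("Q015_")
q008 = answers["Q008"].str.get_dummies(sep=",").add_prefix("Q008_")

# Co-occurrence counts: respondents who picked option i of Q015 and option j of Q008.
cooccurrence = q015.T.dot(q008)

# Pearson correlation between the indicator columns of the two questions.
option_corr = pd.concat([q015, q008], axis=1).corr().loc[q015.columns, q008.columns]

print(cooccurrence)
print(option_corr.round(2))
```

Either table can be passed to sns.heatmap (for instance with annot=True) in the same way as the correlation matrix above, giving one cell per pair of options.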

8 Comments

Many thanks, Samir! I tried the first method, but I am stuck on the following point: for a, b in itertools.combinations(VARIABLES, 2): raises NameError: name 'VARIABLES' is not defined. Maybe I should declare VARIABLES as the options of column 15 (1,2,3,4,5)?
Sorry, I copied and pasted an example based on my own code, where I use a dict of features. I have deleted the loop for a, b in itertools.combinations(VARIABLES, 2): from the answer. Try again and let me know.
Wait a minute, I just noticed that one of the columns allows multiple choices, so this is not the right approach for this situation. Sorry!
No problem, Samir! Thank you so much!
Indeed, if I run xtrain, xtest, ytrain, ytest = train_test_split(df[["Q015"]], df[["Q008"]], test_size=0.3) I get the error: KeyError: "None of [Index(['Q015'], dtype='object', name=0)] are in the [columns]"