2

I am trying to automate the plotting procedure of a large dataframe matrix. The goal is to plot each column against every other column. Each column represents a variable. See also the image below.

For example: sex vs age, sex vs BMI, sex vs smoke, sex vs type and so on.

For the sake of clarity, I have simplified the problem to the image below:

Initially, I tried to plot each combination by hand, but that is a time-consuming exercise and not what I want.

I also tried this (not working):

variables = ["Sex", "Age", "BMI"]
for variable in variables:
    plt.scatter(df.variable, df.variable)
    plt.xlabel('variable')
    plt.ylabel('variable')
    plt.title('variable vs. variable')
    plt.show()

Any help is welcome!

PS: If it would be a simple exercise to incorporate a linear regression on the combination of variables as well, that would also be appreciated.

Greetings,

Nadia

3 Answers

3

What you coded plots each column against itself. What you described is a nested loop. A simple upgrade is

col_choice = ["Sex", "Age", "BMI"]

for pos, axis1 in enumerate(col_choice):   # Pick a first col
    for axis2 in col_choice[pos+1:]:       # Pick a later col (plain slice: only the name is needed)
        plt.scatter(df.loc[:, axis1], df.loc[:, axis2])
        plt.show()                         # One figure per pair

I think this generates a series acceptable to scatter.

Does that help? If you want to be more "Pythonic", then look into itertools.product or, to skip self-pairs and mirrored duplicates, itertools.combinations to generate your column choices.
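A small sketch comparing itertools.product with itertools.combinations for the column names in the question. It only builds the pairs and does no plotting; the variable names all_pairs and unique_pairs are chosen here for illustration.

```python
from itertools import combinations, product

col_choice = ["Sex", "Age", "BMI"]

# product: every ordered pair, including self-pairs such as ("Sex", "Sex")
all_pairs = list(product(col_choice, repeat=2))

# combinations: each unordered pair exactly once, no self-pairs
unique_pairs = list(combinations(col_choice, 2))

print(len(all_pairs))   # 9
print(unique_pairs)     # [('Sex', 'Age'), ('Sex', 'BMI'), ('Age', 'BMI')]
```

The combinations output is the same set of pairs the nested loop above visits, so it is probably the closer fit when each pair should be plotted once.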


3 Comments

Whoops! Sorry; you need to access the column named by the variable value axis1 ... you know how to do that?
That's because you specified a particular (constant) column, rather than taking the value from the axis variables, as my solution specifies.
Thanks, Prune. It worked. For the reader, one has to adjust for axis2, choosing the second column.
3

You could do something like this:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Create dummy dataframe, or load your own with pd.read_csv()

columns = ["sex", "age", "BMI", "smoke", "type"]
data = pd.DataFrame(np.array([[1,0,0,1,0], [23,16,94,18,24], [32, 26, 28, 23, 19], [0,1,1,1,0], [1,2,2,2,1]]).T, columns=columns)


x_col = "sex"
y_columns = ["age", "BMI", "smoke"]


for y_col in y_columns:

    # Create a fresh figure and axes for each pair of columns
    fig, ax = plt.subplots()
    ax.scatter(data[x_col], data[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))

    plt.show()

Basically, if you have your dataset saved as a .csv file, you can load it with pandas using pd.read_csv(), use the column names as keys to access the corresponding columns, and iterate over them (here I created a dummy dataframe just for illustration).

Regarding the linear regression part, you should check out the scikit-learn library. It has a lot of models for many different tasks like regression, classification, and clustering.
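As a lighter alternative to scikit-learn, a straight-line fit can also be sketched with numpy.polyfit. This is only an illustration under assumptions: the helper name scatter_with_fit and the toy values are made up here, and a degree-1 least-squares fit stands in for whatever regression model you would actually pick.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
data = pd.DataFrame({"age": [16, 18, 23, 24, 94], "BMI": [26, 23, 32, 19, 28]})


def scatter_with_fit(df, x_col, y_col):
    """Scatter y_col against x_col and overlay a least-squares line."""
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)

    # A degree-1 polynomial fit is a straight line: y = slope * x + intercept
    slope, intercept = np.polyfit(x, y, deg=1)

    fig, ax = plt.subplots()
    ax.scatter(x, y)

    # Two end points are enough to draw the fitted line
    xs = np.array([x.min(), x.max()])
    ax.plot(xs, slope * xs + intercept, color="red")

    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))
    return fig, slope, intercept


fig, slope, intercept = scatter_with_fit(data, "age", "BMI")
```

Call plt.show() afterwards to display the figure; calling the helper inside the loop over y_columns would give one fitted plot per column pair.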

1 Comment

Thanks for your help neko. As for now, pycharm is not providing the plots for some reason...
0

You could use combinations from itertools. This way you will get an iterator with tuples of the combinations.

from itertools import combinations


print(list(combinations(df.columns, 2)))

The code you need would look like this:

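from itertools import combinations

import matplotlib.pyplot as plt

# Each unordered pair of columns is plotted exactly once
for col1, col2 in combinations(df.columns, 2):
    plt.scatter(df[col1], df[col2])
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()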
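If one window per pair gets unwieldy, the same combinations can be drawn side by side in a single figure. The following is a rough sketch, not a definitive layout: the small df is a hypothetical stand-in for your own dataframe, and the one-row grid and figure size are arbitrary choices.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the real dataframe
df = pd.DataFrame({
    "Sex": [1, 0, 0, 1, 0],
    "Age": [23, 16, 94, 18, 24],
    "BMI": [32, 26, 28, 23, 19],
})

pairs = list(combinations(df.columns, 2))

# One row with one subplot per pair; squeeze=False keeps axes two-dimensional
fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 4), squeeze=False)

for ax, (col1, col2) in zip(axes.ravel(), pairs):
    ax.scatter(df[col1], df[col2])
    ax.set_xlabel(col1)
    ax.set_ylabel(col2)
    ax.set_title(f"{col1} vs {col2}")

fig.tight_layout()
```

Call plt.show() at the end to display the grid. For many columns the number of pairs grows quickly, so a multi-row grid would likely be needed.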
