pandas dataframe - merging rows by substituting values with column value

Question

Apologies for the ambiguous title.

I have a dataset of students and I want to run a clustering algorithm on the students.

The dataset is structured such that there is more than one row per student, each with the student's age, grade (9th, 10th, etc.), a single class the student is taking, and the final score in that class.

In pre-processing I apply pd.get_dummies to get one column for each class students are taking with a boolean value and the score column stays as is.

I want to merge the rows such that for each student I only have one row (because I want to cluster over students, not each row) and instead of 1 or 0 for each class, I want the final score of that class to appear in the class column and then eliminate the score column.

I will try to present an example:

Name, Age, Grade, Class, Score
John, 16, 9, Biology, 98
John, 16, 9, Algebra, 95
John, 16, 9, French, 96

Applying pd.get_dummies results in the following columns:

Name, Age, Grade, Class_Biology, Class_Algebra, Class_French, Score

I am interested in the following result:

Name, Age, Grade, Class_Biology, Class_Algebra, Class_French
John, 16, 9, 98, 95, 96

Is there a more efficient way than iterating over the rows and manually creating a new row in the dataframe for each student?

jezrael · Accepted Answer · 2018-01-24 08:35:19Z

You can use set_index + unstack + add_prefix:

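# Index by the identifying columns plus Class, keep only Score,
# then move the Class level into the columns
df = (df.set_index(['Name','Age','Grade', 'Class'])['Score']
        .unstack()
        .add_prefix('Class_')
        .reset_index()
        .rename_axis(None, axis=1))
print(df)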

   Name  Age  Grade  Class_Algebra  Class_Biology  Class_French
0  John   16      9             95             98            96
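
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same chain. The second student (Mary) and her scores are made up for illustration and are not part of the question's data; she has no Algebra row, which shows what happens to a class a student did not take.

```python
import pandas as pd

# Made-up sample data in the question's long format:
# one row per student per class. "Mary" is a hypothetical second student.
df = pd.DataFrame({
    "Name":  ["John", "John", "John", "Mary", "Mary"],
    "Age":   [16, 16, 16, 15, 15],
    "Grade": [9, 9, 9, 9, 9],
    "Class": ["Biology", "Algebra", "French", "Biology", "French"],
    "Score": [98, 95, 96, 91, 88],
})

wide = (
    df.set_index(["Name", "Age", "Grade", "Class"])["Score"]
      .unstack()                   # last index level (Class) becomes the columns
      .add_prefix("Class_")        # Algebra -> Class_Algebra, ...
      .reset_index()               # Name, Age, Grade back to ordinary columns
      .rename_axis(None, axis=1)   # drop the leftover "Class" label on the columns
)
print(wide)
```

A class a student did not take comes out as NaN, so those score columns become float; they may need filling (for example with fillna) before clustering. Also note that unstack raises an error if the same Name/Age/Grade/Class combination appears more than once.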

7 Comments

sa_zy Over a year ago

Thank you very much! Is there a way to do this without listing the specific columns? I didn't want to make a complex example, so apologies if this was relevant information: there are other columns that don't have to be transformed. For example, participation in extracurriculars. Suppose we have gymnastics, swimming and soccer. With pd.get_dummies they become activity_gymnastics, etc., and they should remain binary columns.

jezrael Over a year ago

It is possible to select the columns by position only: change (df.set_index(['Name','Age','Grade', 'Class'])['Score'] to (df.set_index(df.columns[:4].tolist())[df.columns[4]].
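
A minimal sketch of that positional variant, assuming the first four columns are the identifiers plus Class and the fifth is Score, as in the question's example. The names key_cols and value_col and the sample rows are made up for illustration.

```python
import pandas as pd

# Same layout as the question: identifiers, then Class, then Score.
df = pd.DataFrame({
    "Name":  ["John", "John", "John"],
    "Age":   [16, 16, 16],
    "Grade": [9, 9, 9],
    "Class": ["Biology", "Algebra", "French"],
    "Score": [98, 95, 96],
})

key_cols = df.columns[:4].tolist()   # ['Name', 'Age', 'Grade', 'Class']
value_col = df.columns[4]            # 'Score'

wide = (
    df.set_index(key_cols)[value_col]
      .unstack()                          # the last key column (Class) becomes the columns
      .add_prefix(f"{key_cols[-1]}_")     # prefix built from that column's name
      .reset_index()
      .rename_axis(None, axis=1)
)
print(wide)
```

This only works when the column order is fixed and known; if columns are reordered upstream, the slice silently picks different keys.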

jezrael Over a year ago

I see you edited your comment. Can you update your data with the expected output?

sa_zy Over a year ago

Thank you! I see I was a little unclear in my example (made one up since I can't disclose true details). A better example would be:

sa_zy Over a year ago

The previous comment got messed up. I have around 60 columns, only a few of which I want to convert to this format. So far, pandas pivot_table is doing a good job (the data looks similar to yours) but presents a different challenge that I hope to sort out:

pd.pivot_table(df, index=['Name'], columns=['Score'], values=['Age','Grade']

      Score                  Age                    Grade
Class Algebra Biology French Algebra Biology French Algebra Biology French
Name
John  95      98      96     16      16      16     9       9       9
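
One possible way around the repeated Age/Grade columns, sketched under assumptions rather than taken from the thread: pivot only Score by Class, collapse every other column to one value per student, and join the two pieces. The activity_* columns and the second student are made-up examples, and taking the first value per student assumes those columns are constant within a student.

```python
import pandas as pd

# Hypothetical data: the activity_* columns stand in for the dummy
# columns that should stay binary; "Mary" is a made-up second student.
df = pd.DataFrame({
    "Name":  ["John", "John", "John", "Mary", "Mary"],
    "Age":   [16, 16, 16, 15, 15],
    "Grade": [9, 9, 9, 9, 9],
    "activity_swimming": [1, 1, 1, 0, 0],
    "activity_soccer":   [0, 0, 0, 1, 1],
    "Class": ["Biology", "Algebra", "French", "Biology", "French"],
    "Score": [98, 95, 96, 91, 88],
})

id_col = "Name"

# Pivot only the score; aggfunc decides what happens if a
# student/class pair appears more than once (here: the mean).
scores = (
    df.pivot_table(index=id_col, columns="Class", values="Score", aggfunc="mean")
      .add_prefix("Class_")
)

# Every remaining column is reduced to one value per student.
other_cols = df.columns.difference([id_col, "Class", "Score"]).tolist()
others = df.groupby(id_col)[other_cols].first()

wide = others.join(scores).reset_index().rename_axis(None, axis=1)
print(wide)
```

Only the pivoted columns have to be named here; the other columns are picked up automatically, which matters with around 60 of them.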
