Apologies for the ambiguous title.

I have a dataset of students and I want to run a clustering algorithm on the students.

The dataset has more than one row per student. Each row holds the student's age, grade (9th, 10th, etc.), a single class the student is taking, and the final score in that class.

In pre-processing I apply pd.get_dummies, which gives one boolean column for each class a student is taking; the score column stays as is.

I want to merge the rows such that for each student I only have one row (because I want to cluster over students, not each row) and instead of 1 or 0 for each class, I want the final score of that class to appear in the class column and then eliminate the score column.

I will try to present an example:

Name, Age, Grade, Class, Score
John, 16, 9, Biology, 98
John, 16, 9, Algebra, 95
John, 16, 9, French, 96

Applying pd.get_dummies results in the following columns:

Name, Age, Grade, Class_Biology, Class_Algebra, Class_French, Score

I am interested in the following result:

Name, Age, Grade, Class_Biology, Class_Algebra, Class_French
John, 16, 9, 98, 95, 96

Is there a more efficient way than iterating over the rows and manually creating a new row in the dataframe for each student?

1 Answer

You can use set_index + unstack + add_prefix:

df = (df.set_index(['Name','Age','Grade', 'Class'])['Score']
        .unstack()
        .add_prefix('Class_')
        .reset_index()
        .rename_axis(None, axis=1))
print(df)

   Name  Age  Grade  Class_Algebra  Class_Biology  Class_French
0  John   16      9             95             98            96
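
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is built from the example rows in the question; the variable name wide is only illustrative. It assumes each (Name, Age, Grade, Class) combination appears at most once, since unstack raises an error on duplicate index entries.

```python
import pandas as pd

# Sample data copied from the example in the question.
df = pd.DataFrame({
    'Name': ['John', 'John', 'John'],
    'Age': [16, 16, 16],
    'Grade': [9, 9, 9],
    'Class': ['Biology', 'Algebra', 'French'],
    'Score': [98, 95, 96],
})

wide = (df.set_index(['Name', 'Age', 'Grade', 'Class'])['Score']
          .unstack()                   # move the Class level into the columns
          .add_prefix('Class_')        # Algebra -> Class_Algebra, etc.
          .reset_index()               # Name, Age, Grade become columns again
          .rename_axis(None, axis=1))  # drop the leftover 'Class' columns name
print(wide)
```

If some students do not take every class, the missing combinations should show up as NaN in the corresponding Class_ columns.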

7 Comments

Thank you very much! Is there a way to do this without listing the specific columns? I didn't want to make a complex example, so apologies if this was relevant information: there are other columns that don't have to be transformed. For example, participation in extracurriculars. Suppose we have gymnastics, swimming and soccer. With pd.get_dummies they become activity_gymnastics, etc., and they should remain binary columns.
It is possible to select the columns by position only: change df.set_index(['Name','Age','Grade', 'Class'])['Score'] to df.set_index(df.columns[:4].tolist())[df.columns[4]].
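
One possible way to avoid listing the columns at all is to put every column except the two being reshaped into the index. This is only a sketch: the activity_swimming column is a made-up stand-in for the binary columns mentioned above, and it assumes all of those other columns hold the same value on every row of a given student (otherwise a student ends up with more than one output row).

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'John'],
    'Age': [16, 16, 16],
    'Grade': [9, 9, 9],
    'activity_swimming': [1, 1, 1],  # hypothetical binary column, constant per student
    'Class': ['Biology', 'Algebra', 'French'],
    'Score': [98, 95, 96],
})

# Everything except the reshaped pair is treated as an identifier column.
id_cols = [c for c in df.columns if c not in ('Class', 'Score')]

wide = (df.set_index(id_cols + ['Class'])['Score']
          .unstack()
          .add_prefix('Class_')
          .reset_index()
          .rename_axis(None, axis=1))
print(wide)
```

With around 60 columns this keeps the call short, at the cost of building a large index; the binary columns pass through unchanged.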
I see you edited your comment. Can you update your data and add the expected output?
Thank you! I see I was a little unclear in my example (made one up since I can't disclose true details). A better example would be:
Previous comment got messed up. I have around 60 columns, only a few of which I want to convert to this format. So far, pandas pivot_table is doing a good job (data looks similar to yours) but presents a different challenge I hope to sort out:

pd.pivot_table(df, index=['Name'], columns=['Score'], values=['Age','Grade'])

      Score                  Age                    Grade
Class Algebra Biology French Algebra Biology French Algebra Biology French
Name
John       95      98     96      16      16     16       9       9      9
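
If the challenge is that Age and Grade get repeated once per class, one option might be to keep them in the index and pivot only Score. The sketch below uses the sample data from the question and is not the commenter's actual call; aggfunc='first' is an assumption that each student has a single score per class (the default aggfunc would average duplicates instead).

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'John'],
    'Age': [16, 16, 16],
    'Grade': [9, 9, 9],
    'Class': ['Biology', 'Algebra', 'French'],
    'Score': [98, 95, 96],
})

wide = (pd.pivot_table(df,
                       index=['Name', 'Age', 'Grade'],  # identifier columns stay as they are
                       columns='Class',
                       values='Score',                  # a single value column gives flat column labels
                       aggfunc='first')                 # assumes one score per student and class
          .add_prefix('Class_')
          .reset_index()
          .rename_axis(None, axis=1))
print(wide)
```

Passing a single string to values avoids the two-level column header seen above, so no extra flattening step is needed.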