Apologies for the ambiguous title.
I have a dataset of students and I want to run a clustering algorithm on the students.
The dataset has more than one row per student. Each row holds the student's age, grade (9th, 10th, etc.), a single class the student is taking, and the final score in that class.
In pre-processing I apply pd.get_dummies, which gives me one boolean column for each class a student is taking; the score column stays as is.
I want to merge the rows so that each student has exactly one row (I want to cluster over students, not over individual rows). Instead of 1 or 0, each class column should hold the final score for that class, and the score column should then be dropped.
I will try to present an example:
Name, Age, Grade, Class, Score
John, 16, 9, Biology, 98
John, 16, 9, Algebra, 95
John, 16, 9, French, 96
Applying pd.get_dummies results in the following columns:
Name, Age, Grade, Class_Biology, Class_Algebra, Class_French, Score
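For reference, a minimal sketch that reproduces this step on the toy data. The variable names `df` and `dummies` are only illustrative, and I am assuming get_dummies is restricted to the Class column via `columns=["Class"]` (otherwise Name would be encoded as well). Note that pandas puts the untouched columns first, so the actual column order may differ from the listing above.

```python
import pandas as pd

# Toy data matching the example rows above
df = pd.DataFrame({
    "Name": ["John", "John", "John"],
    "Age": [16, 16, 16],
    "Grade": [9, 9, 9],
    "Class": ["Biology", "Algebra", "French"],
    "Score": [98, 95, 96],
})

# Encode only the Class column; Name, Age, Grade and Score are left untouched.
# The result still has one row per (student, class) pair.
dummies = pd.get_dummies(df, columns=["Class"])
print(dummies)
```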
I am interested in the following result:
Name, Age, Grade, Class_Biology, Class_Algebra, Class_French
John, 16, 9, 98, 95, 96
Is there a more efficient way than iterating over the rows and manually creating a new row in the dataframe for each student?
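One idea that might avoid the loop is to skip get_dummies and pivot the original long table directly. This is only a sketch under some assumptions: Name, Age and Grade together identify a student, each student has at most one score per class (`aggfunc="first"` would silently keep the first one otherwise), and the names `df` and `wide` are illustrative.

```python
import pandas as pd

# Toy data matching the example rows above
df = pd.DataFrame({
    "Name": ["John", "John", "John"],
    "Age": [16, 16, 16],
    "Grade": [9, 9, 9],
    "Class": ["Biology", "Algebra", "French"],
    "Score": [98, 95, 96],
})

# Spread the classes into columns, filling each cell with that class's score.
wide = (
    df.pivot_table(
        index=["Name", "Age", "Grade"],  # one output row per student
        columns="Class",                 # one output column per class
        values="Score",
        aggfunc="first",                 # assumes no duplicate (student, class) rows
    )
    .add_prefix("Class_")                # Biology -> Class_Biology, etc.
    .rename_axis(columns=None)           # drop the leftover "Class" axis label
    .reset_index()                       # turn Name, Age, Grade back into columns
)
print(wide)
```

With more than one student, classes a student is not taking would come out as NaN, which would still need handling before clustering.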