2

I originally started with following data frame:

Dataset is related to users answering multiple questions which have multiple answer choices and the user has the ability to answer more than one answer.

movie_id, user_id, rated_value, question_id, answer_id, genre, user_gender, user_ethnicity
101, 345, 3.5, 1, 1, comedy, male, white
101, 345, 3.5, 1, 2, comedy, male, white
101, 345, 3.5, 2, 1, comedy, male, white
125, 345, 4.5, 1, 4, drama, male, white
101, 233, 4.0, 1, 3, comedy, female, black
101, 233, 4.0, 2, 2, comedy, female, black
125, 233, 3.0, 1, 1, drama, female, black
125, 233, 3.0, 2, 2, drama, female, black
125, 333, 3.0, 1, 1, comedy, male, asian
125, 333, 3.0, 2, 2, comedy, male, asian 

I want to flatten this table by pivoting. I can successfully do without bringing in genre, user_gender, user_ethnicity as follows:

pivoted_df = df_to_pivot.assign(val=1).pivot_table(
    index=['movie_id',
           'user_id',
           'rated_value'],
    columns=['question_id',
             'answer_id'],
    values=['question_id', 'answer_id'],
    fill_value=0)

then to combine question and answer id so columns will reflect as 1_1, 1_2

pivoted_df.columns = pivoted_df.columns.droplevel()
pivoted_df.columns = ['{}_{}'.format(l1, l2).strip() for l1, l2 in pivoted_df.columns.values]
pivoted_df = pivoted_df.reset_index()

movie_id user_id rating_value 1_1 1_2 1_3 1_4...

but when tried to add genre, user_gender, user_ethnicity

pivoted_df = df_to_pivot.assign(val=1).pivot_table(
    index=['movie_id',
           'user_id',
           'rated_value'],
    columns=['question_id',
             'answer_id', 'genre', 'user_gender','user_ethnicity'],
    values=['question_id', 'answer_id', 'genre', 'user_gender','user_ethnicity'],
    fill_value=0)

it doesn't really work.

My goal is to pivot genre, user_gender, user_ethnicity just like the rest so columns would be movie_id user_id rated_value 1_1 1_2 1_3 1_4...comedy, drama...,male, female, black, white, asian

output: 
movie_id, user_id, rated_value , 1_1, 1_2, 1_3, 1_4, comedy, drama, male, female, white, black, asian
101, 345, 3.5, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0

Goal is to get movie_id, user_id pair per row and everything else reflect by 1 and zero.

2
  • 3
    Can you post your desired output dataframe? Even your examples doesn't actually run for me Commented Jun 28, 2018 at 19:53
  • i think i had a one column name incorrectly Commented Jun 28, 2018 at 20:00

1 Answer 1

1

Combine question_id and answer_id in to one column then use pd.get_dummies

df['QandA'] = df['question_id'].astype(str) + '_' + df['answer_id'].astype(str)

pd.get_dummies(df, columns=['QandA','genre','user_gender','user_ethnicity'])

Output:

   movie_id  user_id  rated_value  question_id  answer_id  QandA_1_1  QandA_1_2  QandA_1_3  QandA_1_4  QandA_2_1  QandA_2_2  genre_comedy  genre_drama  user_gender_female  \
0       101      345          3.5            1          1          1          0          0          0          0          0             1            0                   0   
1       101      345          3.5            1          2          0          1          0          0          0          0             1            0                   0   
2       101      345          3.5            2          1          0          0          0          0          1          0             1            0                   0   
3       125      345          4.5            1          4          0          0          0          1          0          0             0            1                   0   
4       101      233          4.0            1          3          0          0          1          0          0          0             1            0                   1   
5       101      233          4.0            2          2          0          0          0          0          0          1             1            0                   1   
6       125      233          3.0            1          1          1          0          0          0          0          0             0            1                   1   
7       125      233          3.0            2          2          0          0          0          0          0          1             0            1                   1   
8       125      333          3.0            1          1          1          0          0          0          0          0             1            0                   0   
9       125      333          3.0            2          2          0          0          0          0          0          1             1            0                   0   

   user_gender_male  user_ethnicity_asian  user_ethnicity_black  user_ethnicity_white  
0                 1                     0                     0                     1  
1                 1                     0                     0                     1  
2                 1                     0                     0                     1  
3                 1                     0                     0                     1  
4                 0                     0                     1                     0  
5                 0                     0                     1                     0  
6                 0                     0                     1                     0  
7                 0                     0                     1                     0  
8                 1                     1                     0                     0  
9                 1                     1                     0                     0  

I think you need pd.get_dummies:

pd.get_dummies(df, columns=['genre','user_gender','user_ethnicity'])

Output:

   movie_id  user_id  rated_value  question_id  answer_id  genre_comedy  genre_drama  user_gender_female  user_gender_male  user_ethnicity_asian  user_ethnicity_black  \
0       101      345          3.5            1          1             1            0                   0                 1                     0                     0   
1       101      345          3.5            1          2             1            0                   0                 1                     0                     0   
2       101      345          3.5            2          1             1            0                   0                 1                     0                     0   
3       125      345          4.5            1          4             0            1                   0                 1                     0                     0   
4       101      233          4.0            1          3             1            0                   1                 0                     0                     1   
5       101      233          4.0            2          2             1            0                   1                 0                     0                     1   
6       125      233          3.0            1          1             0            1                   1                 0                     0                     1   
7       125      233          3.0            2          2             0            1                   1                 0                     0                     1   
8       125      333          3.0            1          1             1            0                   0                 1                     1                     0   
9       125      333          3.0            2          2             1            0                   0                 1                     1                     0   

   user_ethnicity_white  
0                     1  
1                     1  
2                     1  
3                     1  
4                     0  
5                     0  
6                     0  
7                     0  
8                     0  
9                     0  
Sign up to request clarification or add additional context in comments.

3 Comments

i want question_id and answer_id pivot using column value as column headers. May be i do that as two step process. first do question_id and answer_id as i successfully did then use get_dummies this still has multiple rows for user_id and same movie rated by the user.
@Null-Hypothesis 1_1 and 1_2 is QuestionID_AnswerID combination?
Hopefully that helped. I am sure you can handle it from here. Happy coding!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.