Reshaping a pandas dataframe from binary columns into statistical counts

Question

I have the following data frame:

import pandas as pd

df = pd.DataFrame({
    'Code Similarity & Clone Detection': {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 1.0},
    'Code Navigation & Understanding': {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: 0.0},
    'Security': {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
    'ANN': {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
    'CNN': {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
    'RNN': {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
    'LSTM': {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 1.0, 7: 0.0, 8: 1.0, 9: 1.0},
})

I want to convert this data frame into a new one with three columns. The first column, "SE", holds the names of the first 4 columns of df. The second column, 'DL', holds the names of the remaining columns of df. The third column, 'count', counts how often each SE and DL pair occurs together. The following figure shows the required shape:

What do the first 3 rows of the expected output look like for the sample data?

– jezrael, Commented Apr 7, 2021 at 12:18
@jezrael I just added the expected output

– Peter, Commented Apr 7, 2021 at 12:26

jezrael · Accepted Answer · 2021-04-08 06:33:32Z

1

Use:

#create MultiIndex by all combinations
mux = pd.MultiIndex.from_product([df.columns[:4], df.columns[4:]])

#repeat by first and second level with transpose
df1 = df.reindex(mux, axis=1, level=0).T
df2 = df.reindex(mux, axis=1, level=1).T

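#sum together per columns, per MultiIndex
#(Series.sum(level=...) was removed in pandas 2.0; groupby(level=...).sum() is the equivalent)
df = (df1.add(df2)
         .sum(axis=1)
         .groupby(level=[0,1], sort=False).sum()
         .astype(int)
         .rename_axis(['SE','DL'])
         .reset_index(name='count'))
print(df.head(10))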
                                  SE                   DL  count
0  Code Similarity & Clone Detection                  ANN      5
1  Code Similarity & Clone Detection                  CNN      5
2  Code Similarity & Clone Detection                  RNN      3
3  Code Similarity & Clone Detection                 LSTM      7
4  Code Similarity & Clone Detection  attention mechanism      9
5  Code Similarity & Clone Detection          Autoencoder      7
6  Code Similarity & Clone Detection                  GNN      6
7  Code Similarity & Clone Detection             Other_DL      4
8    Code Navigation & Understanding                  ANN      8
9    Code Navigation & Understanding                  CNN      8
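
A quick check on two columns of the question's sample data suggests why these counts can look too large. As I read the snippet, adding the two repeated frames and then summing gives each pair the SE column total plus the DL column total, not the number of rows where both are 1. The variable names below are illustrative only.

```python
import pandas as pd

# Two columns copied from the sample data in the question
security = pd.Series([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
ann = pd.Series([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

# What add-then-sum computes for the (Security, ANN) pair:
# the total of one column plus the total of the other
sum_of_totals = int(security.sum() + ann.sum())

# What a co-occurrence count would be: rows where both columns are 1
both_ones = int(((security == 1) & (ann == 1)).sum())

print(sum_of_totals, both_ones)  # 7 2
```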

EDIT: If you need to count the rows where both columns are 1, use:

#in the real data, change 3 to 4 to select the first 4 columns
mux = pd.MultiIndex.from_product([df.columns[:3], df.columns[3:]])

#repeat by first and second level with transpose
s1 = df.reindex(mux, axis=1, level=0).T.stack()
s2 = df.reindex(mux, axis=1, level=1).T.stack()

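#Series.sum(level=...) was removed in pandas 2.0; groupby(level=...).sum() is the equivalent
df = (s1[s1 == 1].eq(s2[s2 == 1])
                 .groupby(level=[0,1], sort=False).sum()
                 .rename_axis(['SE','DL'])
                 .sort_index(level=1)
                 .reset_index(name='count'))
print(df)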
                                   SE    DL  count
0     Code Navigation & Understanding   ANN      2
1   Code Similarity & Clone Detection   ANN      0
2                            Security   ANN      2
3     Code Navigation & Understanding   CNN      0
4   Code Similarity & Clone Detection   CNN      0
5                            Security   CNN      3
6     Code Navigation & Understanding  LSTM      2
7   Code Similarity & Clone Detection  LSTM      1
8                            Security  LSTM      2
9     Code Navigation & Understanding   RNN      0
10  Code Similarity & Clone Detection   RNN      0
11                           Security   RNN      1
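
As an alternative sketch (not the method used above), the same co-occurrence counts can be obtained with a matrix product: for 0/1 columns, the dot product of an SE column with a DL column is the number of rows where both are 1. The names se_cols, dl_cols, co and out are my own, and the split at position 3 assumes the sample data from the question.

```python
import pandas as pd

# Sample data from the question, written as lists (row index 0..9)
df = pd.DataFrame({
    'Code Similarity & Clone Detection': [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    'Code Navigation & Understanding':   [0, 0, 1, 0, 0, 1, 1, 1, 1, 0],
    'Security':                          [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    'ANN':                               [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
    'CNN':                               [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    'RNN':                               [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    'LSTM':                              [0, 1, 0, 0, 1, 0, 1, 0, 1, 1],
}).astype(float)

# Assumption: the first 3 columns are the SE group, the rest the DL group
se_cols = list(df.columns[:3])
dl_cols = list(df.columns[3:])

# (SE x rows) dot (rows x DL) -> SE x DL matrix of co-occurrence counts
co = df[se_cols].T.dot(df[dl_cols])

# Reshape the matrix to long format with columns SE, DL, count
out = (co.rename_axis('SE')
         .reset_index()
         .melt(id_vars='SE', var_name='DL', value_name='count')
         .astype({'count': int}))
print(out)
```

On the sample data this gives the same counts as the table above, for example 3 for Security with CNN and 1 for Code Similarity & Clone Detection with LSTM, although the rows come out in column order rather than sorted alphabetically.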

edited Apr 8, 2021 at 6:33

answered Apr 7, 2021 at 12:16

jezrael


12 Comments

Peter Over a year ago

Thanks for your answer, but this is not what I'm looking for. I will edit the question to add the expected final shape.

jezrael Over a year ago

@Peter - Maybe you can use fewer columns and fewer rows in the sample data and add the expected output for that sample data.

Peter Over a year ago

Thanks for your effort. The shape is exactly what I want, but the numbers are not logical. The data frame has 10 samples, so I would expect the sum of the counts to be 10. Ultimately, I want to count the number of times that Code Similarity & Clone Detection and ANN come together, meaning both have the value 1 simultaneously.

jezrael Over a year ago

@Peter - Is it possible to create small sample data and add the expected output?

jezrael Over a year ago

@Peter - I think here 5 rows and 5 columns should be perfect. Thank you.
