I have the following data frame:

import pandas as pd

df = pd.DataFrame({
    'Code Similarity & Clone Detection': {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 1.0},
    'Code Navigation & Understanding': {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: 0.0},
    'Security': {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
    'ANN': {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
    'CNN': {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
    'RNN': {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0},
    'LSTM': {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 1.0, 7: 0.0, 8: 1.0, 9: 1.0},
})

I want to convert this data frame into a new one with three columns. The first column, "SE", holds the names of the first 4 columns of df. The second column, "DL", holds the names of the remaining columns of df. The third column, "count", counts how often each pair of SE and DL values occurs together. The new data frame should have the following shape:

[Image: expected output]

  • What do the first 3 rows of the expected output look like for the sample data? Commented Apr 7, 2021 at 12:18
  • @jezrael I just added the expected output Commented Apr 7, 2021 at 12:26

1 Answer

Use:

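#create a MultiIndex with all SE/DL combinations
mux = pd.MultiIndex.from_product([df.columns[:4], df.columns[4:]])

#repeat the columns by first and second level, then transpose
df1 = df.reindex(mux, axis=1, level=0).T
df2 = df.reindex(mux, axis=1, level=1).T

#add both frames, sum across rows, then aggregate per MultiIndex pair
#(groupby(level=...) replaces sum(level=...), which was removed in pandas 2.0)
out = (df1.add(df2)
          .sum(axis=1)
          .groupby(level=[0,1], sort=False).sum()
          .astype(int)
          .rename_axis(['SE','DL'])
          .reset_index(name='count'))

#the output below appears to come from the asker's full data,
#which has more DL columns than the sample in the question
print (out.head(10))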
                                  SE                   DL  count
0  Code Similarity & Clone Detection                  ANN      5
1  Code Similarity & Clone Detection                  CNN      5
2  Code Similarity & Clone Detection                  RNN      3
3  Code Similarity & Clone Detection                 LSTM      7
4  Code Similarity & Clone Detection  attention mechanism      9
5  Code Similarity & Clone Detection          Autoencoder      7
6  Code Similarity & Clone Detection                  GNN      6
7  Code Similarity & Clone Detection             Other_DL      4
8    Code Navigation & Understanding                  ANN      8
9    Code Navigation & Understanding                  CNN      8

EDIT: If you need to count only the rows where both columns are 1, use:

#in the real data change 3 to 4 to select the first 4 columns
mux = pd.MultiIndex.from_product([df.columns[:3], df.columns[3:]])

#repeat the columns by first and second level, transpose and stack
s1 = df.reindex(mux, axis=1, level=0).T.stack()
s2 = df.reindex(mux, axis=1, level=1).T.stack()

#keep only the 1 values, compare the aligned pairs, count matches per (SE, DL)
#(groupby(level=...) replaces sum(level=...), which was removed in pandas 2.0)
out = (s1[s1 == 1].eq(s2[s2 == 1])
                  .groupby(level=[0,1]).sum()
                  .rename_axis(['SE','DL'])
                  .sort_index(level=1)
                  .reset_index(name='count'))
print (out)
                                   SE    DL  count
0     Code Navigation & Understanding   ANN      2
1   Code Similarity & Clone Detection   ANN      0
2                            Security   ANN      2
3     Code Navigation & Understanding   CNN      0
4   Code Similarity & Clone Detection   CNN      0
5                            Security   CNN      3
6     Code Navigation & Understanding  LSTM      2
7   Code Similarity & Clone Detection  LSTM      1
8                            Security  LSTM      2
9     Code Navigation & Understanding   RNN      0
10  Code Similarity & Clone Detection   RNN      0
11                           Security   RNN      1
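
As a cross-check of the counts above, here is a minimal sketch of a different technique, not the answer's method: because the columns hold only 0 and 1, the matrix product of the transposed SE columns with the DL columns gives the number of rows where both are 1. The names se_cols, dl_cols, co and counts are made up for this illustration, and the sample values are written as integer lists rather than the question's float dictionaries.

```python
import pandas as pd

# Same 0/1 values as the question's sample, one list per column
df = pd.DataFrame({
    'Code Similarity & Clone Detection': [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    'Code Navigation & Understanding':   [0, 0, 1, 0, 0, 1, 1, 1, 1, 0],
    'Security':                          [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    'ANN':                               [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
    'CNN':                               [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    'RNN':                               [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    'LSTM':                              [0, 1, 0, 0, 1, 0, 1, 0, 1, 1],
})

# Assumption: the first 3 columns are SE (use 4 for the real data)
se_cols = df.columns[:3]
dl_cols = df.columns[3:]

# (SE x rows) dot (rows x DL): each cell counts rows where both columns are 1
co = df[se_cols].T.dot(df[dl_cols])

# Reshape the SE-by-DL matrix into long form with SE, DL and count columns
counts = (co.astype(int)
            .rename_axis(index='SE')
            .reset_index()
            .melt(id_vars='SE', var_name='DL', value_name='count'))
print(counts)
```

On the sample data this gives the same counts as the table above (for example, Security with CNN is 3), although the rows come out in column order rather than sorted by name.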
12 Comments

Thanks for your answer, but this is not what I'm looking for. I will edit the question to add the expected final shape.
@Peter - Maybe you can use fewer columns and fewer rows in the sample data and add the expected output for it.
Thanks for your effort. The shape is exactly what I want, but the numbers are not logical: the data frame has 10 samples, so the sum of the counts might be 10. Ultimately, I want to count the number of times that, for example, Code Similarity & Clone Detection and ANN occur together, meaning both have the value 1 simultaneously.
@Peter - Is it possible to create small sample data and add the expected output?
@Peter - I think here 5 rows and 5 columns should be perfect. Thank you.