2

I have the following DataFrame:

   A  B  C
0  1  3  3
1  1  9  4
2  4  6  3

I would like to create every possible unique combination of these columns without repetition so that I would end up with a dataframe containing the following data: A, B, C, A+B, A+C, B+C, A+B+C. I do not want to have any columns repeated in any combination, e.g. A+A+B+C or A+B+B+C.

I would also like each new column to be labelled with the names of the columns it combines (e.g. the column for A + B should be named 'A_B').

This is the desired DataFrame:

   A  B  C  A_B  A_C  B_C  A_B_C
0  1  3  3    4    4    6      7
1  1  9  4   10    5   13     14
2  4  6  3   10    7    9     13

This is relatively easy with just 3 variables using itertools and I have used the following code to do it:

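    import itertools

    import pandas as pd

    # Sample data from the top of the question
    df = pd.DataFrame({'A': [1, 1, 4], 'B': [3, 9, 6], 'C': [3, 4, 3]})

    # Sum every pair of columns
    combos_2 = pd.DataFrame({
        '{}_{}'.format(a, b): df[a] + df[b]
        for a, b in itertools.combinations(df.columns, 2)
    })

    # Sum every triple of columns
    combos_3 = pd.DataFrame({
        '{}_{}_{}'.format(a, b, c): df[a] + df[b] + df[c]
        for a, b, c in itertools.combinations(df.columns, 3)
    })

    composites = pd.concat([df, combos_2, combos_3], axis=1)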

However, I can't figure out how to extend this code in a Pythonic way to handle a DataFrame with a much larger number of columns. Is there a way of making the code above more Pythonic and extending it to a large number of columns? Or is there a more efficient way of generating the combinations?

2 Answers

3

First build the combinations of column names, then build the DataFrame from them:

from itertools import combinations
cols = df.columns
# Every subset of the columns, from size 0 (the empty list) up to all of them
output = sum([list(map(list, combinations(cols, i))) for i in range(len(cols) + 1)], [])
output
Out[21]: [[], ['A'], ['B'], ['C'], ['A', 'B'], ['A', 'C'], ['B', 'C'], ['A', 'B', 'C']]
# Skip the empty subset and sum each remaining subset row-wise
df1 = pd.DataFrame({'_'.join(x): df[x].sum(axis=1) for x in output if x})
df1
Out[22]: 
   A  B  C  A_B  A_C  B_C  A_B_C
0  1  3  3    4    4    6      7
1  1  9  4   10    5   13     14
2  4  6  3   10    7    9     13
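
If it helps, the same idea can be wrapped in a small function. This is only a sketch: sum_combinations, max_size and sep are names made up here for illustration, not anything pandas provides. With n columns there are 2**n - 1 non-empty combinations, so for a wide frame you will probably want to cap the combination size, which is what the hypothetical max_size argument does.

```python
from itertools import chain, combinations

import pandas as pd


def sum_combinations(df, max_size=None, sep="_"):
    """Return a new frame with one row-wise sum per non-empty column subset.

    Hypothetical helper: max_size caps the subset size (None means no cap).
    """
    cols = list(df.columns)
    top = len(cols) if max_size is None else min(max_size, len(cols))
    subsets = chain.from_iterable(
        combinations(cols, size) for size in range(1, top + 1)
    )
    # df is never modified here, so iterating the subsets lazily is safe
    return pd.DataFrame(
        {
            sep.join(map(str, subset)): df[list(subset)].sum(axis=1)
            for subset in subsets
        }
    )


# Sample data from the question
df = pd.DataFrame({"A": [1, 1, 4], "B": [3, 9, 6], "C": [3, 4, 3]})
full = sum_combinations(df)                    # all 2**3 - 1 = 7 subsets
pairs_at_most = sum_combinations(df, max_size=2)  # singles and pairs only
print(full)
```

Building everything in one dict and constructing the frame once avoids inserting columns one at a time, which tends to matter more as the number of columns grows.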

1

You were pretty close:

from itertools import chain, combinations

# Materialize the generator as a list first, so that the combinations are
# built from the original columns rather than from the altered dataframe.
combs = list(chain.from_iterable(combinations(df.columns, i)
                                 for i in range(2, len(df.columns) + 1)))
for cols in combs:
    df['_'.join(cols)] = df.loc[:, list(cols)].sum(axis=1)

A word of caution: if you join column names with _ while the column names themselves can contain _, you're bound to get column name clashes sooner or later.
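
To illustrate the clash, here is a small sketch with made-up data where one column is already called A_B. The helper combo_names and the choice of "+" as an alternative separator are assumptions for the example, not a recommendation from pandas; any separator that appears in no column name would do.

```python
from itertools import chain, combinations

import pandas as pd

# Made-up frame in which a column name already contains the separator
df = pd.DataFrame({"A": [1, 2], "B": [10, 20], "A_B": [100, 200]})


def combo_names(columns, sep):
    """Hypothetical helper: join every non-empty subset of columns with sep."""
    columns = list(columns)
    subsets = chain.from_iterable(
        combinations(columns, size) for size in range(1, len(columns) + 1)
    )
    return [sep.join(subset) for subset in subsets]


# With "_" the original column 'A_B' and the pair ('A', 'B') get the same name
names = combo_names(df.columns, "_")
duplicates = sorted({name for name in names if names.count(name) > 1})
print(duplicates)

# A separator that occurs in no column name keeps every label distinct
sep = "+"
assert not any(sep in col for col in df.columns)
safe_names = combo_names(df.columns, sep)
print(safe_names)
```

Checking the generated names for duplicates before building the frame is cheap and catches the problem early, since a dict of columns would otherwise silently keep only the last one.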

3 Comments

Thank you for answering my question but this does not give the desired dataframe. Instead it returns a dataframe with 26 columns (it should only have 7 columns as seen in the desired dataframe I showed in my original question). I might not have been clear enough in my question, but I only want each unique combination where the original columns are not repeated (i.e. there shouldn't be any columns with A + A + B + C).
I edited my question to specify that I only want combinations without repetitions of individual columns. Apologies for the confusion!
Oops, sorry about that - I omitted a call to list thinking it wouldn't be needed because of iteration. But I missed that the generator would see the changed dataframe on each iteration. I've edited the answer. An alternative would be to just create a new dataframe instead of changing the existing one.
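
For anyone wondering where the 26 columns came from, the sketch below reproduces it on the question's sample data and then shows the alternative mentioned in the last comment (building a new dataframe instead of altering the existing one). make_df is just a hypothetical helper for the example.

```python
from itertools import chain, combinations

import pandas as pd


def make_df():
    """Hypothetical helper returning the sample data from the question."""
    return pd.DataFrame({"A": [1, 1, 4], "B": [3, 9, 6], "C": [3, 4, 3]})


# Lazy iteration while mutating: combinations(lazy.columns, 3) is only
# evaluated after the three pairwise columns have been added, so it sees
# six columns and yields 20 triples (3 + 3 + 20 = 26 columns in total).
lazy = make_df()
lazy_combs = chain.from_iterable(
    combinations(lazy.columns, i) for i in range(2, len(lazy.columns) + 1)
)
for cols in lazy_combs:
    lazy["_".join(cols)] = lazy[list(cols)].sum(axis=1)
print(lazy.shape)

# Alternative: leave the original frame untouched, collect the new columns
# in a dict, and concatenate once at the end.
base = make_df()
combs = chain.from_iterable(
    combinations(base.columns, i) for i in range(2, len(base.columns) + 1)
)
new_cols = {"_".join(cols): base[list(cols)].sum(axis=1) for cols in combs}
result = pd.concat([base, pd.DataFrame(new_cols)], axis=1)
print(result)
```

Because base is never modified in the second half, it does not matter whether the combinations are realized into a list first.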
