
I'm trying to take one dataframe and create another that contains every pairwise combination of its columns, where each new column holds the difference between the corresponding values. For example, on 11-apr, column AB should be (B - A) = 0, and so on.

E.g., starting with

        Dt              A           B           C          D
        11-apr          1           1           1          1
        10-apr          2           3           1          2

How do I get a new frame that looks like this:

desired result

I have come across the post below, but have not been able to adapt it to work for columns.

Aggregate all dataframe row pair combinations using pandas

1
  • Any thoughts on how to do this for 3 columns, so let's say i want to do 2*B - A - C in the above example? Commented Jul 9, 2018 at 12:23

4 Answers

18

You can use:

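from itertools import combinations
import pandas as pd

df = df.set_index('Dt')

# every unordered pair of columns, e.g. ('A', 'B')
cc = list(combinations(df.columns, 2))
# second column minus first, so AB = B - A
df = pd.concat([df[c[1]].sub(df[c[0]]) for c in cc], axis=1, keys=cc)
# flatten the tuple column labels: ('A', 'B') -> 'AB'
df.columns = df.columns.map(''.join)
print(df)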
        AB  AC  AD  BC  BD  CD
Dt                            
11-apr   0   0   0   0   0   0
10-apr   1  -1   0  -2  -1   1

7 Comments

Thanks for this, works perfectly. Any thoughts on how to modify this for 3 combinations, e.g ABC, ABD, BCD etc and then rather than (B-A) having 2* B - C - A.
do you think cc = list(combinations(df.columns,3)) ?
and then df.columns = df.columns.map('-'.join) ?
I've got the list working no problem, but on pd.concat([df[c[2]].sub(df[c[1]]) I'm struggling to work in a third reference.
How can I do the same with all more variables (more combinations) AND add the numbers (or strings alternatively) instead of subtract them? e.g. A B C D E AB AC AD ..... ABCDE ? @jezrael
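For the follow-up questions in these comments, here is a rough sketch of one way to generalise the idea. It is not from the answer itself: `combine_columns` is a hypothetical helper name, it assumes the column labels are strings, and it assumes "2*B - C - A" means twice the middle column of each triple minus the other two.

```python
from itertools import combinations

import pandas as pd


def combine_columns(df, r, func, sep=""):
    """Apply func to every r-column combination of df (hypothetical helper)."""
    return pd.DataFrame(
        {
            # label is the joined column names; assumes string labels
            sep.join(cols): func(*(df[c] for c in cols))
            for cols in combinations(df.columns, r)
        },
        index=df.index,
    )


# The frame from the question, with Dt already set as the index
df = pd.DataFrame(
    {"A": [1, 2], "B": [1, 3], "C": [1, 1], "D": [1, 2]},
    index=pd.Index(["11-apr", "10-apr"], name="Dt"),
)

# Triples: 2 * (middle column) - first - last, e.g. ABC = 2*B - A - C
triples = combine_columns(df, 3, lambda a, b, c: 2 * b - a - c)
print(triples)

# Sums over every combination size, from pairs up to all columns at once
sums = pd.concat(
    [combine_columns(df, r, lambda *s: sum(s)) for r in range(2, len(df.columns) + 1)],
    axis=1,
)
print(sums)
```

Passing the operation in as a function means the same loop can serve differences, sums, or string concatenation, at the cost of being slower than a vectorised NumPy version on wide frames.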
9

Make sure your index is Dt

df = df.set_index('Dt')

Using NumPy's np.tril_indices and slicing. See below for an explanation of np.tril_indices.

import numpy as np
import pandas as pd

v = df.values

i, j = np.tril_indices(len(df.columns), -1)

We can create a pd.MultiIndex for the columns. This makes it more generalizable for column names that are longer than one character.

pd.DataFrame(
    v[:, i] - v[:, j],
    df.index,
    [df.columns[j], df.columns[i]]
)

        A     B  A  B  C
        B  C  C  D  D  D
Dt                      
11-apr  0  0  0  0  0  0
10-apr  1 -1 -2  0 -1  1

But we can also do

pd.DataFrame(
    v[:, i] - v[:, j],
    df.index,
    df.columns[j] + df.columns[i]
)

        AB  AC  BC  AD  BD  CD
Dt                            
11-apr   0   0   0   0   0   0
10-apr   1  -1  -2   0  -1   1
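As a side note, if you want the columns in the same order as itertools.combinations produces (AB, AC, AD, BC, BD, CD), one option is the upper-triangle counterpart, np.triu_indices. This is a sketch rather than part of the approach above; it rebuilds the question's frame so it runs on its own, and it assumes string column labels.

```python
import numpy as np
import pandas as pd

# The frame from the question, with Dt already set as the index
df = pd.DataFrame(
    {"A": [1, 2], "B": [1, 3], "C": [1, 1], "D": [1, 2]},
    index=pd.Index(["11-apr", "10-apr"], name="Dt"),
)

v = df.to_numpy()

# Upper-triangle positions; k=1 skips the diagonal, so i < j for every pair
i, j = np.triu_indices(len(df.columns), k=1)

result = pd.DataFrame(
    v[:, j] - v[:, i],  # later column minus earlier column, so AB = B - A
    index=df.index,
    columns=[a + b for a, b in zip(df.columns[i], df.columns[j])],
)
print(result)
```

np.triu_indices walks the matrix row by row, which is why the pairs come out as (0, 1), (0, 2), (0, 3), (1, 2), and so on.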

np.tril_indices explained

This is a NumPy function that returns two arrays that, when used together, give the positions of the lower triangle of a square matrix. This is handy when working with all combinations of things, because the lower triangle (with the diagonal excluded by the -1 offset) pairs each position on one axis with each earlier position on the other exactly once.

Consider the dataframe d for illustration

d = pd.DataFrame(np.array(list('abcdefghijklmnopqrstuvwxy')).reshape(-1, 5))
d

   0  1  2  3  4
0  a  b  c  d  e
1  f  g  h  i  j
2  k  l  m  n  o
3  p  q  r  s  t
4  u  v  w  x  y

The triangle indices, when viewed as coordinate pairs, look like this

i, j = np.tril_indices(5, -1)
list(zip(i, j))

[(1, 0),
 (2, 0),
 (2, 1),
 (3, 0),
 (3, 1),
 (3, 2),
 (4, 0),
 (4, 1),
 (4, 2),
 (4, 3)]

I can manipulate d's values with i and j

d.values[i, j] = 'z'
d

   0  1  2  3  4
0  a  b  c  d  e
1  z  g  h  i  j
2  z  z  m  n  o
3  z  z  z  s  t
4  z  z  z  z  y

And you can see it targeted just that lower triangle

Naive time test (image not available)


1

itertools.combinations will help you:

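import itertools
import pandas as pd

# Assumes Dt is already the index; b - a so that AB = B - A, as the question asks
pd.DataFrame({'{}{}'.format(a, b): df[b] - df[a] for a, b in itertools.combinations(df.columns, 2)})

Which results in:

        AB  AC  AD  BC  BD  CD
Dt                            
11-apr   0   0   0   0   0   0
10-apr   1  -1   0  -2  -1   1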

1 Comment

This one works well if you have additional conditions such as df = pd.DataFrame({'{}{}'.format(a, b): df[a] & df[b] for a, b in itertools.combinations(df.columns, 2) if (df[a] & df[b]).any() }). The column labels won't get messed up like the previous answers.
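To make the comment above concrete, here is a small sketch of the conditional variant. The `flags` frame and its values are made up for illustration; the idea is only that a pair is kept when the two boolean columns are both True on at least one row.

```python
from itertools import combinations

import pandas as pd

# Hypothetical boolean data, not taken from the question
flags = pd.DataFrame(
    {
        "A": [True, False, True],
        "B": [True, True, False],
        "C": [False, False, True],
    }
)

pairs = {}
for a, b in combinations(flags.columns, 2):
    both = flags[a] & flags[b]  # element-wise AND of the two columns
    if both.any():  # keep the pair only if some row has both set
        pairs["{}{}".format(a, b)] = both

both_df = pd.DataFrame(pairs)
print(both_df)
```

Here B and C are never True on the same row, so no BC column is created.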
1

The itertools module should help you create the required combinations/permutations.

from itertools import combinations
import pandas as pd

# Create a new DataFrame that shares the original index
new_df = pd.DataFrame(index=df.index)

# List of columns
columns = df.columns

# Create all combinations of length 2, e.g. AB, BC, etc.
for combination in combinations(columns, 2):
    combination_string = "".join(combination)
    # second column minus first, so AB = B - A
    new_df[combination_string] = df[combination[1]] - df[combination[0]]

print(new_df)


         AB  AC  AD  BC  BD  CD
Dt                            
11-apr   0   0   0   0   0   0
10-apr   1  -1   0  -2  -1   1

1 Comment

Although slower than Languitar's answer from above, this is much more readable. Thank you @Nipun for your excellent answer.
