4

I have two dataframes, each row in dataframe A has a list of indices corresponding to entries in dataframe B and a set of other values. I want to join the two dataframes in a way so that each of the entries in B has the other values in A where the index of the entry in B is in the list of indices in the entry in A.

So far, I have found a way of extracting the rows in B for the list of indices in each row in A but only row-by-row from this answer but then I am not sure where to go from here? Also not sure if there's a better way of doing it with Pandas dynamically as the size of the list of indices can change.

import pandas as pd
import numpy as np

# Inputs
A = pd.DataFrame.from_dict({
    "indices": [[0,1],[2,3],[4,5]],
    "a1": ["a","b","c"],
    "a2": [100,200,300]
})

print(A)
>>    indices a1   a2
>> 0  [0, 1]  a  100
>> 1  [2, 3]  b  200
>> 2  [4, 5]  c  300

B = pd.DataFrame.from_dict({
    "b": [10,20,30,40,50,60]
})

print(B)
>>     b
>> 0  10
>> 1  20
>> 2  30
>> 3  40
>> 4  50
>> 5  60

# This is the desired output
out = pd.DataFrame.from_dict({
    "b": [10,20,30,40,50,60],
    "a1": ["a","a", "b", "b", "c", "c"],
    "a2": [100,100,200,200,300,300]
})

print(out)
>>      b a1   a2
>> 0  10  a  100
>> 1  20  a  100
>> 2  30  b  200
>> 3  40  b  200
>> 4  50  c  300
>> 5  60  c  300

5 Answers 5

5

If you have pandas >=0.25, you can use explode:

C = A.explode('indices')

This gives:

  indices a1   a2
0       0  a  100
0       1  a  100
1       2  b  200
1       3  b  200
2       4  c  300
2       5  c  300

Then do:

output = pd.merge(B, C, left_index = True, right_on = 'indices')
output.index = output.indices.values    
output.drop('indices', axis = 1, inplace = True)

Final Output:

    b a1   a2
0  10  a  100
1  20  a  100
2  30  b  200
3  40  b  200
4  50  c  300
5  60  c  300
Sign up to request clarification or add additional context in comments.

8 Comments

imo this is the best answer, for people using >=0.25. explode is how you would perform this operation in other frameworks (like spark, for instance). I think in general it's better to use drop over del to remove a column though, correct?
Thanks Jeff! Yes, you are right. Changed it to drop :)
I'm going to accept this one because of @Jeff 's comment, it's similar to how other frameworks handle such an operation.
@AnkurSinha just realised, the indices are wrong, they should be [0,1,2,3,4,5] not [0,0,1,1,2,2]. Looks like the output dataframe is using C's indices not B's
Or you can also use: output.index = output.indices.values before you drop the output.indices column.
|
3

using pd.merge

df2 = pd.DataFrame(A.set_index(['a1','a2']).indices)

df = pd.DataFrame(df2.indices.values.tolist(), index=a.index).stack().reset_index().drop('level_2', axis=1).set_index(0)

pd.merge(B,df,left_index=True, right_index=True)

Output

    b a1   a2
0  10  a  100
1  20  a  100
2  30  b  200
3  40  b  200
4  50  c  300
5  60  c  300

Comments

2

Here you go:

helper = A.indices.apply(pd.Series).stack().reset_index(level=1, drop=True)
A = A.reindex(helper.index).drop(columns=['indices'])
A['indices'] = helper
B = B.merge(A, left_index=True, right_on='indices').drop(columns=['indices']).reset_index(drop=True)

Result:

    b   a1  a2
0   10  a   100
1   20  a   100
2   30  b   200
3   40  b   200
4   50  c   300
5   60  c   300

Comments

2

You can also use melt instead of stack, but it's more complicated as you must drop columns you don't need:

import pandas as pd
import numpy as np

# Inputs
A = pd.DataFrame.from_dict({
    "indices": [[0,1],[2,3],[4,5]],
    "a1": ["a","b","c"],
    "a2": [100,200,300]
})

B = pd.DataFrame.from_dict({
    "b": [10,20,30,40,50,60]
})

AA = pd.concat([A.indices.apply(pd.Series), A], axis=1)
AA.drop(['indices'], axis=1, inplace=True)
print(AA)

   0  1 a1   a2
0  0  1  a  100
1  2  3  b  200
2  4  5  c  300

AA = AA.melt(id_vars=['a1', 'a2'], value_name='val').drop(['variable'], axis=1)
print(AA)

  a1   a2  val
0  a  100    0
1  b  200    2
2  c  300    4
3  a  100    1
4  b  200    3
5  c  300    5

pd.merge(AA.set_index(['val']), B, left_index=True, right_index=True)

Out[8]: 
  a1   a2   b
0  a  100  10
2  b  200  30
4  c  300  50
1  a  100  20
3  b  200  40
5  c  300  60

Comments

1

This solution will handle indices of varying lengths.

A = pd.DataFrame.from_dict({
    "indices": [[0,1],[2,3],[4,5]],
    "a1": ["a","b","c"],
    "a2": [100,200,300]
})
A = A.indices.apply(pd.Series) \
    .merge(A, left_index = True, right_index = True) \
    .drop(["indices"], axis = 1)\
    .melt(id_vars = ['a1', 'a2'], value_name = "index")\
    .drop("variable", axis = 1)\
    .dropna()
A = A.set_index('index')
B = pd.DataFrame.from_dict({
    "b": [10,20,30,40,50,60]
})
B
B.merge(A,left_index=True,right_index=True)

Final Output:

    b   a1  a2
0   10  a   100
1   20  a   100
2   30  b   200
3   40  b   200
4   50  c   300
5   60  c   300

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.