1

I have two dataframes. Each row in Dataframe A is a package of products, and Dataframe B consists of product ids and their sellers' ids.

Dataframe A:

package_name | product_1 | product_2 | product_3 | product_4
package a    |     12    |     15    |    NaN    |    NaN
package b    |     17    |     16    |    14     |    NaN
package c    |     12    |     11    |    17     |    19

Dataframe B:

product_id | seller_id
12         | seller1
15         | seller1
12         | seller2
15         | seller2
17         | seller3
16         | seller3
14         | seller3

(Each product can have multiple sellers, and each seller has multiple products.)

I want to know which sellers have products of packages (based on Dataframe A). This is what expected:

Dataframe C:

package_name | product_1 | product_2 | product_3 | product_4 | seller_id
package a    |     12    |     15    |    NaN    |    NaN    | seller1
package a    |     12    |     15    |    NaN    |    NaN    | seller2
package b    |     17    |     16    |    14     |    NaN    | seller3

Both seller1 and seller2 have "all" products of package a, and seller3 has "all" products of package b.

How can I achieve Dataframe C?

2
  • Are numbers in product unique? Or possible repating like for first row package a | 12 | 15 | 12 | NaN ? Commented Jun 16, 2019 at 7:27
  • @jezrael the numbers in each row of Dataframe A is unique. Commented Jun 16, 2019 at 8:15

1 Answer 1

2

Idea is use DataFrame.merge with right join by helper DataFrame created by match subset of sets:

print (B)
   product_id seller_id
0          12   seller1
1          15   seller1
2          12   seller2
3          15   seller2
4          17   seller3
5          16   seller3
6          14   seller3
7          12   seller4
8          15   seller4
9          14   seller4

A = A.set_index('package_name') 
f = lambda x: set([int(y) for y in x if y == y])
a = A.apply(f, axis=1).to_dict()
#print (a)

b = B.groupby('seller_id')['product_id'].apply(set).to_dict()
#print (b)

c = [(k, k1) for k, v in a.items() for k1,v1 in b.items() if v.issubset(v1)]
#print (c)

C1 = pd.DataFrame(c, columns=['package_name','seller_id'])
print (C1)
  package_name seller_id
0    package a   seller1
1    package a   seller2
2    package a   seller4
3    package b   seller3

C = A.merge(C1, on='package_name', how='right')
print (C)
  package_name  product_1  product_2  product_3  product_4 seller_id
0    package a         12         15        NaN        NaN   seller1
1    package a         12         15        NaN        NaN   seller2
2    package a         12         15        NaN        NaN   seller4
3    package b         17         16       14.0        NaN   seller3
Sign up to request clarification or add additional context in comments.

3 Comments

Cool idea. But what if datasets were big? And how can we know that a seller with 12_14_15 products, has package 12_15 ?
@SaeedEsmaili - Edited answer. What is number of rows in real data in both DataFrames?
There are almost 100k rows in each dataframe.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.