
I would like to join two tables in Pandas.

df_types contains the size range for each product type (5000 rows):

| Table: TYPES |          |      |
|--------------|----------|------|
| size_min     | size_max | type |
| 1            | 5        | S    |
| 6            | 16       | M    |
| 16           | 24       | L    |
| 25           | 50       | XL   |

Dataframe code in Pandas:

import pandas as pd

df_types = pd.DataFrame([[1, 5, 'S'],
                         [6, 16, 'M'],
                         [16, 24, 'L'],
                         [25, 50, 'XL']],
                        columns=['size_min', 'size_max', 'type'])

df_products contains product ids and sizes (12000 rows):

| Table: Products |      |
|-----------------|------|
| id_product      | size |
| A               | 6    |
| B               | 25   |
| C               | 7    |
| D               | 2    |
| F               | 45   |
| E               | 10   |
| G               | 16   |

Dataframe code in Pandas:

df_products = pd.DataFrame([['A', 6],
                            ['B', 25],
                            ['C', 7],
                            ['D', 2],
                            ['F', 45],
                            ['E', 10],
                            ['G', 16]],
                           columns=['id_product', 'size'])

I'd like to make this SQL join in Pandas:

SELECT  df_products.*,
        df_types.type
FROM    df_products     LEFT JOIN df_types
                        ON  df_products.size >= df_types.size_min
                            AND df_products.size <= df_types.size_max

RESULT:

| id_product | size | type |
|------------|------|------|
| A          | 6    | M    |
| B          | 25   | XL   |
| C          | 7    | M    |
| D          | 2    | S    |
| F          | 45   | XL   |
| E          | 10   | M    |
| G          | 16   | M    |

thank you! ;-)

  • How large are your tables? Amount of rows Commented Feb 16, 2020 at 12:15
  • df_types 5000 rows and df_products 12000 rows Commented Feb 16, 2020 at 12:19
  • Another option is to use DuckDB for SQL queries, or use a real SQL database if you're already working with one. Commented Mar 1, 2024 at 19:49

2 Answers


Method 1: cross join with pd.merge

Although this is a common operation in SQL, there's no direct one-step method for it in pandas.

One of the solutions here would be to do a cross join to match all rows and then use DataFrame.query to filter the rows where size is between size_min and size_max.

But this results in an explosion of rows first, so in your case 12000 * 5000 = 60 000 000 rows.

dfn = (
    df_products.assign(key=1)
      .merge(df_types.assign(key=1), on='key')      # cartesian product via a constant key
      .query('size >= size_min & size < size_max')  # note: strict < on size_max
      .drop(columns='key')
)

   id_product  size  size_min  size_max type
1           A     6         6        16    M
7           B    25        25        50   XL
9           C     7         6        16    M
12          D     2         1         5    S
19          F    45        25        50   XL
21          E    10         6        16    M
26          G    16        16        24    L

Method 2: pd.IntervalIndex:

If you don't have overlapping ranges (for example, if we change the size_max of type M in df_types from 16 to 15, so the intervals become [1, 5], [6, 15], [16, 24], [25, 50]), we can use this method. This will not result in an explosion of rows.

idx = pd.IntervalIndex.from_arrays(df_types['size_min'], df_types['size_max'], closed='both')
# get_indexer maps each size to the position of its (unique) containing interval
event = df_types.loc[idx.get_indexer(df_products['size']), 'type'].to_numpy()

df_products['type'] = event

  id_product  size type
0          A     6    M
1          B    25   XL
2          C     7    M
3          D     2    S
4          F    45   XL
5          E    10    M
6          G    16    L
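Under the same non-overlap assumption, `pd.cut` gets there even more directly: encode the boundaries as bin edges and the types as labels. A sketch, assuming M is shrunk to 6–15 as above so the bins don't overlap:

```python
import pandas as pd

df_products = pd.DataFrame([['A', 6], ['B', 25], ['C', 7], ['D', 2],
                            ['F', 45], ['E', 10], ['G', 16]],
                           columns=['id_product', 'size'])

# bin edges for the non-overlapping ranges S=1-5, M=6-15, L=16-24, XL=25-50;
# each pd.cut interval is (left, right], hence the edges 0, 5, 15, 24, 50
bins = [0, 5, 15, 24, 50]
labels = ['S', 'M', 'L', 'XL']
df_products['type'] = pd.cut(df_products['size'], bins=bins, labels=labels)
```

The result column is categorical, which is convenient for later groupbys; the labels line up with the IntervalIndex output above.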

This is way longer than Erfan's solution; I am just offering this because I believe it could help avoid the increase in number of rows resulting from a merge.

What this does is build cond1 and cond2 to match the WHERE clause in the SQL query. The next step zips both lists and finds the index of the first (True, True) element; that index is the row position in df_types. Finally, concatenate all the rows extracted from df_types based on those indices, and concat the result back onto df_products.

There should be a better way than this; I do believe, however, that SQL does this way better.

cond1 = df_products['size'].apply(lambda x: [x>=i for i in [*df_types.size_min.array]])

cond2 = df_products['size'].apply(lambda x: [x<i for i in [*df_types.size_max.array]])

t = [list(zip(i,j)).index((True,True))
     for i,j in zip(cond1.array,cond2.array)]

result = (pd.concat([df_types.iloc[[i]]
                     for i in t])
          .filter(['type'])
          .reset_index(drop=True))

outcome = (pd.concat([df_products,result],
           axis=1,
           ignore_index=True,
           join='outer'))

outcome.columns = ['id_product', 'size', 'type']

  id_product  size type
0          A     6    M
1          B    25   XL
2          C     7    M
3          D     2    S
4          F    45   XL
5          E    10    M
6          G    16    L
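The pairwise checks above can also be vectorized with NumPy broadcasting instead of Python-level loops. A sketch using the inclusive bounds from the original SQL, so here 'G' matches both M and L:

```python
import numpy as np
import pandas as pd

df_types = pd.DataFrame([[1, 5, 'S'], [6, 16, 'M'], [16, 24, 'L'], [25, 50, 'XL']],
                        columns=['size_min', 'size_max', 'type'])
df_products = pd.DataFrame([['A', 6], ['B', 25], ['C', 7], ['D', 2],
                            ['F', 45], ['E', 10], ['G', 16]],
                           columns=['id_product', 'size'])

# (n_products, 1) against (n_types,) broadcasts to an (n_products, n_types)
# boolean matrix; nonzero() returns the matching (product, type) index pairs
size = df_products['size'].to_numpy()[:, None]
mask = (size >= df_types['size_min'].to_numpy()) & (size <= df_types['size_max'].to_numpy())
i, j = np.nonzero(mask)

outcome = pd.DataFrame({
    'id_product': df_products['id_product'].to_numpy()[i],
    'size': df_products['size'].to_numpy()[i],
    'type': df_types['type'].to_numpy()[j],
})
```

The boolean matrix is still O(rows x types) in memory, but it avoids the per-element Python overhead of the apply/zip approach.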

Update

Time passes, and hopefully we get better. I gave this another shot, moving the computation into vanilla Python before taking the final result back to Pandas:

from itertools import product
test = [(id_product,first,last)
        for (id_product,first), (second, third,last)
        in product(zip(df_products.id_product,df_products['size']),
                   df_types.to_numpy()
                  )
        if second <= first <= third
       ]

test

[('A', 6, 'M'),
 ('B', 25, 'XL'),
 ('C', 7, 'M'),
 ('D', 2, 'S'),
 ('F', 45, 'XL'),
 ('E', 10, 'M'),
 ('G', 16, 'M'),
 ('G', 16, 'L')]

Get the pandas DataFrame:

pd.DataFrame(test, columns=['id_product', 'size', 'type'])

  id_product  size type
0          A     6    M
1          B    25   XL
2          C     7    M
3          D     2    S
4          F    45   XL
5          E    10    M
6          G    16    M
7          G    16    L

Note that the last item, 'G', returns two rows, since its size (16) matches both the M and L ranges under the inclusive conditions.
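Another loop-free option in plain pandas is `pd.merge_asof`, which for each product picks the type row with the largest size_min not exceeding size; a final filter then enforces the upper bound. A sketch; note that this finds at most one match per product, so 'G' gets only L here:

```python
import pandas as pd

df_types = pd.DataFrame([[1, 5, 'S'], [6, 16, 'M'], [16, 24, 'L'], [25, 50, 'XL']],
                        columns=['size_min', 'size_max', 'type'])
df_products = pd.DataFrame([['A', 6], ['B', 25], ['C', 7], ['D', 2],
                            ['F', 45], ['E', 10], ['G', 16]],
                           columns=['id_product', 'size'])

# both sides must be sorted on their join keys for merge_asof
out = pd.merge_asof(
    df_products.sort_values('size'),
    df_types.sort_values('size_min'),
    left_on='size', right_on='size_min',
    direction='backward',   # pick the largest size_min <= size
)
# enforce the upper bound of the matched range
out = out[out['size'] <= out['size_max']].drop(columns=['size_min', 'size_max'])
```

Because it sorts and scans instead of building a cartesian product, this scales much better than the merge-and-filter approach, at the cost of assuming each size should map to a single range.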

Update 2023

Use conditional_join from pyjanitor for non-equi joins:

# pip install pyjanitor
import pandas as pd

(df_products
.conditional_join(
    df_types, 
    # column from left, column from right, comparator
    ('size', 'size_min', '>='), 
    ('size', 'size_max', '<='), 
    # depending on the data size, 
    # you could get better performance
    # by using numba, if it is installed
    use_numba=False,
    right_columns='type')
) 
  id_product  size type
0          A     6    M
1          B    25   XL
2          C     7    M
3          D     2    S
4          F    45   XL
5          E    10    M
6          G    16    M
7          G    16    L

2 Comments

You can also use DuckDB or pandasql (unmaintained) for SQL SELECT queries.
In terms of speed, pandasql may not hold its own; DuckDB does have an efficient range-join implementation.
