
I would like to join two tables in Pandas.

df_types contains the size range for each product type (5000 rows):

| Table: TYPES |          |      |
|--------------|----------|------|
| size_min     | size_max | type |
| 1            | 5        | S    |
| 6            | 16       | M    |
| 16           | 24       | L    |
| 25           | 50       | XL   |

Dataframe code in Pandas:

import pandas as pd

df_types = pd.DataFrame([[1, 5, 'S'],
                         [6, 16, 'M'],
                         [16, 24, 'L'],
                         [25, 50, 'XL']],
                        columns=['size_min', 'size_max', 'type'])

df_products contains product ids and sizes (12000 rows):

| Table: Products |      |
|-----------------|------|
| id_product      | size |
| A               | 6    |
| B               | 25   |
| C               | 7    |
| D               | 2    |
| F               | 45   |
| E               | 10   |
| G               | 16   |

Dataframe code in Pandas:

df_products = pd.DataFrame([['A', 6],
                            ['B', 25],
                            ['C', 7],
                            ['D', 2],
                            ['F', 45],
                            ['E', 10],
                            ['G', 16]],
                           columns=['id_product', 'size'])

I'd like to make this SQL join in Pandas:

SELECT  df_products.*,
        df_types.type
FROM    df_products     LEFT JOIN df_types
                        ON  df_products.size >= df_types.size_min
                            AND df_products.size <= df_types.size_max

RESULT:

| id_product | size | type |
|------------|------|------|
| A          | 6    | M    |
| B          | 25   | XL   |
| C          | 7    | M    |
| D          | 2    | S    |
| F          | 45   | XL   |
| E          | 10   | M    |
| G          | 16   | M    |

thank you! ;-)

  • How large are your tables? Amount of rows Commented Feb 16, 2020 at 12:15
  • df_types 5000 rows and df_products 12000 rows Commented Feb 16, 2020 at 12:19
  • Another option is to use DuckDB for SQL queries, or use a real SQL database if you're already working with one. Commented Mar 1, 2024 at 19:49

2 Answers


Method 1: cross join with pd.merge

Although this is a common operation in SQL, there's no direct one-step method for it in pandas.

One of the solutions here would be to do a cross join to match all rows and then use DataFrame.query to filter the rows where size is between size_min and size_max.

But this results in an explosion of rows first, so in your case 12000 * 5000 = 60 000 000 rows.

dfn = (
    df_products.assign(key=1)
      .merge(df_types.assign(key=1), on='key')      # cartesian product via a constant key
      .query('size >= size_min & size < size_max')  # note: strict < on size_max
      .drop(columns='key')
)

   id_product  size  size_min  size_max type
1           A     6         6        16    M
7           B    25        25        50   XL
9           C     7         6        16    M
12          D     2         1         5    S
19          F    45        25        50   XL
21          E    10         6        16    M
26          G    16        16        24    L

Method 2: pd.IntervalIndex:

If you don't have overlapping ranges (for example, if we change the size_max of type M in df_types from 16 to 15, so the intervals become [1, 5], [6, 15], [16, 24], [25, 50]), we can use this method. This will not result in an explosion of rows.

idx = pd.IntervalIndex.from_arrays(df_types['size_min'], df_types['size_max'], closed='both')
# get_indexer maps each size to the position of its (unique) containing interval
event = df_types.loc[idx.get_indexer(df_products['size']), 'type'].to_numpy()

df_products['type'] = event

  id_product  size type
0          A     6    M
1          B    25   XL
2          C     7    M
3          D     2    S
4          F    45   XL
5          E    10    M
6          G    16    L
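Under the same non-overlap assumption, `pd.cut` gets there even more directly: encode the boundaries as bin edges and the types as labels. A sketch, assuming M is shrunk to 6–15 as above so the bins don't overlap:

```python
import pandas as pd

df_products = pd.DataFrame([['A', 6], ['B', 25], ['C', 7], ['D', 2],
                            ['F', 45], ['E', 10], ['G', 16]],
                           columns=['id_product', 'size'])

# bin edges for the non-overlapping ranges S=1-5, M=6-15, L=16-24, XL=25-50;
# each pd.cut interval is (left, right], hence the edges 0, 5, 15, 24, 50
bins = [0, 5, 15, 24, 50]
labels = ['S', 'M', 'L', 'XL']
df_products['type'] = pd.cut(df_products['size'], bins=bins, labels=labels)
```

The result column is categorical, which is convenient for later groupbys; the labels line up with the IntervalIndex output above.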

This is way longer than Erfan's solution; I am just offering this because I believe it could help avoid the increase in number of rows resulting from a merge.

What this does is build cond1 and cond2 to match the WHERE clause in the SQL query. The next step zips both lists and finds the index of the first (True, True) element; that index is the row position in df_types. Finally, concatenate all the rows extracted from df_types based on those indices, and concat the result back onto df_products.

There should be a better way than this; I do believe, however, that SQL does this way better.

cond1 = df_products['size'].apply(lambda x: [x>=i for i in [*df_types.size_min.array]])

cond2 = df_products['size'].apply(lambda x: [x<i for i in [*df_types.size_max.array]])

t = [list(zip(i,j)).index((True,True))
     for i,j in zip(cond1.array,cond2.array)]

result = (pd.concat([df_types.iloc[[i]]
                     for i in t])
          .filter(['type'])
          .reset_index(drop=True))

outcome = (pd.concat([df_products,result],
           axis=1,
           ignore_index=True,
           join='outer'))

outcome.columns = ['id_product', 'size', 'type']

  id_product  size type
0          A     6    M
1          B    25   XL
2          C     7    M
3          D     2    S
4          F    45   XL
5          E    10    M
6          G    16    L
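The pairwise checks above can also be vectorized with NumPy broadcasting instead of Python-level loops. A sketch using the inclusive bounds from the original SQL, so here 'G' matches both M and L:

```python
import numpy as np
import pandas as pd

df_types = pd.DataFrame([[1, 5, 'S'], [6, 16, 'M'], [16, 24, 'L'], [25, 50, 'XL']],
                        columns=['size_min', 'size_max', 'type'])
df_products = pd.DataFrame([['A', 6], ['B', 25], ['C', 7], ['D', 2],
                            ['F', 45], ['E', 10], ['G', 16]],
                           columns=['id_product', 'size'])

# (n_products, 1) against (n_types,) broadcasts to an (n_products, n_types)
# boolean matrix; nonzero() returns the matching (product, type) index pairs
size = df_products['size'].to_numpy()[:, None]
mask = (size >= df_types['size_min'].to_numpy()) & (size <= df_types['size_max'].to_numpy())
i, j = np.nonzero(mask)

outcome = pd.DataFrame({
    'id_product': df_products['id_product'].to_numpy()[i],
    'size': df_products['size'].to_numpy()[i],
    'type': df_types['type'].to_numpy()[j],
})
```

The boolean matrix is still O(rows x types) in memory, but it avoids the per-element Python overhead of the apply/zip approach.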

Update

Time passes, and hopefully we get better. I gave this another shot, moving the computation into vanilla Python before taking the final result back to Pandas:

from itertools import product
test = [(id_product,first,last)
        for (id_product,first), (second, third,last)
        in product(zip(df_products.id_product,df_products['size']),
                   df_types.to_numpy()
                  )
        if second <= first <= third
       ]

test

[('A', 6, 'M'),
 ('B', 25, 'XL'),
 ('C', 7, 'M'),
 ('D', 2, 'S'),
 ('F', 45, 'XL'),
 ('E', 10, 'M'),
 ('G', 16, 'M'),
 ('G', 16, 'L')]

Get the pandas DataFrame:

pd.DataFrame(test, columns=['id_product', 'size', 'type'])

  id_product  size type
0          A     6    M
1          B    25   XL
2          C     7    M
3          D     2    S
4          F    45   XL
5          E    10    M
6          G    16    M
7          G    16    L

Note that the last item, 'G', returns two rows, since its size (16) matches both the M and L ranges under the inclusive conditions.
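Another loop-free option in plain pandas is `pd.merge_asof`, which for each product picks the type row with the largest size_min not exceeding size; a final filter then enforces the upper bound. A sketch; note that this finds at most one match per product, so 'G' gets only L here:

```python
import pandas as pd

df_types = pd.DataFrame([[1, 5, 'S'], [6, 16, 'M'], [16, 24, 'L'], [25, 50, 'XL']],
                        columns=['size_min', 'size_max', 'type'])
df_products = pd.DataFrame([['A', 6], ['B', 25], ['C', 7], ['D', 2],
                            ['F', 45], ['E', 10], ['G', 16]],
                           columns=['id_product', 'size'])

# both sides must be sorted on their join keys for merge_asof
out = pd.merge_asof(
    df_products.sort_values('size'),
    df_types.sort_values('size_min'),
    left_on='size', right_on='size_min',
    direction='backward',   # pick the largest size_min <= size
)
# enforce the upper bound of the matched range
out = out[out['size'] <= out['size_max']].drop(columns=['size_min', 'size_max'])
```

Because it sorts and scans instead of building a cartesian product, this scales much better than the merge-and-filter approach, at the cost of assuming each size should map to a single range.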

Update 2023

Use conditional_join from pyjanitor for non-equi joins:

# pip install pyjanitor
import pandas as pd

(df_products
.conditional_join(
    df_types, 
    # column from left, column from right, comparator
    ('size', 'size_min', '>='), 
    ('size', 'size_max', '<='), 
    # depending on the data size, 
    # you could get better performance
    # by using numba, if it is installed
    use_numba=False,
    right_columns='type')
) 
  id_product  size type
0          A     6    M
1          B    25   XL
2          C     7    M
3          D     2    S
4          F    45   XL
5          E    10    M
6          G    16    M
7          G    16    L

2 Comments

You can also use DuckDB or pandasql (unmaintained) for SQL SELECT queries.
In terms of speed, pandasql may not hold its own; DuckDB does have an efficient range-join implementation.
