How do I make inner join with external joiner on sql in python in efficient way

Question

I want to merge an excel file with sql in pandas, here's my code

import pandas as pd
import pymysql
from sqlalchemy import create_engine
data1 = pd.read_excel('data.xlsx')
engine = create_engine('...cloudprovider.com/...')
data2 = pd.read_sql_query("select id, column3, column4 from customer", engine)
data = data1.merge(data2, on='id', how='left')

It works, just to make it clearer

If input data1.columns the output Index(['id', 'column1', 'column2'], dtype='object')

If input data2.columns the output Index(['id', 'column3', 'column4'], dtype='object')

If input data.columns the output Index(['id', 'column1', 'column2', 'column3', 'column4'], dtype='object')

Since the data2 getting bigger, I can't query entirely, so I want to query data2 with id that exist on data1. How suppose I do this?

If data1['id'] is large (thousands of ids), you might find the solution from here stackoverflow.com/questions/48392311/… efficient. It'll allow you to use proper joins, in addition to IN operator queries. — Ilja Everilä
– Ilja Everilä, Commented May 30, 2018 at 7:33

Ilja Everilä · Accepted Answer · 2018-05-30 09:04:32Z

You could leverage the fact that SQLAlchemy is a great query builder. Either reflect the customer table, or build the metadata by hand:

from sqlalchemy import MetaData, select

metadata = MetaData()
metadata.reflect(engine, only=['customer'])
customer = metadata.tables['customer']

and build your query, letting SQLAlchemy worry about proper usage of placeholders, data conversion etc. You're looking for customer rows where id is in the set of ids from data1, achieved in SQL with the IN operator:

query = select([customer.c.id,
                customer.c.column3,
                customer.c.column4]).\
    where(customer.c.id.in_(data1['id']))

data2 = pd.read_sql_query(query, engine)

If you wish to keep on using SQL strings manually, you could build a parameterized query as such:

placeholders = ','.join(['%s'] * data1['id'].count())
# Note that you're not formatting the actual values here, but placeholders
query = f"SELECT id, column3, column4 FROM customer WHERE id IN ({placeholders})"
data2 = pd.read_sql_query(query, engine, params=data1['id'])

In general it is beneficial to learn to use placeholders instead of mixing SQL and values by formatting/concatenating strings, as it may expose you to SQL injection, if handling user generated data. Usually you'd write required placeholders in the query string directly, but some string building is required, if you have a variable amount of parameters¹.

¹: Some DB-API drivers, such as psycopg2, allow passing tuples and lists as scalar values and know how to construct suitable SQL.

Nishant Patel · Accepted Answer · 2018-05-30 06:33:21Z

0

Since you are looking into a condition as WHERE IN [Some_List]. This should work for you

id_list = data1['id'].tolist()
your_query = "select id, column3, column4 from customer where id in "+tuple(id_list)
data2 = pd.read_sql_query(your_query , engine)

Hope it works.

answered May 30, 2018 at 6:33

Nishant Patel

1,43617 silver badges25 bronze badges

Collectives™ on Stack Overflow

How do I make inner join with external joiner on sql in python in efficient way

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related