I want to merge an Excel file with a SQL table in pandas. Here's my code:

import pandas as pd
import pymysql
from sqlalchemy import create_engine
data1 = pd.read_excel('data.xlsx')
engine = create_engine('...cloudprovider.com/...')
data2 = pd.read_sql_query("select id, column3, column4 from customer", engine)
data = data1.merge(data2, on='id', how='left')

It works. To make it clearer:

data1.columns outputs Index(['id', 'column1', 'column2'], dtype='object')

data2.columns outputs Index(['id', 'column3', 'column4'], dtype='object')

data.columns outputs Index(['id', 'column1', 'column2', 'column3', 'column4'], dtype='object')

Since data2 is getting bigger, I can't query the whole table anymore, so I want to query only the rows of data2 whose id exists in data1. How should I do this?

  • If data1['id'] is large (thousands of ids), you might find the solution from here stackoverflow.com/questions/48392311/… efficient. It'll allow you to use proper joins, in addition to IN operator queries. Commented May 30, 2018 at 7:33
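One way to get a proper join instead of a long IN list is to upload the ids into a temporary table and join against it. This is only a sketch of that general idea, not necessarily what the linked answer does. It uses the standard-library sqlite3 module as a stand-in for the real database; the table contents, the ids list, and the wanted_ids table name are all made up for illustration.

```python
import sqlite3

# Stand-in for the remote database: an in-memory SQLite table named customer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, column3 TEXT, column4 TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "a", "w"), (2, "b", "x"), (3, "c", "y"), (4, "d", "z")],
)

# Stand-in for data1['id'].tolist(); the values are hypothetical.
ids = [2, 4, 4, 99]

# Upload the distinct ids into a temporary table (hypothetical name wanted_ids).
conn.execute("CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO wanted_ids VALUES (?)", [(i,) for i in sorted(set(ids))]
)

# Join against the temporary table instead of building a long IN (...) list.
rows = conn.execute(
    "SELECT c.id, c.column3, c.column4 "
    "FROM customer AS c JOIN wanted_ids AS w ON w.id = c.id "
    "ORDER BY c.id"
).fetchall()
print(rows)
```

With pandas, the same SELECT could be passed to pd.read_sql_query once the ids have been uploaded over the same connection. Whether temporary tables are permitted depends on the database and the account's privileges.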

2 Answers

You could leverage the fact that SQLAlchemy is a great query builder. Either reflect the customer table, or build the metadata by hand:

from sqlalchemy import MetaData, select

metadata = MetaData()
metadata.reflect(engine, only=['customer'])
customer = metadata.tables['customer']

and build your query, letting SQLAlchemy worry about proper usage of placeholders, data conversion etc. You're looking for customer rows where id is in the set of ids from data1, achieved in SQL with the IN operator:

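# SQLAlchemy 1.4+ takes the columns positionally; older versions need select([...])
query = select(customer.c.id,
               customer.c.column3,
               customer.c.column4).\
    where(customer.c.id.in_(data1['id'].tolist()))  # plain Python values, not a Series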

data2 = pd.read_sql_query(query, engine)

If you wish to keep on using SQL strings manually, you could build a parameterized query as such:

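# Drop missing ids so the number of placeholders matches the number of values
ids = data1['id'].dropna().tolist()
# Note that you're not formatting the actual values here, but placeholders
placeholders = ','.join(['%s'] * len(ids))
query = f"SELECT id, column3, column4 FROM customer WHERE id IN ({placeholders})"
# A tuple of plain Python values is passed to the driver as one parameter set
data2 = pd.read_sql_query(query, engine, params=tuple(ids))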

In general it is beneficial to learn to use placeholders instead of mixing SQL and values by formatting or concatenating strings, since the latter may expose you to SQL injection when handling user-generated data. Usually you'd write the required placeholders directly in the query string, but some string building is required if you have a variable number of parameters [1].

[1]: Some DB-API drivers, such as psycopg2, allow passing tuples and lists as scalar values and know how to construct suitable SQL.
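To illustrate why placeholders are safer, here is a minimal, self-contained sketch using the standard-library sqlite3 module as a stand-in for the real database. Note that sqlite3 uses ? as its placeholder where pymysql uses %s. The fetch_customers helper and the sample rows are hypothetical.

```python
import sqlite3

# Stand-in for the real database: an in-memory SQLite table named customer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, column3 TEXT, column4 TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "b", "y"), (3, "c", "z")],
)


def fetch_customers(conn, ids):
    """Return customer rows whose id is in ids, one placeholder per value."""
    ids = list(ids)
    if not ids:
        # An empty "IN ()" is rejected by some databases (MySQL, for example).
        return []
    # Only the placeholders are formatted into the SQL, never the values.
    placeholders = ",".join("?" * len(ids))
    query = (
        "SELECT id, column3, column4 FROM customer "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(query, ids).fetchall()


safe = fetch_customers(conn, [1, 3])
# A hostile value is compared as data; it cannot change the query's structure.
hostile = fetch_customers(conn, ["1) OR (1=1"])
print(safe, hostile)
```

Had the hostile value been concatenated into the SQL string instead, the WHERE clause would have matched every row.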


Since you are looking for a WHERE ... IN [some_list] condition, this should work for you:

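id_list = data1['id'].tolist()
# Concatenating a str and a tuple raises TypeError, so build the list as text.
# int() assumes integer ids; it also keeps arbitrary text out of the SQL.
id_sql = ", ".join(str(int(i)) for i in id_list)
your_query = "select id, column3, column4 from customer where id in (" + id_sql + ")"
data2 = pd.read_sql_query(your_query, engine)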

Hope it works.
