1

My question is regarding ETL best practices and how to make things a bit simpler/efficient.

I have some data that I pull from an API, which is then transformed. Let's call this "df_upc" and assume it's a Pandas dataframe of UPCs and I'll omit the earlier transformation process for lack of relevancy. This data is pushed to SQL Server via the Pandas to_sql method: creating/updating a table, and then a Select Statement in SQL joins the table and some other data to get current inventory levels. The SQL Server table that holds the inventory levels holds millions of SKUs.

I don't love having to create/update a table every time I want to join some data from the server to my local file and would love a better way to read it in. I've tried querying the data without the join, but then it loads all the millions of rows before joining it in Python and ends up being super slow. Looking for a way to do this in the Python script so it can remain automated. The process works fine for now, I just feel like there's a better way. Thanks for any help/suggestions!

import sqlalchemy, pandas as pd
from sqlalchemy import create_engine

engine1 = sqlalchemy.create_engine('mssql+pyodbc://@SERVER/DATABASE?trusted_connection=yes&driver=SQL+Server')
engine2 = sqlalchemy.create_engine('mssql+pyodbc://@SAME_SERVER/ANOTHER_DATABASE?trusted_connection=yes&driver=SQL+Server')

df_upc.to_sql('myTable', engine1, if_exists='replace', index=False)

#query has been stripped down for this example
query = pd.read_sql_query(
  '''SELECT s.qty, s.upc, m.sku
FROM StyleInventory s
INNER JOIN marketing.dbo.myTable m on s.upc = m.upc;''', engine2) 

df_joined = pd.DataFrame(query)

df_joined.to_excel("joined.xlsx", index=False)
6
  • 1
    Related answer here. That example uses a TVP for an UPDATE, but the SP could have just as easily done a SELECT. Commented May 18, 2021 at 21:00
  • How large is the myTable (rows and columns)? Commented May 19, 2021 at 2:16
  • This question/answer or Use temp table with SQLAlchemy might give you alternative solution using CTE Commented May 19, 2021 at 2:38
  • @van, about 4000 rows, often only UPC and maybe a SKU column. Commented May 19, 2021 at 12:14
  • I see. I still would look at the first link in my comment above using VALUES construct. Which RDBMS are you using? Commented May 19, 2021 at 13:28

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.