
I'm loading data from a SQL database into Python, and I need to apply criteria that come from a pandas DataFrame. A simplified example:

    some_sql = """
               select column1,columns2 
               from table 
               where a between '{}' and '{}'
                    or a between '{}' and '{}'
                    or a between '{}' and '{}'
              """.format(date1,date2,date3,date4,date5,date6)

date1, date2, date3, date4, date5 and date6 come from a pandas DataFrame. I could specify all 6 parameters manually, but I actually have over 20.

     df = DataFrame({'col1':['date1','date3','date5'],
                     'col2':['date2','date4','date6']})

Is there a way to build this with a loop instead, so it is more efficient?

2 Answers


Setup

import pandas as pd

# Create a dummy dataframe
df = pd.DataFrame({'col1':['date1','date3','date5'],
                   'col2':['date2','date4','date6']})

# Prepare the SQL (conditions will be added later)
some_sql = """
select column1,columns2 
from table 
where """

First approach

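conditions = []
for _, data in df.iterrows():
    # iterrows() yields (index, row) pairs; the index is not needed
    conditions.append(f"a between '{data['col1']}' and '{data['col2']}'")

# Join with "or" so the first condition follows "where" directly
some_sql += '\nor '.join(conditions)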

By using iterrows() we can iterate through the dataframe row by row.

Alternative

some_sql += '\nor '.join(df.apply(lambda x: f"a between '{x['col1']}' and '{x['col2']}'", axis=1).tolist())

Using apply() should be faster than iterrows():

Although apply() also inherently loops through rows, it does so much more efficiently than iterrows() by taking advantage of a number of internal optimizations, such as using iterators in Cython.

source

Another alternative

some_sql += '\nor '.join([f"a between '{row['col1']}' and '{row['col2']}'" for row in df.to_dict('records')])

This converts the dataframe to a list of dicts, and then applies a list comprehension to create the conditions.

Result

select column1,columns2 
from table 
where a between 'date1' and 'date2'
or a between 'date3' and 'date4'
or a between 'date5' and 'date6'

1 Comment

Thank you Kristof. A follow-up to my question: if I need to insert some params in the middle of the SQL query, for instance "select case when xxx then yyy" where xxx and yyy are params, should I break the SQL code into pieces to apply the conditions, or use iterators?
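
One possible way to handle parameters in the middle of a query, without splitting the SQL into pieces, is to use named placeholders and pass the values separately. This is only a sketch under assumptions: it uses the standard-library sqlite3 module (which supports the `:name` style) so that it runs on its own, and the table `events`, its sample rows, and the parameter names `cutoff`, `late_label` and `early_label` are made up for illustration. It also assumes xxx and yyy are values rather than column names or SQL fragments, which cannot be bound this way.

```python
import sqlite3

# Throwaway in-memory table, only to make the sketch runnable (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("create table events (column1 text, a text)")
conn.executemany(
    "insert into events values (?, ?)",
    [("x", "2021-01-15"), ("y", "2021-06-15")],
)

# Named placeholders can appear anywhere a value is allowed,
# including inside a CASE expression.
some_sql = """
select column1,
       case when a >= :cutoff then :late_label else :early_label end as period
from events
order by column1
"""

# The values travel separately from the SQL text, keyed by placeholder name.
params = {"cutoff": "2021-03-01", "late_label": "late", "early_label": "early"}

rows = conn.execute(some_sql, params).fetchall()
print(rows)
```

Whether named placeholders are available, and how they are written, depends on the database driver in use.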

As a secondary note to Kristof's answer above: even as an analyst, one should be careful about things like SQL injection, so inlining data into the query string is something to avoid.

If possible, define your query once with placeholders and then build a param list to go with the placeholders. This also saves you the string formatting.

So in your case your query looks like:

some_sql = """
           select column1,columns2 
           from table 
           where a between ? and ?
                or a between ? and ?
                or a between ? and ?
           """

And our param list generation is going to look like:

params = []
for _, data in df.iterrows():
    # iterrows() yields (index, row) pairs; the index is not needed
    params.append(data['col1'])
    params.append(data['col2'])

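Then execute the SQL with your driver's placeholder syntax, passing the params list as the values for the placeholders. The ? marker shown here is one common style; some drivers use a different one.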
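
With more than a handful of ranges, the placeholder clause itself can be generated from the same rows that supply the parameters. The following is a minimal sketch, assuming a driver that accepts ? placeholders; the standard-library sqlite3 module is used here only so the example runs on its own. The table `events`, its columns, and the sample dates are made up for illustration, and `ranges` stands in for the (col1, col2) rows of the DataFrame, which could be obtained with something like `list(df[['col1', 'col2']].itertuples(index=False, name=None))`.

```python
import sqlite3

# Hypothetical stand-in for the (col1, col2) rows of the DataFrame.
ranges = [
    ("2021-01-01", "2021-01-31"),
    ("2021-03-01", "2021-03-31"),
    ("2021-05-01", "2021-05-31"),
]

# One "a between ? and ?" per range, joined with "or".
where_clause = "\n   or ".join("a between ? and ?" for _ in ranges)
some_sql = f"select column1, a\nfrom events\nwhere {where_clause}"

# Flatten the pairs so the values line up with the placeholders in order.
params = [value for pair in ranges for value in pair]

# Throwaway in-memory table, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("create table events (column1 text, a text)")
conn.executemany(
    "insert into events values (?, ?)",
    [("x", "2021-01-15"), ("y", "2021-02-15"), ("z", "2021-05-20")],
)

rows = conn.execute(some_sql, params).fetchall()
print(some_sql)
print(rows)
```

The same `some_sql` and `params` can be passed to `pd.read_sql(some_sql, conn, params=params)` to get the result back as a DataFrame, as long as the placeholder style matches what the underlying driver expects.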
