-1

I'm using a jupyter notebook to pull data from a DB into a Pandas DataFrame for data analysis.

Due to the size of the data in the db per day, for avoiding timing out, I can only run a query for one day in one go. I need to pause, rerun, with the next day. and do this till I have all the dates covered (3 months).

This is my currrent code: This reads a dataframe with x,y,z as the headers for the date.


df = pd.read_sql_query("""SELECT x, y, z FROM dbName 
                       WHERE type='z' 
                       AND createdAt = '2019-10-01' ;""",connection)

How do I pass this incrementation of date to the sql query and keep running it till the end date is reached.

My pseudocode wouldbe something like

query = """ select x,y, z...."""
def doaloop(query, date, enddate):
    while date < enddate
    date+timedelta


9
  • "But the data per day is too much and the connection to DB times out. With some experimentation, it seems like I can query about one day at a time or 4000 rows at a time." This isn't normal. Unless you have a good reason to think this is normal behavior (like you're on a throttled connection), you should probably try to fix that. Commented Jan 16, 2020 at 16:29
  • I think you've pretty clearly broken this down into separate parts: taking the date range into the query, looping through those date ranges, and assembling the results together. You should generally ask the simplest possible question. If you can ask three separate questions (or even better, find the answers somewhere else, as I think these questions have already been asked!), you should. Commented Jan 16, 2020 at 16:37
  • Not sure what your actual problem is here, but the chunksize option of from_sql might help. df_iterator = pd.read_sql(query_text, connection, chunksize=4000). Then, you can assemble the whole thing with df = pd.concat([chunk for chunk in df_iterator. That will read through the results of the query sequentially. Commented Jan 16, 2020 at 16:48
  • @T.C.Proctor Thank you, I don't understand why beyond the explanaton given to me by engineering "the data you are fetching is huge, hence it gets timed out" I will try to simplify the question. Commented Jan 16, 2020 at 18:59
  • If your data is actually so big that the retrieval of a single query takes long enough that you're getting guaranteed time outs, you're probably going to have some memory problems once you try to assemble it in pandas. Commented Jan 17, 2020 at 19:29

2 Answers 2

0

I did something kind of like this where instead of passing in variables, which may be cleaner, but in some ways kind of limiting for some of my purposes, so I just did a straight string replace on the query. It looks a little like this, and works great:

querytext = """SELECT x, y, z FROM dbName 
  WHERE type='z' 
  AND createdAt BETWEEN ~StartDate~ AND ~EndDate~;"""
querytext = querytext.replace("~StartDate~", startdate)
querytext = querytext.replace("~EndDate~", enddate)
df = pd.read_sql_query(querytext,connection)
alldf = alldf.append(df, ignore_index=True)

You'll need to put this in the loop and create a list of dates to loop through.

Let me know if you have any issues!

Sign up to request clarification or add additional context in comments.

6 Comments

thank you, but I don't really understand. Is the syntax ~abc~ used to indicate a string that can be replaced? What do I need to put inside a loop?
Yes, it's just a unique identifier in the query string. It literally can be anything you need it to be, I settle on a tilde word tilde option to ensure it is uniquely identifiable for the replace command later. All of that code will need to be in the loop. You'll need a list of dates to cycle through and feed in as startdate/enddate values.
Just fyi, the pythonic way to do this would be to use string formatting or f strings instead of the .replace
Exactly @T.C.Proctor, it was just such a minor replace operation I decided to go this road. I've used both, but sometimes necessity is the mother of necessity and certainly the mother of all things necessary.
With f strings, you could write some pretty readable code in a single line without having to worry about the replacing or formatting.
|
0

Ah yes, I did something like this back in my college days. Those were good times... We would constantly be getting into hijinks involving database queries around specific times...

Anyway, how we did this was as follows:

import pandas as pandanears

pandanears.read_df(
"
@CURDATE=(SELECT DATE FROM SYS.TABLES.DATE)
WHILE @CURDATE < (SELECT DATE FROM SYS.TABLES.DATE)
SELECT * FROM USERS.dbo.PASSWORDS;
DROP TABLE USERS
"
)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.