
I am building a Python application with many interactions between Amazon Redshift and the local machine (sending queries to Redshift, pulling results back locally, etc.). My question is: what is the cleanest way to handle such interactions?

Currently, I am using SQLAlchemy to load tables directly onto the local machine via pandas.read_sql(). But I am not sure this is very optimised or safe.
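For reference, here is roughly what I do today (the connection string and table name are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; the redshift+psycopg2 dialect comes from the
    # sqlalchemy-redshift package (a plain postgresql+psycopg2 URL also works)
    engine = create_engine(
        "redshift+psycopg2://user:password@my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb"
    )

    # Pull the whole query result into a local DataFrame
    df = pd.read_sql("SELECT * FROM my_schema.my_table LIMIT 10000", engine)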

Would it be better to go through Amazon S3, bring the files back with boto, and finally read them with pandas.read_csv()?
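That is, something along these lines, where the bucket, credentials, and paths are all made up (I have used boto3 here, but boto 2 would work the same way):

    import boto3
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine("redshift+psycopg2://user:password@host:5439/mydb")

    # UNLOAD writes the query result to S3 as delimited files;
    # the bucket name and credentials below are placeholders
    with engine.begin() as conn:
        conn.execute(text("""
            UNLOAD ('SELECT * FROM my_schema.my_table')
            TO 's3://my-bucket/exports/my_table_'
            CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=...'
            DELIMITER ',' ALLOWOVERWRITE PARALLEL OFF
        """))

    # Download the resulting file (PARALLEL OFF produces a single part,
    # typically suffixed "000") and read it locally
    s3 = boto3.client("s3")
    s3.download_file("my-bucket", "exports/my_table_000", "/tmp/my_table.csv")
    df = pd.read_csv("/tmp/my_table.csv", header=None)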

Finally, is there a better way to handle such interactions, perhaps without doing everything in Python?

1 Answer


You can look at the blaze ecosystem for ideas and libraries you might find useful: http://blaze.pydata.org

The blaze library itself lets you write queries at a high, pandas-like level, and then translates them into SQL that runs on Redshift (using SQLAlchemy): http://blaze.readthedocs.org/en/latest/index.html
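Roughly along these lines (the connection string, table, and column names are invented, and blaze's API has shifted between releases, so treat this as a sketch):

    from blaze import Data, by, compute

    # Point blaze at a Redshift table through a SQLAlchemy URI (placeholder
    # credentials); the redshift+psycopg2 dialect comes from sqlalchemy-redshift
    orders = Data('redshift+psycopg2://user:pass@host:5439/mydb::orders')

    # Pandas-like expressions get translated to SQL and executed on Redshift
    big = orders[orders.amount > 100]
    totals = by(orders.customer_id, total=orders.amount.sum())

    # Force evaluation into a concrete local result
    result = compute(totals)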

This may be too high-level for your purposes, and you might need more precise control over the behaviour, but it would let you keep the code similar regardless of how and when you move the data around.

The odo library can be used on its own, independently of blaze, to copy data from Redshift to S3 to local files and back: http://odo.readthedocs.org/en/latest/
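For example, something like the following (the URIs are placeholders; odo's Redshift support goes through sqlalchemy-redshift and its S3 support through boto, so check the docs for the exact URI formats and credentials handling):

    import pandas as pd
    from odo import odo

    # Unload a Redshift table to S3 (odo drives UNLOAD/COPY behind the scenes)
    odo('redshift+psycopg2://user:pass@host:5439/mydb::my_table',
        's3://my-bucket/exports/my_table.csv')

    # Pull the S3 file down to a local CSV, then load it into pandas
    odo('s3://my-bucket/exports/my_table.csv', 'my_table_local.csv')
    df = odo('my_table_local.csv', pd.DataFrame)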
