I have a very typical use case: I'm given large CSV/Excel files and asked to run Hive queries against a specific column.
The use case requires building very large "IN" statements in the Hive queries from the data in that column. It's not onerous, but I very much want to cut down on manual typos and errors, so doing these by hand is undesirable.
I've been using R's glue_sql() function for this but need to convert my workflow to Python.
The way this works with glue::glue_sql() is:
The CSV has a column named "username", and you read the CSV in as "df".
Then you define a variable for the data in the desired column: lotsofnames <- df$username
You write your SQL as "select * from table where customer in ({lotsofnames*})".
From there you call glue_sql(query, .con = con) (it needs a DBI connection so it knows how to quote values) and it automagically builds a correctly quoted "in" list from the values you assigned to lotsofnames.
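
Putting those steps together, the R side looks roughly like this (a minimal sketch; "customers.csv" is a stand-in for my real file, and the throwaway in-memory SQLite connection is there only so glue_sql() knows how to quote values):

    library(glue)
    library(DBI)

    # Read the CSV; it has a column called "username"
    df <- read.csv("customers.csv", stringsAsFactors = FALSE)

    # The values that will end up inside the IN (...) list
    lotsofnames <- df$username

    # glue_sql() needs a DBI connection for quoting; an in-memory
    # SQLite connection is enough for that purpose
    con <- dbConnect(RSQLite::SQLite(), ":memory:")

    # {lotsofnames*} collapses the vector into a comma-separated, quoted list
    query <- glue_sql(
      "select * from table where customer in ({lotsofnames*})",
      .con = con
    )

    query
    #> <SQL> select * from table where customer in ('alice', 'bob', 'carol')

That last line (with my real values, of course) is exactly what I want to be able to generate from Python: the whole column collapsed into a correctly quoted IN list.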
My big question is: Is there a Python package that does this currently?
If there is, my google-fu isn't finding it, and "glue" is already the name of a very different package in Python.
I saw this answer, but it doesn't do what I need.
If not, is there an existing function somewhere that does the same thing?
The time/productivity savings would very much justify the time cost of converting my workflow to Python.
Thanks in advance!