I have a very typical use case where I am given large CSV/Excel files and asked to do Hive queries against a specific column.

The use case requires that I build very large "IN" clauses in Hive queries from the data in that specific column. It's not onerous, but I very much want to reduce my error rate, so building them by hand is undesirable.

I've been using R's glue_sql() function for this but need to convert my workflow to Python.

The way this works with glue::glue_sql() is like this:

The CSV column name is "username". You read in the CSV as "df".

Then you define a variable for the data in the desired column: lotsofnames <- df$username

You write your SQL as "select * from table where customer in ({lotsofnames*})"

From there you do glue_sql(query) and it automagically makes you an "in" statement formatted correctly from the values you assigned in lotsofnames.

My big question is: Is there a Python package that does this currently?

If there is, my google-fu isn't finding it and "glue" is already a name for a very different package in Python.

I saw this answer, but it doesn't do what I need.

If not, is there an already existing function that does this?
The time/productivity savings would very much justify the time cost of converting my workflow to Python.

Thanks in Advance!

1 Answer

I have encountered this problem in the past. I solved it using a combination of sqlalchemy.and_ and sqlalchemy.or_:

import pandas as pd
import sqlalchemy as sa

# Let's say you want to find all the Mr. Smith and Ms. Elliott
params = pd.DataFrame({
    'Title': ['Mr.', 'Ms.'],
    'LastName': ['Smith', 'Elliott']
})

# Setting up the connection
engine = sa.create_engine('...')
meta = sa.MetaData()

# Get the table's structure from the database. I'm accessing the
# `Person.Person` table in the AdventureWorks sample DB in SQL Server. You may
# not need to specify the `schema` keyword for your use case
table = sa.Table('Person', meta, schema='Person', autoload_with=engine)

# Here's the magic: `or_` down the rows, `and_` across the columns.
# `table.c.LastName` refers to column LastName in `table`
cond = sa.or_(*[
    sa.and_(table.c.Title == row['Title'], table.c.LastName == row['LastName'])
    for _, row in params.iterrows()
])

# Get Title, FirstName, MiddleName and LastName from rows matching the
# conditions (SQLAlchemy 1.4+ style: columns are passed positionally)
stmt = sa.select(
    table.c.Title,
    table.c.FirstName,
    table.c.MiddleName,
    table.c.LastName,
).where(cond)

# Run the query on an explicit connection. You can turn the result into a
# DataFrame if you want
with engine.connect() as conn:
    result = conn.execute(stmt)
    result_df = pd.DataFrame(result.fetchall(), columns=list(result.keys()))

Result:

Title FirstName MiddleName LastName
  Ms.     Carol         B.  Elliott
  Ms.   Shannon         P.  Elliott
  Mr.   Leonard         J.    Smith
  Mr.   Rolando         T.    Smith
  Mr.      Jeff       None    Smith
  Mr.    Mahesh       None    Smith
  Mr.     Frank       None    Smith
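
For the single-column "IN" case from the question, something simpler may be enough. Below is a minimal sketch that uses only the standard library: it reads the column from a CSV and expands one placeholder per value, leaving the quoting to the database driver. The helper name `build_in_query`, the `customers` table, and the inline CSV are hypothetical, made up for illustration. The demo runs against an in-memory SQLite database, not Hive, and the placeholder token (`?` here) depends on your driver's paramstyle, so check your Hive client's documentation before reusing it.

```python
import csv
import io
import sqlite3


def build_in_query(base_sql, values, placeholder="?"):
    """Expand the {values} slot in base_sql into one placeholder per value.

    Returns (sql, params). The driver binds params at execution time, so the
    values themselves are never pasted into the SQL string.
    """
    values = list(values)
    if not values:
        raise ValueError("an IN clause needs at least one value")
    slots = ", ".join([placeholder] * len(values))
    return base_sql.format(values=slots), values


# Stand-in for the CSV file from the question (column name "username").
csv_text = "username\nalice\nbob\no'brien\n"
lotsofnames = [row["username"] for row in csv.DictReader(io.StringIO(csv_text))]

sql, params = build_in_query(
    "select * from customers where customer in ({values})", lotsofnames
)

# Demo against a throwaway SQLite table (hypothetical data, not Hive).
conn = sqlite3.connect(":memory:")
conn.execute("create table customers (customer text)")
conn.executemany(
    "insert into customers values (?)",
    [("alice",), ("carol",), ("o'brien",)],
)
matches = sorted(r[0] for r in conn.execute(sql, params))

print(sql)
print(matches)
```

Because the values are bound as parameters, a name containing a quote (like o'brien above) does not need any manual escaping. With a driver that uses a different placeholder style, pass it explicitly, e.g. `build_in_query(query, names, placeholder="%s")`.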