
Using Spark 2.3.2 with Python, I am trying to use "alias" to join two DataFrames after applying a filter, all in a single statement, as in the code below. But it throws the error shown beneath it.

code:

    orders.filter(orders.order_status.isin("CLOSED", "COMPLETE")).select("order_id", "order_date").alias("a")\
        .join(orderitems.select("order_item_order_id", "order_item_subtotal").alias("b"), a.order_id == b.order_item_order_id)\
        .drop(b.order_item_order_id)

error:

        NameError: name 'a' is not defined

I need to get the CLOSED and COMPLETE orders from the orders DataFrame and, in the same step, join the result with another DataFrame, orderitems, and then drop the duplicate column. In other words, I am looking to give a DataFrame an alias the same way you give a table an alias in SQL. Could anyone help me understand where I am going wrong?

2 Answers


You don't need the alias: you can specify orderitems.order_item_order_id in the drop directive and orders.order_id == orderitems.order_item_order_id in the join clause.

If you want shorter names, you can break this into multiple statements (the overall execution will be the same, since Spark generates the execution plan later):

    a = orders.filter(orders.order_status.isin("CLOSED", "COMPLETE")).select("order_id", "order_date")
    b = orderitems.select("order_item_order_id", "order_item_subtotal")

and then you can use a and b in the join and drop.
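Putting it together, a minimal sketch of that last step (result is just an illustrative name):

    # join on the order id, then drop the duplicate key column from the b side
    result = a.join(b, a.order_id == b.order_item_order_id).drop(b.order_item_order_id)
    result.show()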


3 Comments

Yeah, I had this in mind, but was wondering whether it could have any performance impact. As you said, though, Spark generates the DAG of stages and the physical execution plan later; I forgot that for a moment. Thanks for the help. But since I started with "alias", I would like to know: can't we give alias names to DataFrames just like aliases to tables in SQL? I know we can do it for column names in a DataFrame, but what about the DataFrame itself?
@akhilpathirippilly you can df.registerTempTable(alias) and then write proper SQL with spark.sql('select ..')
Yeah, that becomes proper Spark SQL. I was just checking whether it's possible with the DataFrame API. But anyway, if we have this many options, why go for another one? Thanks for spending your time on my post. :)
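For reference, a minimal sketch of that temp-table route (assuming a SparkSession named spark; createOrReplaceTempView is the Spark 2.x successor to the older registerTempTable):

    # register each filtered/projected DataFrame under a short alias
    orders.filter(orders.order_status.isin("CLOSED", "COMPLETE"))\
        .select("order_id", "order_date").createOrReplaceTempView("a")
    orderitems.select("order_item_order_id", "order_item_subtotal").createOrReplaceTempView("b")

    # plain SQL can now use the aliases, just like table aliases
    result = spark.sql(
        "SELECT a.order_id, a.order_date, b.order_item_subtotal "
        "FROM a JOIN b ON a.order_id = b.order_item_order_id"
    )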

Refer to your columns as col("alias.col_name"):

    from pyspark.sql.functions import col

    orders\
        .filter(orders.order_status.isin("CLOSED", "COMPLETE"))\
        .select("order_id", "order_date").alias("a")\
        .join(orderitems.select("order_item_order_id", "order_item_subtotal").alias("b"),
              col("a.order_id") == col("b.order_item_order_id"))

Try that; it should work.
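To also drop the duplicate key column, as the question asked, the aliased reference works in drop as well (a sketch; joined stands for the result of the join above):

    # joined = the DataFrame produced by the aliased join above
    deduped = joined.drop(col("b.order_item_order_id"))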

