
Using Spark 2.3.2 with Python, I am trying to use "alias" to join two DataFrames after applying a filter, all in a single statement, as in the code below. But it throws the error shown beneath it.

code:

    orders.filter(orders.order_status.isin("CLOSED", "COMPLETE")).select("order_id", "order_date").alias("a")\
        .join(orderitems.select("order_item_order_id", "order_item_subtotal").alias("b"), a.order_id == b.order_item_order_id)\
        .drop(b.order_item_order_id)

error:

        NameError: name 'a' is not defined

I need to get the CLOSED and COMPLETE orders from the orders DataFrame and, in the same step, join the result with another DataFrame, orderitems, and then drop the duplicate column. In other words, I am looking to give a DataFrame an alias the same way you give a table an alias in SQL. Could anyone help me understand where I am going wrong?

2 Answers


You don't need the alias: you can specify orderitems.order_item_order_id in the drop directive and orders.order_id == orderitems.order_item_order_id in the join clause.

If you want shorter names, you can break this into multiple statements (the overall execution will be the same, since Spark generates the execution plan later):

    a = orders.filter(orders.order_status.isin("CLOSED", "COMPLETE")).select("order_id", "order_date")
    b = orderitems.select("order_item_order_id", "order_item_subtotal")

and then you can use a and b in the join and drop.
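Putting it together, a minimal sketch of that last step (result is just an illustrative name):

    # join on the order id, then drop the duplicate key column from the b side
    result = a.join(b, a.order_id == b.order_item_order_id).drop(b.order_item_order_id)
    result.show()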


3 Comments

Yeah, I had this in mind, but was wondering whether it could have any performance impact. As you said, though, Spark generates the DAG of stages and the physical execution plan later; I forgot that for a moment. Thanks for the help. But since I started with "alias", I would like to know: can't we give alias names to DataFrames just like aliases to tables in SQL? I know we can do it for column names in a DataFrame, but what about the DataFrame itself?
@akhilpathirippilly you can df.registerTempTable(alias) and then write proper SQL with spark.sql('select ..')
Yeah, that becomes proper Spark SQL. I was just checking whether it's possible with the DataFrame API. But anyway, if we have this many options, why go for another one? Thanks for spending your time on my post. :)
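For reference, a minimal sketch of that temp-table route (assuming a SparkSession named spark; createOrReplaceTempView is the Spark 2.x successor to the older registerTempTable):

    # register each filtered/projected DataFrame under a short alias
    orders.filter(orders.order_status.isin("CLOSED", "COMPLETE"))\
        .select("order_id", "order_date").createOrReplaceTempView("a")
    orderitems.select("order_item_order_id", "order_item_subtotal").createOrReplaceTempView("b")

    # plain SQL can now use the aliases, just like table aliases
    result = spark.sql(
        "SELECT a.order_id, a.order_date, b.order_item_subtotal "
        "FROM a JOIN b ON a.order_id = b.order_item_order_id"
    )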

Refer to your columns as col("alias.col_name"):

    from pyspark.sql.functions import col

    orders\
        .filter(orders.order_status.isin("CLOSED", "COMPLETE"))\
        .select("order_id", "order_date").alias("a")\
        .join(orderitems.select("order_item_order_id", "order_item_subtotal").alias("b"),
              col("a.order_id") == col("b.order_item_order_id"))

Try that; it should work.
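To also drop the duplicate key column, as the question asked, the aliased reference works in drop as well (a sketch; joined stands for the result of the join above):

    # joined = the DataFrame produced by the aliased join above
    deduped = joined.drop(col("b.order_item_order_id"))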

