
I am very new to PySpark. I have a requirement where I need to get one column, say 'id', from one MySQL table, and for each id I need to get the 'HOST' value, which is a column in another MySQL table. I have completed the first part and I am getting the id with the piece of code below.

criteria_df = read_data_from_table(criteria_tbl)
datasource_df = read_data_from_table(data_source_tbl)
import pyspark.sql.functions as F

for row in criteria_df.collect(): 
  account_id = row["account_id"]
  criteria_name = row["criteria"]
  datasource_df = datasource_df.select(F.col('host')).where(F.col('id') == account_id)
  datasource_df.show()

But when I try to get the host value for each id, I am not getting any value.

2 Answers


You should put the where clause before the select clause; otherwise it always returns nothing, because after the select the column referenced in the where clause no longer exists.

datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))

Also, for this type of query it's better to do a join instead of collecting the dataframes and comparing them row by row.
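For example, a join-based version could look roughly like this (a sketch only; it reuses your read_data_from_table helper and the column names account_id, id and host from the question):

import pyspark.sql.functions as F

criteria_df = read_data_from_table(criteria_tbl)      # your existing helper
datasource_df = read_data_from_table(data_source_tbl)

# One inner join replaces the per-row loop: every id that matches an
# account_id contributes its host, all in a single distributed operation.
hosts_df = datasource_df.join(
    criteria_df,
    on=datasource_df['id'] == criteria_df['account_id'],
    how='inner'
).select(F.col('host'))

hosts_df.show()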


2 Comments

Okay! So after the join, I can easily iterate over the resulting dataframe.
@VishnuChaturvedi Iterating over DataFrames is generally slow. It's better to use Spark SQL functions to operate on DataFrames in parallel, which gives much better performance.

You can use a semi-join:

datasource_df.join(criteria_df, on=datasource_df['id'] == criteria_df['account_id'], how='left_semi')\
.select(F.col('host'))
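The left semi join acts as a filter: it keeps only the rows of datasource_df whose id has a matching account_id in criteria_df, and the result contains only the columns of the left DataFrame, so the final select('host') works without bringing in any columns from criteria_df.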

