
I am very new to PySpark. I have a requirement where I need to get one column, say 'id', from one MySQL table, and for each id I need to get the 'HOST' value, which is a column in another MySQL table. I have completed the first part and I am getting the id with the piece of code below.

criteria_df = read_data_from_table(criteria_tbl)
datasource_df = read_data_from_table(data_source_tbl)
import pyspark.sql.functions as F

for row in criteria_df.collect(): 
  account_id = row["account_id"]
  criteria_name = row["criteria"]
  datasource_df = datasource_df.select(F.col('host')).where(F.col('id') == account_id)
  datasource_df.show()

But when I try to get the host value for each id, I am not getting any value.

2 Answers


You should put the where clause before the select clause; otherwise it always returns nothing, because after the select the column referenced in the where clause no longer exists.

datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))

Also, for this type of query it's better to do a join instead of collecting the dataframes and comparing them row by row.
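For example, a join-based version could look roughly like this (a sketch only; it reuses your read_data_from_table helper and the column names account_id, id and host from the question):

import pyspark.sql.functions as F

criteria_df = read_data_from_table(criteria_tbl)      # your existing helper
datasource_df = read_data_from_table(data_source_tbl)

# One inner join replaces the per-row loop: every id that matches an
# account_id contributes its host, all in a single distributed operation.
hosts_df = datasource_df.join(
    criteria_df,
    on=datasource_df['id'] == criteria_df['account_id'],
    how='inner'
).select(F.col('host'))

hosts_df.show()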


2 Comments

Okay! So after the join, I can easily iterate over the resulting dataframe.
@VishnuChaturvedi Iterating over DataFrames is generally slow. It's better to use Spark SQL functions to operate on DataFrames in parallel, which gives much better performance.

You can use a semi-join:

datasource_df.join(criteria_df, on=datasource_df['id'] == criteria_df['account_id'], how='left_semi')\
.select(F.col('host'))
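The left semi join acts as a filter: it keeps only the rows of datasource_df whose id has a matching account_id in criteria_df, and the result contains only the columns of the left DataFrame, so the final select('host') works without bringing in any columns from criteria_df.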

