I have two Spark DataFrames:
cities DataFrame with the following column:
city
-----
London
Austin
bigCities DataFrame with the following column:
name
------
London
Cairo
I need to transform DataFrame cities and add an additional Boolean column there: bigCity Value of this column must be calculated based on the following condition "cities.city IN bigCities.name"
I can do this in the following way(with a static bigCities collection):
cities.createOrReplaceTempView("cities")
var resultDf = spark.sql("SELECT city, CASE WHEN city IN ['London', 'Cairo'] THEN 'Y' ELSE 'N' END AS bigCity FROM cities")
but I don't know how to replace the static bigCities collection ['London', 'Cairo'] with bigCities DataFrame in the query. I want to use bigCities as the reference data in the query.
Please advise how to achieve this.
cities.cityandbigCities.namecolumn. If value coming from right table is null in the result, you can flag it withfalsewithwithColumnandwhenfunction on Spark.bigCities.namecolumn into a list. Afterwards, you can filter or flag yourcityDataset withisinfunction ofColumn.