
I have two dataframes, df and df1. In df some columns contain NULL, but df1 has non-null values for those columns. I just need to overwrite the rows where the NULL exists.

The df is below:

+------------+--------------------+-------+---------------+--------------------+----------+------------+
|          Id|                Name|Country|           City|             Address|  Latitude|   Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...|     US|         Dillon|         135 Main St|      null|        null|
| 42949672965|Comfort Inn Delan...|     US|         Deland|400 E Internation...| 29.054737|  -81.297208|
| 60129542147|Ubaa Old Crawford...|     US|    Des Plaines|     5460 N River Rd|      null|        null|

The df1 is below:

+-------------+--------------------+-------+------------+--------------------+----------+------------+
|           Id|                Name|Country|        City|             Address|  Latitude|   Longitude|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
|  42949672960|Americana Resort ...|     US|      Dillon|         135 Main St|39.6286685|-106.0451009|
|  60129542147|Ubaa Old Crawford...|     US| Des Plaines|     5460 N River Rd|42.0654049| -87.8916252|
+-------------+--------------------+-------+------------+--------------------+----------+------------+

I want this result:

+------------+--------------------+-------+---------------+--------------------+----------+------------+
|          Id|                Name|Country|           City|             Address|  Latitude|   Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...|     US|         Dillon|         135 Main St|39.6286685|-106.0451009|
| 42949672965|Comfort Inn Delan...|     US|         Deland|400 E Internation...| 29.054737|  -81.297208|
...
...
  • join then coalesce? Commented Oct 28, 2021 at 19:18
  • @pltc Could you share the implementation, please? Commented Oct 28, 2021 at 19:21

1 Answer

You can left join them, then use coalesce to pick the first non-null lat/lon. (An inner join would drop rows of df1 that have no match in df2, such as id 42949672965, so a left join is the safer choice here.)

df1
+-----------+---------+----------+
|         id|      lat|       lon|
+-----------+---------+----------+
|42949672960|     null|      null|
|42949672965|29.054737|-81.297208|
|60129542147|     null|      null|
+-----------+---------+----------+

df2
+-----------+----------+------------+
|         id|       lat|         lon|
+-----------+----------+------------+
|42949672960|39.6286685|-106.0451009|
|60129542147|42.0654049| -87.8916252|
+-----------+----------+------------+

Join them together:

from pyspark.sql import functions as F

(df1
    .join(df2, on=['id'], how='left')
    .select(
        F.col('id'),
        F.coalesce(df1['lat'], df2['lat']).alias('lat'),
        F.coalesce(df1['lon'], df2['lon']).alias('lon')
    )
    .show()
)
# +-----------+----------+------------+
# |         id|       lat|         lon|
# +-----------+----------+------------+
# |42949672965| 29.054737|  -81.297208|
# |60129542147|42.0654049| -87.8916252|
# |42949672960|39.6286685|-106.0451009|
# +-----------+----------+------------+