
I have two Spark dataframes.

Dataframe 1:

Location    Date        Date_part   Sector      units
USA         7/1/2021    7/1/2021    Cars        200
IND         7/1/2021    7/1/2021    Scooters    180
COL         7/1/2021    7/1/2021    Trucks      100

Dataframe 2:

Location    Date    Brands  units   values
UK          null    brand1  400     120
AUS         null    brand2  450     230
CAN         null    brand3  150     34

After doing unionByName, I got:

Location    Date        Date_part   Sector      Brands  units   values
USA         7/1/2021    7/1/2021    Cars        null    200     null
IND         7/1/2021    7/1/2021    Scooters    null    180     null
COL         7/1/2021    7/1/2021    Trucks      null    100     null
UK          null        null        null        brand1  400     120
AUS         null        null        null        brand2  450     230
CAN         null        null        null        brand3  150     34

But my expected dataframe is:

Location    Date        Date_part   Sector      Brands  units   values
USA         7/1/2021    7/1/2021    Cars        null    200     null
IND         7/1/2021    7/1/2021    Scooters    null    180     null
COL         7/1/2021    7/1/2021    Trucks      null    100     null
UK          null        7/1/2021    null        brand1  400     120
AUS         null        7/1/2021    null        brand2  450     230
CAN         null        7/1/2021    null        brand3  150     34

I need the Date_part column to take dataframe 1's value for all rows. I tried this code:

df_result = df_final.select(df_1['date_part'], df_final["*"])

But this creates an extra date_part column. How do I achieve my expected dataframe?
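For reproducibility, here is a minimal sketch of how the two dataframes and the union can be built. df_1 and df_final are the names used in the code above; df_2 is just a placeholder name, and allowMissingColumns requires Spark 3.1+:

from pyspark.sql import SparkSession, types as T

spark = SparkSession.builder.getOrCreate()

df_1 = spark.createDataFrame(
    [("USA", "7/1/2021", "7/1/2021", "Cars", 200),
     ("IND", "7/1/2021", "7/1/2021", "Scooters", 180),
     ("COL", "7/1/2021", "7/1/2021", "Trucks", 100)],
    ["Location", "Date", "Date_part", "Sector", "units"],
)

# Date is all null here, so an explicit schema is needed (type inference would fail).
schema_2 = T.StructType([
    T.StructField("Location", T.StringType()),
    T.StructField("Date", T.StringType()),
    T.StructField("Brands", T.StringType()),
    T.StructField("units", T.LongType()),
    T.StructField("values", T.LongType()),
])
df_2 = spark.createDataFrame(
    [("UK", None, "brand1", 400, 120),
     ("AUS", None, "brand2", 450, 230),
     ("CAN", None, "brand3", 150, 34)],
    schema_2,
)

# Columns present in only one dataframe are added to the other and filled with null.
df_final = df_1.unionByName(df_2, allowMissingColumns=True)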

  • Is Date_part always the same? Commented Aug 19, 2021 at 10:39
  • Yes, it will be the same in dataframe 1. Commented Aug 19, 2021 at 10:40
  • This should resolve your query: stackoverflow.com/a/30045284/12843137 Commented Aug 19, 2021 at 10:59

1 Answer


Assuming Date_part is the same for the whole dataframe, there are several ways to do this. Here is one:

from pyspark.sql import functions as F

# Grab the constant Date_part value from any row where it is not null
# (i.e. a row that originally came from dataframe 1).
missing_date = df_result.where(F.col("Date_part").isNotNull()).first()["Date_part"]

# Overwrite the column with that literal value for every row.
df_result = df_result.withColumn("Date_part", F.lit(missing_date))
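Applied to the sample data in the question, this should backfill 7/1/2021 into Date_part for the rows that came from the second dataframe. A quick check (assuming df_result is the unioned dataframe):

# Sanity check: the UK, AUS and CAN rows should now show 7/1/2021 in Date_part.
df_result.select("Location", "Date_part", "Brands", "units").show()

Pulling the constant out with first() and writing it back with lit() avoids a join; it relies on Date_part really being the same for every row of dataframe 1, as confirmed in the comments.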
