
I have two different PySpark dataframes which need to be merged into one. There is some logic that needs to be coded for the merging. One of the dataframes has the following schema: (id, type, count), and the other one has the schema: (id, timestamp, test1, test2, test3).

The first dataframe is created via a SQL "group by" query. There can be duplicate ids, but the type will be different for each duplicate id, and there is an associated count for the given type.

In the final (merged) schema, there will be a separate column for each type's count. The count data comes from the first dataframe.

An example final schema: (id, timestamp, test1, test2, test3, type1count, type2count, type3count)

The way I am doing it now is by using two for loops to build a dictionary. I start with an empty schema and use the dictionary to populate it. Done this way, I am not really using Spark's features.

schema1: (id, type, count) -- type has the values type1, type2, type3
schema2: (id, timestamp, test1, test2, test3)
finalschema: (id, timestamp, test1, test2, test3, type1count, type2count, type3count)
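
As a concrete illustration of these schemas, here is a minimal sketch with made-up values (an existing spark session is assumed):

df1 = spark.createDataFrame([(1, 'type1', 5), (1, 'type2', 3)],
                            schema=['id', 'type', 'count'])

df2 = spark.createDataFrame([(1, '2020-01-01 00:00:00', 'a', 'b', 'c')],
                            schema=['id', 'timestamp', 'test1', 'test2', 'test3'])

# Desired merged row for id 1:
# (1, '2020-01-01 00:00:00', 'a', 'b', 'c', 5, 3, null)  <- type3count is null because id 1 has no type3 row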

Does anyone have any suggestion on how this can be improved?

Thanks much in advance.

2 Answers


You can use the PySpark pivot function to pivot the first dataframe before you join it with the second one.

Working example:

import pyspark.sql.functions as F

# First dataframe: (id, type, quantity)
df = spark.createDataFrame([[1, 'type1', 10],
                            [1, 'type2', 10],
                            [1, 'type3', 10]],
                           schema=['id', 'type', 'quantity'])

# Pivot the type values into columns, summing the quantity for each type
df = df.groupBy('id').pivot('type').sum('quantity')

# display() is a Databricks notebook helper; use df.show() in plain PySpark
display(df)

You can change the aggregation to whatever you need.
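
The join step mentioned above could then look like the sketch below; df2 and its sample row are assumptions based on the second schema in the question, and the pivoted columns keep the raw type names (type1, type2, type3) unless renamed:

df2 = spark.createDataFrame([[1, '2020-01-01 00:00:00', 'a', 'b', 'c']],
                            schema=['id', 'timestamp', 'test1', 'test2', 'test3'])

# df is the pivoted dataframe from the snippet above; a left join keeps every df2 row
final_df = df2.join(df, on='id', how='left')
final_df.show()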


2 Comments

Thank you so much, LeandroHumb. You saved my day! :) This is what I needed.
May I know if it would be possible to rename type1 to some other name during the pivoting process? I do not see any documentation for this on the Spark website.
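
One way to do this (not shown in the answer above; the target column names are taken from the question) is to rename the columns after the pivot:

# df here is the pivoted dataframe from the answer's snippet (columns: id, type1, type2, type3)
for t in ['type1', 'type2', 'type3']:
    df = df.withColumnRenamed(t, t + 'count')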

You can join the above two dataframes on the id column; below is a sample code snippet for the same:

df1 schema is (id, type, count).
df2 schema is (id, timestamp, test1, test2, test3, type1count, type2count, type3count)

# A left outer join keeps every row from df1 and adds the matching df2 columns
merged_df = df1.join(df2, on=['id'], how='left_outer')

Hope this will help.

2 Comments

Thank you, Ajay. In my case df2 has a different schema. I am sorry I was not clear in the question; I have edited it.
Yeah, but the id column is common, so you can join on that.
