
I have two different PySpark dataframes which need to be merged into one. There is some logic that needs to be coded for the merging. One of the dataframes has the following schema: (id, type, count), and the other one has the schema: (id, timestamp, test1, test2, test3).

The first dataframe is created via a SQL "group by" query. There can be duplicate ids, but the type will be different for each duplicate id, and there is an associated count for the given type.

In the final (merged) schema, there will be a separate column for each type's count. The count data comes from the first dataframe.

An example final schema: (id, timestamp, test1, test2, test3, type1count, type2count, type3count)

The way I am doing it now is by using two for loops to build a dictionary. I start with an empty schema and use the dictionary to populate it. Done this way, I am not really using Spark's features.

schema1: (id, type, count) -- type has the values type1, type2, type3
schema2: (id, timestamp, test1, test2, test3)
finalschema: (id, timestamp, test1, test2, test3, type1count, type2count, type3count)
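
As a concrete illustration of these schemas, here is a minimal sketch with made-up values (an existing spark session is assumed):

df1 = spark.createDataFrame([(1, 'type1', 5), (1, 'type2', 3)],
                            schema=['id', 'type', 'count'])

df2 = spark.createDataFrame([(1, '2020-01-01 00:00:00', 'a', 'b', 'c')],
                            schema=['id', 'timestamp', 'test1', 'test2', 'test3'])

# Desired merged row for id 1:
# (1, '2020-01-01 00:00:00', 'a', 'b', 'c', 5, 3, null)  <- type3count is null because id 1 has no type3 row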

Does anyone have any suggestion on how this can be improved?

Thanks much in advance.

2 Answers


You can use the PySpark pivot function to pivot the first dataframe before you join it with the second one.

Working example:

import pyspark.sql.functions as F

# First dataframe: (id, type, quantity)
df = spark.createDataFrame([[1, 'type1', 10],
                            [1, 'type2', 10],
                            [1, 'type3', 10]],
                           schema=['id', 'type', 'quantity'])

# Pivot the type values into columns, summing the quantity for each type
df = df.groupBy('id').pivot('type').sum('quantity')

# display() is a Databricks notebook helper; use df.show() in plain PySpark
display(df)

You can change the aggregation to whatever you need.
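
The join step mentioned above could then look like the sketch below; df2 and its sample row are assumptions based on the second schema in the question, and the pivoted columns keep the raw type names (type1, type2, type3) unless renamed:

df2 = spark.createDataFrame([[1, '2020-01-01 00:00:00', 'a', 'b', 'c']],
                            schema=['id', 'timestamp', 'test1', 'test2', 'test3'])

# df is the pivoted dataframe from the snippet above; a left join keeps every df2 row
final_df = df2.join(df, on='id', how='left')
final_df.show()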


2 Comments

Thank you so much, LeandroHumb. You saved my day! :) This is what I needed.
May I know if it would be possible to rename type1 to some other name during the pivoting process? I do not see any documentation for this on the Spark website.
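
One way to do this (not shown in the answer above; the target column names are taken from the question) is to rename the columns after the pivot:

# df here is the pivoted dataframe from the answer's snippet (columns: id, type1, type2, type3)
for t in ['type1', 'type2', 'type3']:
    df = df.withColumnRenamed(t, t + 'count')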

You can join the above two dataframes on the id column; below is a sample code snippet for the same:

df1 schema is (id, type, count).
df2 schema is (id, timestamp, test1, test2, test3, type1count, type2count, type3count)

# A left outer join keeps every row from df1 and adds the matching df2 columns
merged_df = df1.join(df2, on=['id'], how='left_outer')

Hope this will help.

2 Comments

Thank you, Ajay. In my case df2 has a different schema. I am sorry I was not clear in the question; I have edited it.
Yeah, but the id column is common, so you can join on that.
