
How do I create a column containing a JSON structure built from other columns of a PySpark DataFrame?

For example, I want to achieve the result below in a PySpark DataFrame. I can do this with a pandas DataFrame as follows, but how do I do the same with PySpark?

import pandas as pd

# Sample data
data = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state': ['VA', 'TN', 'MA']}
df = pd.DataFrame(data, columns=['Address', 'zip', 'state'])

# Columns to combine into a JSON string
lst = ['Address', 'zip']

# Serialize each row's selected columns to a JSON string
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis=1)

Expected output: a new_col column holding a per-row JSON string built from Address and zip, e.g. {"Address":"abc","zip":34567}.
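To make the per-row JSON concrete, here is a minimal standard-library sketch of the strings the pandas snippet produces (the sample rows are taken from the question; json.dumps with compact separators matches Series.to_json's formatting):

```python
import json

# Rows restricted to the columns in lst, as dicts
rows = [{"Address": "abc", "zip": 34567},
        {"Address": "dvf", "zip": 12345},
        {"Address": "bgh", "zip": 78905}]

# Compact separators reproduce the style of Series.to_json
new_col = [json.dumps(r, separators=(",", ":")) for r in rows]
# new_col[0] == '{"Address":"abc","zip":34567}'
```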

1 Answer

Assuming your PySpark DataFrame is named df, use the struct function to build a struct from the chosen columns, then use the to_json function to serialize it to a JSON string:

import pyspark.sql.functions as F

# Columns to combine into a JSON string
lst = ['Address', 'zip']

# Build a struct from only the selected columns and serialize it to JSON
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)

2 Comments

I have several other columns but want the JSON structure built only from Address and zip, so I can't use *.
I have updated the answer accordingly.
