
How do I create a column containing a JSON structure built from other columns of a PySpark DataFrame?

For example, I want to achieve the result below in a PySpark DataFrame. I can do this with a pandas DataFrame as follows, but how do I do the same with PySpark?

import pandas as pd

# Sample data
data = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state': ['VA', 'TN', 'MA']}
df = pd.DataFrame(data, columns=['Address', 'zip', 'state'])

# Columns to combine into a JSON string
lst = ['Address', 'zip']

# Serialize each row's selected columns to a JSON string
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis=1)

Expected output: a new_col column holding a per-row JSON string built from Address and zip, e.g. {"Address":"abc","zip":34567}.
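To make the per-row JSON concrete, here is a minimal standard-library sketch of the strings the pandas snippet produces (the sample rows are taken from the question; json.dumps with compact separators matches Series.to_json's formatting):

```python
import json

# Rows restricted to the columns in lst, as dicts
rows = [{"Address": "abc", "zip": 34567},
        {"Address": "dvf", "zip": 12345},
        {"Address": "bgh", "zip": 78905}]

# Compact separators reproduce the style of Series.to_json
new_col = [json.dumps(r, separators=(",", ":")) for r in rows]
# new_col[0] == '{"Address":"abc","zip":34567}'
```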

1 Answer

Assuming your PySpark DataFrame is named df, use the struct function to build a struct from the chosen columns, then use the to_json function to serialize it to a JSON string:

import pyspark.sql.functions as F

# Columns to combine into a JSON string
lst = ['Address', 'zip']

# Build a struct from only the selected columns and serialize it to JSON
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)

2 Comments

I have several other columns but want the JSON structure built only from Address and zip, so I can't use *.
I have updated the answer accordingly.
