
I'd like to create a new column that is a JSON representation of some other columns: a list of key/value pairs.

Source:

origin    destination  count
toronto   ottawa       5
montreal  vancouver    10

What I want:

origin    destination  count  json
toronto   ottawa       5      [{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}]
montreal  vancouver    10     [{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}]

(everything can be a string, doesn't matter).
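
For reference, here's a minimal sketch of how the sample DataFrame above could be built (assuming an existing SparkSession; the variable names are just for illustration):

from pyspark.sql import SparkSession

# Recreate the sample data from the question; count is kept as a string,
# since the question says everything can be a string
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("toronto", "ottawa", "5"), ("montreal", "vancouver", "10")],
    ["origin", "destination", "count"],
)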

I've tried something like:

from pyspark.sql.functions import col, struct, to_json

df.withColumn('json', to_json(struct(col('origin'), col('destination'), col('count'))))

But it creates the column with all the key:value pairs in one object:

{"origin":"United States","destination":"Romania"}

Is this possible without a UDF?

2 Answers


A way to hack around this:

import pyspark.sql.functions as F

# Serialize each column into its own one-field JSON object, collect the objects
# into an array, and cast the array to a single string column
df2 = df.withColumn(
    'json',
    F.array(
        F.to_json(F.struct('origin')),
        F.to_json(F.struct('destination')),
        F.to_json(F.struct('count'))
    ).cast('string')
)

df2.show(truncate=False)
+--------+-----------+-----+--------------------------------------------------------------------+
|origin  |destination|count|json                                                                |
+--------+-----------+-----+--------------------------------------------------------------------+
|toronto |ottawa     |5    |[{"origin":"toronto"}, {"destination":"ottawa"}, {"count":"5"}]     |
|montreal|vancouver  |10   |[{"origin":"montreal"}, {"destination":"vancouver"}, {"count":"10"}]|
+--------+-----------+-----+--------------------------------------------------------------------+
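
Note that the final .cast('string') is what collapses the array into a single string column (which is why the elements are joined with ", " in the output above). If you'd rather keep an actual array<string> column, e.g. to explode it later, you can simply drop the cast. A minimal sketch, with a hypothetical column name:

# Same idea, but keep the array of per-column JSON objects as array<string>
df3 = df.withColumn(
    'json_array',
    F.array(
        F.to_json(F.struct('origin')),
        F.to_json(F.struct('destination')),
        F.to_json(F.struct('count'))
    )
)
df3.printSchema()  # json_array: array of string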

3 Comments

Hi mck, thanks! One reason I want to avoid the UDF is the performance concerns I've been reading about (I have yet to try pyarrow...). Do you foresee any large computational expense with this approach (i.e. calling to_json multiple times) if I run this on a big dataframe?
@AlexanderWitte I don't think it'll be expensive. to_json looks like a trivial operation to me.
@AlexanderWitte as a rule of thumb, Spark SQL functions will be way better than UDFs (even with pyarrow)

Another way: build an array-of-maps column before calling to_json:

from pyspark.sql import functions as F

# Build a one-entry map {column_name: value} for every column, gather the maps
# into an array, and serialize the whole array to JSON in one call
df1 = df.withColumn(
    'json',
    F.to_json(F.array(*[F.create_map(F.lit(c), F.col(c)) for c in df.columns]))
)

df1.show(truncate=False)

#+--------+-----------+-----+------------------------------------------------------------------+
#|origin  |destination|count|json                                                              |
#+--------+-----------+-----+------------------------------------------------------------------+
#|toronto |ottawa     |5    |[{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}]     |
#|montreal|vancouver  |10   |[{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}]|
#+--------+-----------+-----+------------------------------------------------------------------+
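
Because this builds the list from df.columns, it works for any number of columns. If you only want a subset of columns in the JSON, you can iterate over an explicit list instead; a minimal sketch (the cols list and the df_subset name are hypothetical):

# Only include selected columns in the JSON array
cols = ['origin', 'destination']
df_subset = df.withColumn(
    'json',
    F.to_json(F.array(*[F.create_map(F.lit(c), F.col(c)) for c in cols]))
)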

