
I'd like to create a new column that is a JSON representation of some other columns: a list of key/value pairs.

Source:

origin    destination  count
toronto   ottawa       5
montreal  vancouver    10

What I want:

origin    destination  count  json
toronto   ottawa       5      [{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}]
montreal  vancouver    10     [{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}]

(everything can be a string, doesn't matter).
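
For reference, here's a minimal sketch of how the sample DataFrame above could be built (assuming an existing SparkSession; the variable names are just for illustration):

from pyspark.sql import SparkSession

# Recreate the sample data from the question; count is kept as a string,
# since the question says everything can be a string
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("toronto", "ottawa", "5"), ("montreal", "vancouver", "10")],
    ["origin", "destination", "count"],
)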

I've tried something like:

from pyspark.sql.functions import col, struct, to_json

df.withColumn('json', to_json(struct(col('origin'), col('destination'), col('count'))))

But it creates the column with all the key:value pairs in one object:

{"origin":"United States","destination":"Romania"}

Is this possible without a UDF?

2 Answers


A way to hack around this:

import pyspark.sql.functions as F

# Serialize each column into its own one-field JSON object, collect the objects
# into an array, and cast the array to a single string column
df2 = df.withColumn(
    'json',
    F.array(
        F.to_json(F.struct('origin')),
        F.to_json(F.struct('destination')),
        F.to_json(F.struct('count'))
    ).cast('string')
)

df2.show(truncate=False)
+--------+-----------+-----+--------------------------------------------------------------------+
|origin  |destination|count|json                                                                |
+--------+-----------+-----+--------------------------------------------------------------------+
|toronto |ottawa     |5    |[{"origin":"toronto"}, {"destination":"ottawa"}, {"count":"5"}]     |
|montreal|vancouver  |10   |[{"origin":"montreal"}, {"destination":"vancouver"}, {"count":"10"}]|
+--------+-----------+-----+--------------------------------------------------------------------+
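
Note that the final .cast('string') is what collapses the array into a single string column (which is why the elements are joined with ", " in the output above). If you'd rather keep an actual array<string> column, e.g. to explode it later, you can simply drop the cast. A minimal sketch, with a hypothetical column name:

# Same idea, but keep the array of per-column JSON objects as array<string>
df3 = df.withColumn(
    'json_array',
    F.array(
        F.to_json(F.struct('origin')),
        F.to_json(F.struct('destination')),
        F.to_json(F.struct('count'))
    )
)
df3.printSchema()  # json_array: array of string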

3 Comments

Hi mck, thanks! One reason I want to avoid the UDF is the performance concerns I've been reading about (I have yet to try pyarrow...). Do you foresee any large computational expense with this approach (i.e. calling to_json multiple times) if I run this on a big dataframe?
@AlexanderWitte I don't think it'll be expensive. to_json looks like a trivial operation to me.
@AlexanderWitte as a rule of thumb, Spark SQL functions will be way better than UDFs (even with pyarrow)

Another way: build an array-of-maps column before calling to_json:

from pyspark.sql import functions as F

# Build a one-entry map {column_name: value} for every column, gather the maps
# into an array, and serialize the whole array to JSON in one call
df1 = df.withColumn(
    'json',
    F.to_json(F.array(*[F.create_map(F.lit(c), F.col(c)) for c in df.columns]))
)

df1.show(truncate=False)

#+--------+-----------+-----+------------------------------------------------------------------+
#|origin  |destination|count|json                                                              |
#+--------+-----------+-----+------------------------------------------------------------------+
#|toronto |ottawa     |5    |[{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}]     |
#|montreal|vancouver  |10   |[{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}]|
#+--------+-----------+-----+------------------------------------------------------------------+
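
Because this builds the list from df.columns, it works for any number of columns. If you only want a subset of columns in the JSON, you can iterate over an explicit list instead; a minimal sketch (the cols list and the df_subset name are hypothetical):

# Only include selected columns in the JSON array
cols = ['origin', 'destination']
df_subset = df.withColumn(
    'json',
    F.to_json(F.array(*[F.create_map(F.lit(c), F.col(c)) for c in cols]))
)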

