
I have column values of the form {"1":"mediaMaaadadeftch||OAISAOID|true|ModsVersio|67900|clk|true|PPOOOS|20220501164113|34958|38177557..}

This is not JSON: some values are separated by a single pipe and some by a double pipe. How can we write a UDF that breaks this value apart and converts it into multiple columns, e.g.:

col_1|col_2|col_3|col_4|..
1|mediaMaaadadeftch|OAISAOID|true| ..
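
For reference, a minimal sketch of how such a value could be split without a UDF, using Spark's built-in split function. The DataFrame, the column name raw, the simplified sample value, and the fixed field count are all assumptions for illustration only:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical single-row DataFrame; the real column name is assumed to be "raw"
df = spark.createDataFrame([("1|mediaMaaadadeftch||OAISAOID|true",)], ["raw"])

# split on single pipes; a double pipe "||" shows up as an empty string at that position
parts = F.split(F.col("raw"), r"\|")

# assuming a fixed number of fields (5 for this sample), project each piece into its own column
df = df.select([parts.getItem(i).alias(f"col_{i + 1}") for i in range(5)])
df.show()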

1 Answer

Instead of writing a UDF, you can do this with Spark's CSV reader: store the data in a CSV file and load it with "|" as the separator.

>>> df1 = spark.read.load("/path_to/sample.csv",format="csv", sep="|")
>>> df1.show()
+--------------------+----+--------+----+----------+-----+---+----+------+--------------+-----+--------+
|                 _c0| _c1|     _c2| _c3|       _c4|  _c5|_c6| _c7|   _c8|           _c9| _c10|    _c11|
+--------------------+----+--------+----+----------+-----+---+----+------+--------------+-----+--------+
|"1":"mediaMaaadad...|null|OAISAOID|true|ModsVersio|67900|clk|true|PPOOOS|20220501164113|34958|38177557|
+--------------------+----+--------+----+----------+-----+---+----+------+--------------+-----+--------+

Columns corresponding to a double pipe "||" will be null, so if you don't need them you can drop them.
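
For example, the all-null column from the sample above could be dropped (its name _c1 is taken from the show() output; adjust if your data lands differently):

>>> df1 = df1.drop("_c1")   # _c1 is the all-null column produced by the "||" in this sample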


1 Comment

The data is very big; creating a CSV file is not possible.
