How to split a column with comma separated values in PySpark's Dataframe?

Question

I have a PySpark DataFrame with a column that contains comma-separated values. Each row holds the same fixed number of values in that column (say 4). Example:

+----+----------------------+
|col1|                  col2|
+----+----------------------+
|   1|val1, val2, val3, val4|
|   2|val1, val2, val3, val4|
|   3|val1, val2, val3, val4|
|   4|val1, val2, val3, val4|
+----+----------------------+

Here I want to split col2 into 4 separate columns as shown below:

+----+-------+-------+-------+-------+
|col1|  col21|  col22|  col23|  col24|
+----+-------+-------+-------+-------+
|   1|   val1|   val2|   val3|   val4|
|   2|   val1|   val2|   val3|   val4|
|   3|   val1|   val2|   val3|   val4|
|   4|   val1|   val2|   val3|   val4|
+----+-------+-------+-------+-------+

How can this be done?

Possible duplicate of Split Spark Dataframe string column into multiple columns
– Florian, Commented Aug 3, 2018 at 11:44
I posted an answer on the linked duplicate that shows how to do this for the general case without using a udf or collect.
– pault, Commented Aug 3, 2018 at 21:31

Pierre Gourseaud · Accepted Answer · 2018-08-03 12:13:38Z

14

I would split the column and make each element of the array a new column.

from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
# createDataFrame accepts a plain list, so sc.parallelize is not needed.
rows = [[str(i), 'val1, val2, val3, val4'] for i in range(1, 5)]
df = spark.createDataFrame(rows, ["col1", "col2"])

# Split the string into an array column (the separator is a regex pattern).
df2 = df.select('col1', F.split('col2', ', ').alias('col2'))

# If you don't know the number of columns, take the largest array size:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]

# Turn each array element into its own column.
df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
>>>
+----+-------+-------+-------+-------+
|col1|col2[0]|col2[1]|col2[2]|col2[3]|
+----+-------+-------+-------+-------+
|   1|   val1|   val2|   val3|   val4|
|   2|   val1|   val2|   val3|   val4|
|   3|   val1|   val2|   val3|   val4|
|   4|   val1|   val2|   val3|   val4|
+----+-------+-------+-------+-------+

edited Aug 3, 2018 at 12:13

answered Aug 3, 2018 at 11:52


3 Comments

exAres Over a year ago

Yes! F.split() was the way to go!

deathrace Over a year ago

Is there any way to change the newly generated column names, e.g. to level1, level2, etc. instead of col1, col2?

deathrace Over a year ago

I am using this for now: df_res = df_result.toDF(*(c.replace('col2', 'level') for c in df_result.columns))
