Split column of list into multiple columns in the same PySpark dataframe

Question

I have the following dataframe, which contains two columns:

The first column holds column names.

The second column holds a list of values.

+--------------------+--------------------+
|              Column|            Quantile|
+--------------------+--------------------+
|                rent|[4000.0, 4500.0, ...|
|     is_rent_changed|[0.0, 0.0, 0.0, 0...|
|               phone|[7.022372888E9, 7...|
|          Area_house|[1000.0, 1000.0, ...|
|       bedroom_count|[1.0, 1.0, 1.0, 1...|
|      bathroom_count|[1.0, 1.0, 1.0, 1...|
|    maintenance_cost|[0.0, 0.0, 0.0, 0...|
|            latitude|[12.8217605, 12.8...|
|            Max_rent|[9000.0, 10000.0,...|
|                Beds|[2.0, 2.0, 2.0, 2...|
|                Area|[1000.0, 1000.0, ...|
|            Avg_Rent|[3500.0, 4000.0, ...|
|      deposit_amount|[0.0, 0.0, 0.0, 0...|
|          commission|[0.0, 0.0, 0.0, 0...|
|        monthly_rent|[0.0, 0.0, 0.0, 0...|
|is_min_rent_guara...|[0.0, 0.0, 0.0, 0...|
|min_guarantee_amount|[0.0, 0.0, 0.0, 0...|
|min_guarantee_dur...|[1.0, 1.0, 1.0, 1...|
|        furnish_cost|[0.0, 0.0, 0.0, 0...|
|  owner_furnish_part|[0.0, 0.0, 0.0, 0...|
+--------------------+--------------------+

How do I split the second column into multiple columns while keeping them in the same dataframe?

I can access the values using:

univar_df10.select("Column", univar_df10.Quantile[0],univar_df10.Quantile[1],univar_df10.Quantile[2]).show()

+--------------------+-------------+-------------+------------+
|              Column|  Quantile[0]|  Quantile[1]| Quantile[2]|
+--------------------+-------------+-------------+------------+
|                rent|       4000.0|       4500.0|      5000.0|
|     is_rent_changed|          0.0|          0.0|         0.0|
|               phone|7.022372888E9|7.042022842E9|7.07333021E9|
|          Area_house|       1000.0|       1000.0|      1000.0|
|       bedroom_count|          1.0|          1.0|         1.0|
|      bathroom_count|          1.0|          1.0|         1.0|
|    maintenance_cost|          0.0|          0.0|         0.0|
|            latitude|   12.8217605|   12.8490502|   12.863517|
|            Max_rent|       9000.0|      10000.0|     11500.0|
|                Beds|          2.0|          2.0|         2.0|
|                Area|       1000.0|       1000.0|      1000.0|
|            Avg_Rent|       3500.0|       4000.0|      4125.0|
|      deposit_amount|          0.0|          0.0|         0.0|
|          commission|          0.0|          0.0|         0.0|
|        monthly_rent|          0.0|          0.0|         0.0|
|is_min_rent_guara...|          0.0|          0.0|         0.0|
|min_guarantee_amount|          0.0|          0.0|         0.0|
|min_guarantee_dur...|          1.0|          1.0|         1.0|
|        furnish_cost|          0.0|          0.0|         0.0|
|  owner_furnish_part|          0.0|          0.0|         0.0|
+--------------------+-------------+-------------+------------+
only showing top 20 rows

I want a new dataframe in which the second column of lists is split into multiple columns, like the output above. Thanks in advance.

What is the question? You seem to already have what you're looking for: new_df = univar_df10.select("Column", univar_df10.Quantile[0], univar_df10.Quantile[1], univar_df10.Quantile[2])
– pault, Commented Apr 4, 2018 at 14:59

desertnaut · Accepted Answer · 2018-04-05 10:21:23Z

10

Assuming (your question is flagged for closure as unclear what you're asking) that your issue is that the lists in your Quantile column are long, so it is not convenient to build the respective command by hand, here is a solution that passes list concatenation and a list comprehension as the argument to select:

spark.version
# u'2.2.1'

# make some toy data
from pyspark.sql import Row
df = spark.createDataFrame([Row([0,45,63,0,0,0,0]),
                            Row([0,0,0,85,0,69,0]),
                            Row([0,89,56,0,0,0,0])],
                            ['features'])

df.show()
# result:
+-----------------------+
|features               |
+-----------------------+
|[0, 45, 63, 0, 0, 0, 0]|
|[0, 0, 0, 85, 0, 69, 0]|
|[0, 89, 56, 0, 0, 0, 0]|
+-----------------------+

# get the length of your lists, if you don't know it already (here it is 7):
length = len(df.select('features').take(1)[0][0])
length
# 7

df.select([df.features] + [df.features[i] for i in range(length)]).show()
# result:
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|            features|features[0]|features[1]|features[2]|features[3]|features[4]|features[5]|features[6]|  
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|[0, 45, 63, 0, 0,...|          0|         45|         63|          0|          0|          0|          0| 
|[0, 0, 0, 85, 0, ...|          0|          0|          0|         85|          0|         69|          0|
|[0, 89, 56, 0, 0,...|          0|         89|         56|          0|          0|          0|          0|
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

So, in your case,

univar_df10.select([univar_df10.Column] + [univar_df10.Quantile[i] for i in range(length)])

should do the job, after you have calculated length as

length = len(univar_df10.select('Quantile').take(1)[0][0])
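The generated columns above keep names such as Quantile[0]. If more readable names are wanted, the same idea can be sketched with SQL expression strings passed to selectExpr. This is only a minimal sketch under assumptions: the helper names array_element_exprs and split_array_column and the q prefix are hypothetical, not part of PySpark, and the length is still taken from the first row as above, so every list is assumed to have the same length.

```python
def array_element_exprs(array_col, length, prefix):
    """Build one SQL expression per array element, e.g. `Quantile`[0] AS `q_0`.

    Identifiers are wrapped in backticks so that names which collide with
    SQL keywords are still parsed as column names.
    """
    return [f"`{array_col}`[{i}] AS `{prefix}_{i}`" for i in range(length)]


def split_array_column(df, key_col, array_col, length, prefix):
    """Select key_col plus one aliased column per array element.

    df is expected to be a PySpark DataFrame; only its selectExpr method
    is used here.
    """
    return df.selectExpr(f"`{key_col}`", *array_element_exprs(array_col, length, prefix))


# The expressions can be inspected without a Spark session:
print(array_element_exprs("Quantile", 3, "q"))
# ['`Quantile`[0] AS `q_0`', '`Quantile`[1] AS `q_1`', '`Quantile`[2] AS `q_2`']
```

With the dataframe from the question, the call would presumably look like split_array_column(univar_df10, "Column", "Quantile", length, "q"), with length computed as shown above.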

edited Apr 5, 2018 at 10:21

answered Apr 4, 2018 at 17:10

desertnaut

3 Comments

jxn Over a year ago

how to do this in scala spark?

desertnaut Over a year ago

@jxn sorry, no idea on the Scala details

Deepak Over a year ago

Hi @jxn, in Scala we can achieve this too. I am using for with yield to achieve this. Check my answer, hope it helps.

Deepak · Accepted Answer · 2020-09-10 12:59:30Z

2

Here is a sketch of how to do it in Scala:

import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.col

// Names for the new columns.
val quantileColumns = Seq("quantile1", "quantile2", "quantile3")

// Number of columns to extract.
val numberOfColumns = quantileColumns.size

// Build one column expression per element. split() assumes Quantile is a
// comma-separated string; if it is already an array column, use
// col("Quantile").getItem(i) directly instead.
val columnList = for (i <- 0 until numberOfColumns)
  yield split(col("Quantile"), ",").getItem(i).alias(quantileColumns(i))

// Select the new columns.
df.select(columnList: _*)

// To add or drop other columns, use withColumn and drop on df.

edited Sep 10, 2020 at 12:59

answered Sep 10, 2020 at 8:58

Deepak

4 Comments

Deepak Over a year ago

Please use the imports below: import org.apache.spark.sql.functions.split and import org.apache.spark.sql.functions.col

desertnaut Over a year ago

Please do not use the comments for adding material - edit & update your post instead. Also, kindly avoid answering follow-up questions from the comments - the present thread is clearly about pyspark

Deepak Over a year ago

Cool, I'll make a note of it.

desertnaut Over a year ago

Please add the imports to the answer!
