
I have a DataFrame with a single column which is an array of structs

df.printSchema()
root
 |-- dataCells: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- label: string (nullable = true)
 |    |    |-- value: string (nullable = true)

Some sample data might look like this:

df.first()
Row(dataCells=[Row(label="firstName", value="John"), Row(label="lastName", value="Doe"), Row(label="Date", value="1/29/2018")])

I'm trying to figure out how to reformat this DataFrame by turning each struct into a named column. I want to have a DataFrame like this:

+-----------+----------+-----------+
| firstName | lastName | Date      |
+-----------+----------+-----------+
| John      | Doe      | 1/29/2018 |
| ...       | ...      | ...       |
+-----------+----------+-----------+

I've tried everything I can think of but haven't figured this out.


2 Answers


Just explode and select *:

from pyspark.sql import Row
from pyspark.sql.functions import explode, first, col, monotonically_increasing_id

df = spark.createDataFrame([
  Row(dataCells=[Row(label="firstName", value="John"), Row(label="lastName", value="Doe"), Row(label="Date", value="1/29/2018")])
])

long = (df
   .withColumn("id", monotonically_increasing_id())   # synthetic row id to group on later
   .select("id", explode("dataCells").alias("col"))   # one output row per struct
   .select("id", "col.*"))                            # flatten the struct fields into columns

and pivot:

long.groupBy("id").pivot("label").agg(first("value")).show()
# +-----------+---------+---------+--------+                                      
# |         id|     Date|firstName|lastName|
# +-----------+---------+---------+--------+
# |25769803776|1/29/2018|     John|     Doe|
# +-----------+---------+---------+--------+
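The synthetic id column survives the pivot; if you don't need it, drop it afterwards (a small addition to the above, not in the original answer):

long.groupBy("id").pivot("label").agg(first("value")).drop("id").show()
# +---------+---------+--------+
# |     Date|firstName|lastName|
# +---------+---------+--------+
# |1/29/2018|     John|     Doe|
# +---------+---------+--------+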

You can also:

from pyspark.sql.functions import udf, col

@udf("map<string,string>")
def as_map(x):
    return dict(x)

cols = [col("dataCells")[c].alias(c) for c in ["Date", "firstName", "lastName"]]
df.select(as_map("dataCells").alias("dataCells")).select(cols).show()

# +---------+---------+--------+
# |     Date|firstName|lastName|
# +---------+---------+--------+
# |1/29/2018|     John|     Doe|
# +---------+---------+--------+
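On Spark 2.4+ you can build the same map without a UDF using map_from_entries, which takes an array of two-field structs and treats the fields positionally as key and value (a sketch, not part of the original answer):

from pyspark.sql.functions import map_from_entries, col

# turn the (label, value) structs into a map<string,string>, then pull out named keys
mapped = df.select(map_from_entries("dataCells").alias("dataCells"))
mapped.select([col("dataCells")[c].alias(c) for c in ["Date", "firstName", "lastName"]]).show()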


3 Comments

Great answer. I ran into the following error "The pivot column label has more than 10000 distinct values", which gave me some concerns about the performance of this approach in the long run.
It is a concern. Spark doesn't handle wide data well. In that case I'd recommend the second solution - both explode and pivot are on the expensive side.
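One mitigation, my suggestion rather than the commenter's: pass the expected labels to pivot explicitly. Spark then skips the distinct-value scan entirely, which also sidesteps the spark.sql.pivotMaxValues limit (10000 by default):

long.groupBy("id").pivot("label", ["firstName", "lastName", "Date"]).agg(first("value")).show()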
This works perfectly, with a slight caveat: if a record holds an empty array and you explode it, the row is eliminated altogether, which is a problem if you want to preserve empties. I suggest using explode_outer instead; after pivoting, the result will have a null column, which you can subsequently drop.
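A sketch of that suggestion (explode_outer exists in Spark 2.2+; the null label typically pivots into a column named "null", which is what gets dropped here):

from pyspark.sql.functions import explode_outer, first, monotonically_increasing_id

long = (df
   .withColumn("id", monotonically_increasing_id())
   .select("id", explode_outer("dataCells").alias("col"))  # keeps rows with empty/null arrays
   .select("id", "col.*"))
long.groupBy("id").pivot("label").agg(first("value")).drop("null").show()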

An alternate approach I tried, without a UDF:

>>> df.show()
+--------------------+
|           dataCells|
+--------------------+
|[[firstName,John]...|
+--------------------+

>>> from pyspark.sql import functions as F

## size of the longest array in the column
>>> arr_len = df.select(F.max(F.size('dataCells')).alias('len')).first().len

## get values from struct 
>>> df1 = df.select([df.dataCells[i].value for i in range(arr_len)])
>>> df1.show()
+------------------+------------------+------------------+
|dataCells[0].value|dataCells[1].value|dataCells[2].value|
+------------------+------------------+------------------+
|              John|               Doe|         1/29/2018|
+------------------+------------------+------------------+

>>> oldcols = df1.columns

## get the labels from struct
>>> cols = df.select([df.dataCells[i].label.alias('col_%s'%i) for i in range(arr_len)]).dropna().first()
>>> cols
Row(dataCells[0].label=u'firstName', dataCells[1].label=u'lastName', dataCells[2].label=u'Date')
>>> newcols = [cols[i] for i in range(arr_len)]
>>> newcols
[u'firstName', u'lastName', u'Date']

## use the labels to rename the columns
>>> from functools import reduce  # required on Python 3; built in on Python 2
>>> df2 = reduce(lambda data, idx: data.withColumnRenamed(oldcols[idx], newcols[idx]), range(len(oldcols)), df1)
>>> df2.show()
+---------+--------+---------+
|firstName|lastName|     Date|
+---------+--------+---------+
|     John|     Doe|1/29/2018|
+---------+--------+---------+

1 Comment

Note that this will work only if all rows contain the same set of tuples in the same order. That's a risky assumption.
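If you use this approach anyway, a cheap sanity check can confirm that assumption before renaming (my addition, reusing arr_len from the answer; any count other than 1 means rows disagree on labels or their order):

>>> df.select([df.dataCells[i].label.alias('c%s' % i) for i in range(arr_len)]).distinct().count()
1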
