I have the following PySpark dataframe (first_df):
| id | cat | dog | bird |
|---|---|---|---|
| 0 | ["persan", "sphynx"] | [] | ["strisores"] |
| 1 | [] | ["bulldog"] | ["columbaves", "gruiformes"] |
| 2 | ["ragdoll"] | ["labrador"] | [] |
And I would like to explode multiple columns at once, recording each value's original column name in a new column, like this:
| id | animal | animal_type |
|---|---|---|
| 0 | persan | cat |
| 0 | sphynx | cat |
| 0 | strisores | bird |
| 1 | bulldog | dog |
| 1 | columbaves | bird |
| 1 | gruiformes | bird |
| 2 | ragdoll | cat |
| 2 | labrador | dog |
So far, my current solution is the following:
```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

animal_types = ['cat', 'dog', 'bird']

# Start from an empty dataframe with the target schema
df = spark.createDataFrame([], schema=StructType([
    StructField('id', StringType()),
    StructField('animal', StringType()),
    StructField('animal_type', StringType())
]))

# Explode each column separately and union the results
for animal_type in animal_types:
    df = first_df \
        .select('id', animal_type) \
        .withColumn('animal', F.explode(animal_type)) \
        .drop(animal_type) \
        .withColumn('animal_type', F.lit(animal_type)) \
        .union(df)
```
But I find this quite inefficient, particularly when running on a cluster, since it scans `first_df` once per column and unions the results.
Is there a better Spark way to accomplish this?