I have the following PySpark dataframe (first_df):
| id | cat | dog | bird |
|---|---|---|---|
| 0 | ["persan", "sphynx"] | [] | ["strisores"] |
| 1 | [] | ["bulldog"] | ["columbaves", "gruiformes"] |
| 2 | ["ragdoll"] | ["labrador"] | [] |
And I would like to explode multiple columns at once, recording each value's original column name in a new column, like this:
| id | animal | animal_type |
|---|---|---|
| 0 | persan | cat |
| 0 | sphynx | cat |
| 0 | strisores | bird |
| 1 | bulldog | dog |
| 1 | columbaves | bird |
| 1 | gruiformes | bird |
| 2 | ragdoll | cat |
| 2 | labrador | dog |
So far, my current solution is the following:
```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

animal_types = ['cat', 'dog', 'bird']

# Start from an empty dataframe with the target schema
df = spark.createDataFrame([], schema=StructType([
    StructField('id', StringType()),
    StructField('animal', StringType()),
    StructField('animal_type', StringType())
]))

# Explode each column separately and union the results
for animal_type in animal_types:
    df = first_df \
        .select('id', animal_type) \
        .withColumn('animal', F.explode(animal_type)) \
        .drop(animal_type) \
        .withColumn('animal_type', F.lit(animal_type)) \
        .union(df)
```
But I find this quite inefficient, particularly when running on a cluster, since it scans `first_df` once per column and unions the results.
Is there a better Spark way to accomplish this?