Flatten data frame with array columns

Question

Suppose I have a PySpark data frame for which df.printSchema() prints:

root
 |-- shop_id: int (nullable = false)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id: int (nullable = false)

How can I convert it into a data frame with this schema?

root
 |-- shop_id: int (nullable = false)
 |-- item_id: int (nullable = false)

In other words, each shop_id should be paired with every item_id in its items array, and all of the resulting pairs should become rows of a single flat data frame.

A more visual explanation:

before

[
   {
      "shop_id":42,
      "items":[{"item_id":101}, {"item_id":102}]
   },
   {
      "shop_id":43,
      "items":[{"item_id":203}]
   }
]

after

[
   {"shop_id":42,"item_id":101},
   {"shop_id":42,"item_id":102},
   {"shop_id":43,"item_id":203}
]

Maxim Blumental · Accepted Answer · 2020-07-01 18:33:07Z

tl;dr

# F is pyspark.sql.functions (from pyspark.sql import functions as F)
df.select('shop_id', F.explode('items.item_id').alias('item_id'))

test

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType

schema = StructType([
    StructField('shop_id', IntegerType()),
    StructField('items', ArrayType(
        StructType([
            StructField('item_id', IntegerType()),
        ])
    ))
])

data = [
   {
      "shop_id":42,
      "items":[{"item_id":101}, {"item_id":102}]
   },
   {
      "shop_id":43,
      "items":[{"item_id":203}]
   }
]

# spark_session is an already created SparkSession
df = spark_session.createDataFrame(data, schema)

before

df.printSchema()

root
 |-- shop_id: integer (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_id: integer (nullable = true)

after

df = df.select('shop_id', F.explode('items.item_id').alias('item_id'))
df.printSchema()

root
 |-- shop_id: integer (nullable = true)
 |-- item_id: integer (nullable = true)

df.collect()

[Row(shop_id=42, item_id=101),
 Row(shop_id=42, item_id=102),
 Row(shop_id=43, item_id=203)]
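
For intuition, here is a rough plain-Python model of what the select with explode does to each row. It is only a sketch and not how Spark implements it: the function name explode_item_ids, the outer flag, and the extra shop 44 with an empty items array are all made up for illustration. It also models one behaviour worth knowing about: explode drops rows whose array is null or empty, while F.explode_outer keeps them with a null item_id.

```python
def explode_item_ids(rows, outer=False):
    """Plain-Python model of select('shop_id', explode('items.item_id')).

    Hypothetical helper for illustration only; it does not use Spark.
    """
    for row in rows:
        items = row.get("items")
        # 'items.item_id' pulls the item_id field out of every array element
        item_ids = None if items is None else [item["item_id"] for item in items]
        if not item_ids:
            # explode drops rows with a null or empty array;
            # explode_outer (outer=True here) keeps them with a null item_id
            if outer:
                yield {"shop_id": row["shop_id"], "item_id": None}
            continue
        # one output row per array element, each carrying the parent shop_id
        for item_id in item_ids:
            yield {"shop_id": row["shop_id"], "item_id": item_id}


data = [
    {"shop_id": 42, "items": [{"item_id": 101}, {"item_id": 102}]},
    {"shop_id": 43, "items": [{"item_id": 203}]},
    {"shop_id": 44, "items": []},  # hypothetical extra row, not in the question
]

flat = list(explode_item_ids(data))
flat_outer = list(explode_item_ids(data, outer=True))
```

With this input, flat contains the three pairs from the question and shop 44 is missing from it, whereas flat_outer has a fourth row for shop 44 with item_id set to None. If shops without items must survive the flattening in the real data frame, replacing F.explode with F.explode_outer in the select is probably the closest equivalent.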
