
I have the input table below with array values in it. To get the expected output I tried array_contains, but wasn't able to get it to work.

The array indexes should match exactly between the two columns, i.e. the n-th element of col_List corresponds to the n-th element of col_ID.

Input:
+---------------+------------------+--------------------+
|ID             |col_List          |col_ID              |
+---------------+------------------+--------------------+
|10001          |["T1","T2]        |["100","200"]       |
+---------------+------------------+--------------------+
|10002          |["T2"]            |["300"]             |
+---------------+------------------+--------------------+
|10003          |["T1","T2","T3"]  |["150","220","580"  |
+---------------+------------------+--------------------+

Expected output:
+---------------+--------+--------+--------+
|ID             |T1      |T2      |T3      |
+---------------+--------+--------+--------+
|10001          |100     |200     |null    |
+---------------+--------+--------+--------+
|10002          |null    |300     |null    |
+---------------+--------+--------+--------+
|10003          |150     |220     |580     |
+---------------+--------+--------+--------+

2 Answers


We can use arrays_zip to create structs from the values of the two arrays: the first elements of both arrays form one struct, the second elements form another struct, and so on. Explode the resulting array of structs to get an individual row for each pair. We can then pivot on col_list and take the col_id values.

from pyspark.sql import functions as func

# zip the two arrays into an array of structs, explode it, then pivot
# (data_sdf is the input DataFrame from the question)
data_sdf. \
    withColumn('col_struct_arr', func.arrays_zip('col_list', 'col_id')). \
    withColumn('col_struct', func.explode('col_struct_arr')). \
    select('id', func.col('col_struct.*')). \
    groupBy('id'). \
    pivot('col_list'). \
    agg(func.first('col_id')). \
    show()

# +-----+----+---+----+
# |   id|  t1| t2|  t3|
# +-----+----+---+----+
# |10002|null|300|null|
# |10001| 100|200|null|
# |10003| 150|220| 580|
# +-----+----+---+----+
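
To see what the pivot operates on, here is the intermediate result after the zip and explode steps (a sketch, assuming data_sdf holds the sample data from the question; the exact column-name casing depends on how data_sdf was defined, since Spark resolves names case-insensitively by default):

data_sdf. \
    withColumn('col_struct', func.explode(func.arrays_zip('col_list', 'col_id'))). \
    select('id', func.col('col_struct.*')). \
    show()

# +-----+--------+------+
# |   id|col_list|col_id|
# +-----+--------+------+
# |10001|      T1|   100|
# |10001|      T2|   200|
# |10002|      T2|   300|
# |10003|      T1|   150|
# |10003|      T2|   220|
# |10003|      T3|   580|
# +-----+--------+------+

Each exploded row pairs one element of col_list with the element at the same index in col_id, which is exactly the index alignment the question asks for; the pivot then turns the distinct col_list values into columns.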

After arrays_zip, you can do an inline explosion (inline explodes an array of structs into one column per struct field). Then a pivot is needed to reshape the rows into the expected columns.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(10001, ["T1","T2"], ["100","200"]),
     (10002, ["T2"], ["300"]),
     (10003, ["T1","T2","T3"], ["150","220","580"])],
    ['ID', 'col_List', 'col_ID'])

Script:

df = df.selectExpr("ID", "inline(arrays_zip(col_List, col_ID))")
df = df.groupBy('ID').pivot('col_List').agg(F.first('col_ID'))

df.show()
# +-----+----+---+----+
# |ID   |T1  |T2 |T3  |
# +-----+----+---+----+
# |10001|100 |200|null|
# |10002|null|300|null|
# |10003|150 |220|580 |
# +-----+----+---+----+
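
For reference, the rows produced by the inline(arrays_zip(...)) step, before the pivot, look like this (the struct fields take their names from the zipped columns in recent Spark versions):

df.selectExpr("ID", "inline(arrays_zip(col_List, col_ID))").show()
# +-----+--------+------+
# |   ID|col_List|col_ID|
# +-----+--------+------+
# |10001|      T1|   100|
# |10001|      T2|   200|
# |10002|      T2|   300|
# |10003|      T1|   150|
# |10003|      T2|   220|
# |10003|      T3|   580|
# +-----+--------+------+

inline replaces the explode plus select('col_struct.*') combination from the first answer with a single expression.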
