
I have the input table below with array values in it. To get the expected output I tried array_contains, but wasn't able to get it to work.

The array indexes should match exactly between the two columns, i.e. the n-th element of col_List corresponds to the n-th element of col_ID.

Input:
+---------------+------------------+--------------------+
|ID             |col_List          |col_ID              |
+---------------+------------------+--------------------+
|10001          |["T1","T2]        |["100","200"]       |
+---------------+------------------+--------------------+
|10002          |["T2"]            |["300"]             |
+---------------+------------------+--------------------+
|10003          |["T1","T2","T3"]  |["150","220","580"  |
+---------------+------------------+--------------------+

Expected output:
+---------------+--------+--------+--------+
|ID             |T1      |T2      |T3      |
+---------------+--------+--------+--------+
|10001          |100     |200     |null    |
+---------------+--------+--------+--------+
|10002          |null    |300     |null    |
+---------------+--------+--------+--------+
|10003          |150     |220     |580     |
+---------------+--------+--------+--------+

2 Answers


We can use arrays_zip to create structs from the values of the two arrays: the first elements of both arrays form one struct, the second elements form another struct, and so on. Explode the resulting array of structs to get an individual row for each pair. We can then pivot on col_list and take the col_id values.

from pyspark.sql import functions as func

# zip the two arrays into an array of structs, explode it, then pivot
# (data_sdf is the input DataFrame from the question)
data_sdf. \
    withColumn('col_struct_arr', func.arrays_zip('col_list', 'col_id')). \
    withColumn('col_struct', func.explode('col_struct_arr')). \
    select('id', func.col('col_struct.*')). \
    groupBy('id'). \
    pivot('col_list'). \
    agg(func.first('col_id')). \
    show()

# +-----+----+---+----+
# |   id|  t1| t2|  t3|
# +-----+----+---+----+
# |10002|null|300|null|
# |10001| 100|200|null|
# |10003| 150|220| 580|
# +-----+----+---+----+
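
To see what the pivot operates on, here is the intermediate result after the zip and explode steps (a sketch, assuming data_sdf holds the sample data from the question; the exact column-name casing depends on how data_sdf was defined, since Spark resolves names case-insensitively by default):

data_sdf. \
    withColumn('col_struct', func.explode(func.arrays_zip('col_list', 'col_id'))). \
    select('id', func.col('col_struct.*')). \
    show()

# +-----+--------+------+
# |   id|col_list|col_id|
# +-----+--------+------+
# |10001|      T1|   100|
# |10001|      T2|   200|
# |10002|      T2|   300|
# |10003|      T1|   150|
# |10003|      T2|   220|
# |10003|      T3|   580|
# +-----+--------+------+

Each exploded row pairs one element of col_list with the element at the same index in col_id, which is exactly the index alignment the question asks for; the pivot then turns the distinct col_list values into columns.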

After arrays_zip, you can do an inline explosion (inline explodes an array of structs into one column per struct field). Then a pivot is needed to reshape the rows into the expected columns.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(10001, ["T1","T2"], ["100","200"]),
     (10002, ["T2"], ["300"]),
     (10003, ["T1","T2","T3"], ["150","220","580"])],
    ['ID', 'col_List', 'col_ID'])

Script:

df = df.selectExpr("ID", "inline(arrays_zip(col_List, col_ID))")
df = df.groupBy('ID').pivot('col_List').agg(F.first('col_ID'))

df.show()
# +-----+----+---+----+
# |ID   |T1  |T2 |T3  |
# +-----+----+---+----+
# |10001|100 |200|null|
# |10002|null|300|null|
# |10003|150 |220|580 |
# +-----+----+---+----+
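
For reference, the rows produced by the inline(arrays_zip(...)) step, before the pivot, look like this (the struct fields take their names from the zipped columns in recent Spark versions):

df.selectExpr("ID", "inline(arrays_zip(col_List, col_ID))").show()
# +-----+--------+------+
# |   ID|col_List|col_ID|
# +-----+--------+------+
# |10001|      T1|   100|
# |10001|      T2|   200|
# |10002|      T2|   300|
# |10003|      T1|   150|
# |10003|      T2|   220|
# |10003|      T3|   580|
# +-----+--------+------+

inline replaces the explode plus select('col_struct.*') combination from the first answer with a single expression.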
