Pyspark remove first element of array

Question

I split a column on underscores, and now I want to remove the first element of the resulting array. The first element differs from row to row, so I can't remove it by matching a specific value.

Column
abc1_food_1_3
abc2_drink_2_6
abc4_2

split(df.Column, '_').alias('Split_Column')

Split_Column
[abc1, food, 1, 3]
[abc2, drink, 2, 6]
[abc4, 2]

Now, how can I get this:

Split_Column
[food, 1, 3]
[drink, 2, 6]
[2]

I will convert the array column back to an underscore-delimited string afterwards (with concat_ws, I believe).

Aditya Vikram Singh · Accepted Answer · 2020-12-01 07:00:54Z

This might be helpful:

df=df.withColumn("Split_Column_PROCESSED", F.expr("slice(Split_Column, 2, SIZE(Split_Column))"))

Here is a snippet that uses it.

Its performance might also be better.

>>> df.printSchema()
root
 |-- COLA: array (nullable = true)
 |    |-- element: long (containsNull = true)
>>> df.show()
+--------------------+
|                COLA|
+--------------------+
|        [1, 2, 4, 5]|
|[3, 57, 29, 34, 494]|
+--------------------+


import pyspark.sql.functions as F

>>> df=df.withColumn("FINAL", F.expr("slice(COLA, 2, SIZE(COLA))"))
>>> df.show()
+--------------------+-----------------+
|                COLA|            FINAL|
+--------------------+-----------------+
|        [1, 2, 4, 5]|        [2, 4, 5]|
|[3, 57, 29, 34, 494]|[57, 29, 34, 494]|
+--------------------+-----------------+
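
Since the question ends with joining the array back into a string, the slice expression above can be chained with split and concat_ws in a single expression. A minimal sketch, assuming the input column is named Column and holds non-null strings. The names drop_first_token and add_trimmed_column are hypothetical helpers, and the plain-Python function only models the expected per-row result so it can be checked without a Spark session.

```python
def drop_first_token(value, sep="_"):
    """Plain-Python model of the per-row result (assumes a non-null string)."""
    parts = value.split(sep)   # split(Column, '_')
    rest = parts[1:]           # slice(parts, 2, size(parts)); Spark arrays are 1-based
    return sep.join(rest)      # concat_ws('_', rest)


def add_trimmed_column(df, source_col="Column", out_col="trimmed"):
    """The same three steps as one Spark SQL expression (hypothetical helper)."""
    # Imported lazily so the model above can be used without Spark installed.
    from pyspark.sql import functions as F

    parts = f"split({source_col}, '_')"
    expr = f"concat_ws('_', slice({parts}, 2, size({parts})))"
    return df.withColumn(out_col, F.expr(expr))


for s in ["abc1_food_1_3", "abc2_drink_2_6", "abc4_2"]:
    print(s, "->", drop_first_token(s))
```

Using F.expr keeps the sketch independent of whether the installed PySpark version lets F.slice take a column for its length argument.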

pdangelo4 · Accepted Answer · 2020-12-01 05:55:38Z

Of course after asking this I found a solution:

expr("filter(Split_Column, x -> not(x <=> Split_Column[0]))").alias('Split_Column')

Is there another way to do this, perhaps by combining array_remove and element_at?
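
One caveat with the value-based filter above: it drops every element equal to the first one, not only the element at position 0, so a row like [1, x, 1] would become [x]. As far as I can tell, array_remove(Split_Column, Split_Column[0]) would behave the same way, since array_remove removes all elements equal to the given value. A plain-Python model of the difference (the helper names are hypothetical, not Spark functions):

```python
def remove_by_value(parts):
    """Models filter(arr, x -> not(x <=> arr[0])): drops ALL elements equal to the first."""
    if not parts:
        return []
    first = parts[0]
    return [x for x in parts if x != first]


def remove_by_position(parts):
    """Models slice(arr, 2, size(arr)): drops only the element at the first position."""
    return parts[1:]


rows = [["abc1", "food", "1", "3"], ["1", "x", "1"]]
for parts in rows:
    print(parts, remove_by_value(parts), remove_by_position(parts))
```

For the sample data in the question the two give the same result, because the first token never repeats later in the row.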

mck · Accepted Answer · 2020-12-01 07:17:23Z

If you simply want to remove the string before the first underscore, you can do:

df.selectExpr('substring_index(Column, "_", -size(split(Column, "_")) + 1)')

Example:

df = spark.createDataFrame([['abc1_food_1_3'],['abc2_drink_2_6'],['abc4_2']]).toDF('Column')
df.show()
+--------------+
|        Column|
+--------------+
| abc1_food_1_3|
|abc2_drink_2_6|
|        abc4_2|
+--------------+

df = df.selectExpr('substring_index(Column, "_", -size(split(Column, "_"))+1) as trimmed')
df.show()
+---------+
|  trimmed|
+---------+
| food_1_3|
|drink_2_6|
|        2|
+---------+
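
Why the count is -size(split(Column, "_")) + 1: a string with n pieces contains n - 1 underscores, and a negative count of -(n - 1) tells substring_index to keep everything to the right of the (n - 1)-th delimiter counted from the end, which is the first underscore. A plain-Python model of that arithmetic, as a sketch. The substring_index function here is a hypothetical re-implementation for illustration, not the Spark function itself.

```python
def substring_index(s, delim, count):
    """Illustrative model: keep text before (count > 0) or after (count < 0) the count-th delimiter."""
    if count == 0 or delim == "":
        return ""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # left of the count-th delimiter from the left
    return delim.join(parts[count:])       # right of the |count|-th delimiter from the right


def trim_first_token(s, delim="_"):
    n = len(s.split(delim))                # size(split(Column, "_"))
    return substring_index(s, delim, -n + 1)


for s in ["abc1_food_1_3", "abc2_drink_2_6", "abc4_2"]:
    print(s, "->", trim_first_token(s))
```

A value with no underscore at all gives a count of 0 and, in this model, an empty string, which I believe matches what Spark returns for a count of 0.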

1 Comment

pdangelo4 Over a year ago

Thank you, mck! This solution lets me split, remove the [0] index, and concat back to a string column all in one step.

s.polam · Accepted Answer · 2020-12-01 07:57:13Z

You can also try the code below.

expr("filter(Split_Column, (x, i) -> i != 0)").alias("Split_Column")  # i is the index of the element in the array
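
The two-argument lambda (element, index) in filter is, to my knowledge, only available from Spark 3.0 onward, and the index is 0-based. Unlike the value-based filter, it removes only position 0, even when a later element repeats the first value. A sketch with hypothetical helper names, where the plain-Python function models the result with enumerate:

```python
def drop_index_zero(parts):
    """Plain-Python model of filter(arr, (x, i) -> i != 0)."""
    return [x for i, x in enumerate(parts) if i != 0]


def with_first_removed(df, array_col="Split_Column"):
    """Applies the indexed filter to an array column (hypothetical helper)."""
    # Imported lazily so the model above can be used without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(array_col, F.expr(f"filter({array_col}, (x, i) -> i != 0)"))


print(drop_index_zero(["abc1", "food", "1", "3"]))
print(drop_index_zero(["1", "x", "1"]))
```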
