2

I split a column on underscores, and now I want to remove the first element of the resulting array. The value of that first element differs from row to row, so I can't remove it by matching a fixed value.

Column
abc1_food_1_3
abc2_drink_2_6
abc4_2

split(df.Column, '_').alias('Split_Column')

Split_Column
[abc1, food, 1, 3]
[abc2, drink, 2, 6]
[abc4, 2]

Now, how can I get:

Split_Column
[food, 1, 3]
[drink, 2, 6]
[2]

I will be converting the array column back to an underscore-delimited string afterwards (with concat_ws, I believe?).

4 Answers

4

This might be helpful:

df=df.withColumn("Split_Column_PROCESSED", F.expr("slice(Split_Column, 2, SIZE(Split_Column))"))

Its performance might be better. Here is a snippet that uses it:

>>> df.printSchema()
root
 |-- COLA: array (nullable = true)
 |    |-- element: long (containsNull = true)
>>> df.show()
+--------------------+
|                COLA|
+--------------------+
|        [1, 2, 4, 5]|
|[3, 57, 29, 34, 494]|
+--------------------+


>>> import pyspark.sql.functions as F
>>> df = df.withColumn("FINAL", F.expr("slice(COLA, 2, SIZE(COLA))"))
>>> df.show()
+--------------------+-----------------+
|                COLA|            FINAL|
+--------------------+-----------------+
|        [1, 2, 4, 5]|        [2, 4, 5]|
|[3, 57, 29, 34, 494]|[57, 29, 34, 494]|
+--------------------+-----------------+
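
For readers unfamiliar with `slice`, the following plain-Python sketch models how its arguments behave in the output above: the start position is 1-based, and asking for more elements than remain simply returns what is left. `spark_slice` is a hypothetical helper written for illustration, not part of PySpark, and it only covers positive start values.

```python
def spark_slice(arr, start, length):
    """Plain-Python model of Spark SQL slice(arr, start, length) for start >= 1."""
    if start < 1 or length < 0:
        raise ValueError("this sketch only models start >= 1 and length >= 0")
    begin = start - 1  # Spark positions are 1-based, Python's are 0-based
    return arr[begin:begin + length]


rows = [[1, 2, 4, 5], [3, 57, 29, 34, 494]]

# Mirrors slice(COLA, 2, SIZE(COLA)): start at the second element and
# take up to SIZE(COLA) elements, which is always enough to reach the end.
final = [spark_slice(r, 2, len(r)) for r in rows]
print(final)  # [[2, 4, 5], [57, 29, 34, 494]]
```

Passing the full array size as the length is a convenient upper bound, so the expression does not need to compute `SIZE(COLA) - 1`.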

2

Of course, after asking this I found a solution:

expr("filter(Split_Column, x -> not(x <=> Split_Column[0]))").alias('Split_Column')

Is there another way to do this, perhaps by combining array_remove and element_at?
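
One caveat worth checking: because the lambda compares values, it drops every element equal to the first one, not only the element at index 0. The plain-Python sketch below models that behaviour next to a purely positional removal. The sample row with a repeated prefix is made up for illustration, and both helper names are hypothetical.

```python
def drop_by_value(arr):
    """Models filter(arr, x -> not(x <=> arr[0])): removes all matches of arr[0]."""
    if not arr:
        return []
    first = arr[0]
    return [x for x in arr if x != first]


def drop_by_position(arr):
    """Removes only the element at index 0, whatever its value."""
    return arr[1:]


row = ["abc1", "food", "abc1", "3"]  # made-up row where the prefix repeats

by_value = drop_by_value(row)
by_position = drop_by_position(row)
print(by_value)     # ['food', '3']
print(by_position)  # ['food', 'abc1', '3']
```

For the three sample rows in the question both approaches give the same result, since no prefix repeats later in the array. An array_remove-based variant would share the caveat, because array_remove also removes by value.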


1

If you simply want to remove the string before the first underscore, you can do:

df.selectExpr('substring_index(Column, "_", -size(split(Column, "_")) + 1)')

Example:

df = spark.createDataFrame([['abc1_food_1_3'],['abc2_drink_2_6'],['abc4_2']]).toDF('Column')
df.show()
+--------------+
|        Column|
+--------------+
| abc1_food_1_3|
|abc2_drink_2_6|
|        abc4_2|
+--------------+

df = df.selectExpr('substring_index(Column, "_", -size(split(Column, "_"))+1) as trimmed')
df.show()
+---------+
|  trimmed|
+---------+
| food_1_3|
|drink_2_6|
|        2|
+---------+
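
To see why the count `-size(split(Column, "_")) + 1` works, here is a plain-Python sketch that models `substring_index` with a negative count: it keeps everything to the right of the count-th delimiter counted from the end. The `substring_index` function below is a hypothetical stand-in written for illustration, not the Spark implementation.

```python
def substring_index(s, delim, count):
    """Plain-Python model of Spark SQL substring_index(s, delim, count)."""
    if count == 0:
        return ""
    parts = s.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter from the start.
        return delim.join(parts[:count])
    # Everything to the right of the count-th delimiter from the end.
    return delim.join(parts[count:])


values = ["abc1_food_1_3", "abc2_drink_2_6", "abc4_2"]
trimmed = []
for value in values:
    n_parts = len(value.split("_"))  # plays the role of size(split(Column, "_"))
    # Keeping the last n_parts - 1 pieces drops exactly the first piece.
    trimmed.append(substring_index(value, "_", -n_parts + 1))

print(trimmed)  # ['food_1_3', 'drink_2_6', '2']
```

In this model a value with no underscore yields a count of 0 and therefore an empty string, which is worth keeping in mind if such rows can occur.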

1 Comment

Thank you, mck! This solution lets me split, remove the [0] index, and concat back to a string column all in one step.

1

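You can also try the code below. Here i is the 0-based index of each element; the index argument in the filter lambda needs Spark 3.0 or later.

expr("filter(Split_Column, (x, i) -> i != 0)").alias("Split_Column")  # i is the index of each array element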
