I wish to remove the last element of the array in each row of this DataFrame. We have this link demonstrating the same thing, but with UDFs, which I wish to avoid. Is there a simple way to do this, something like list[:2]?

data = [(['cat','dog','sheep'],),(['bus','truck','car'],),(['ice','pizza','pasta'],)]
df = sqlContext.createDataFrame(data,['data'])
df.show()
+-------------------+
|               data|
+-------------------+
|  [cat, dog, sheep]|
|  [bus, truck, car]|
|[ice, pizza, pasta]|
+-------------------+

Expected DataFrame:

+--------------+
|          data|
+--------------+
|    [cat, dog]|
|  [bus, truck]|
|  [ice, pizza]|
+--------------+
  • Are all the lists of the same size? Do you know that length ahead of time? Commented Dec 17, 2018 at 15:42
  • Yeah, they were all of size 3. If you have any method to achieve the result avoiding a UDF, kindly pen it down. Many thanks! Commented Dec 17, 2018 at 15:47

1 Answer

UDF is the best thing you can find for PySpark :)

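from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Get the first two elements.
# Declare the return type explicitly: without it, udf defaults to StringType
# and the column would no longer be an array.
split_row = udf(lambda row: row[:2], ArrayType(StringType()))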

# apply the udf to each row
new_df = df.withColumn("data", split_row(df["data"]))

new_df.show()
# Output

+------------+
|        data|
+------------+
|  [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+
5 Comments

I know how to do it with a UDF, but wanted to know how we can do that without using any UDF. UDFs cause immense overhead because of serialization when the DataFrame is very big, which is why I wanted to avoid them. Thanks for your efforts, much appreciated :)
There is nothing better than a UDF if you want to work on big loads and apply current operations you can't usually do ;)
Hi, if you check the execution plan, you can see the difference, especially on big loads. BTW, I haven't downvoted this answer.
Yup I know that changes the execution plan. If it doesn't change, it is much much slower. I don't see any "easy" way without UDF to do it tho
Yes, that's a fair comment. So, I suppose there is none. Though it doesn't answer my question, I will upvote it, as at this time there seems to be no better solution on the horizon. Many, many thanks, Sir.