
I have a PySpark DataFrame where one column contains lists, some with entries and some empty. I want to efficiently filter out all rows whose list in that column is empty.

import pyspark.sql.functions as sf
df.filter(sf.col('column_with_lists') != []) 

This returns the following error:

Py4JJavaError: An error occurred while calling o303.notEqual.
: java.lang.RuntimeException: Unsupported literal type class

Perhaps I could check the length of the list and require it to be > 0 (see here). However, I am unsure how that syntax works with pyspark-sql, and whether filter even accepts a lambda.

To be clear: I have multiple columns, but I want to apply the above filter to a single one, removing all rows where that column's list is empty. The linked SO example filters on a single-column DataFrame.

Thanks in advance!

1 Answer

So it appears it is as simple as using the size function from pyspark.sql.functions:

import pyspark.sql.functions as sf
df.filter(sf.size('column_with_lists') > 0)
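
For reference, a minimal runnable sketch (the toy DataFrame and the column name column_with_lists are made up for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame: an id column plus an array column, with one empty list
df = spark.createDataFrame(
    [(1, ['a', 'b']), (2, []), (3, ['c'])],
    ['id', 'column_with_lists'],
)

# Keep only rows whose array is non-empty; all other columns are preserved
df.filter(sf.size('column_with_lists') > 0).show()
# Rows 1 and 3 remain; row 2 (empty list) is dropped

One caveat worth knowing: in Spark's default configuration, size of a NULL array column returns -1, so this filter also drops rows where the column is null, not just rows with empty lists.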

2 Comments

What is sf here?
@derricw The pyspark.sql.functions module, imported above under the alias sf.
