
I have a PySpark DataFrame where one column contains lists, some with entries and some empty. I want to efficiently filter out all rows whose list in that column is empty.

import pyspark.sql.functions as sf
df.filter(sf.col('column_with_lists') != []) 

This returns the following error:

Py4JJavaError: An error occurred while calling o303.notEqual.
: java.lang.RuntimeException: Unsupported literal type class

Perhaps I could check the length of the list and require it to be > 0 (see here). However, I am unsure how that syntax works with pyspark-sql, and whether filter even accepts a lambda.

To be clear: I have multiple columns, but I want to apply the above filter to a single one, removing all rows where that column's list is empty. The linked SO example filters on a single-column DataFrame.

Thanks in advance!

1 Answer

So it appears it is as simple as using the size function from pyspark.sql.functions:

import pyspark.sql.functions as sf
df.filter(sf.size('column_with_lists') > 0)
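
For reference, a minimal runnable sketch (the toy DataFrame and the column name column_with_lists are made up for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame: an id column plus an array column, with one empty list
df = spark.createDataFrame(
    [(1, ['a', 'b']), (2, []), (3, ['c'])],
    ['id', 'column_with_lists'],
)

# Keep only rows whose array is non-empty; all other columns are preserved
df.filter(sf.size('column_with_lists') > 0).show()
# Rows 1 and 3 remain; row 2 (empty list) is dropped

One caveat worth knowing: in Spark's default configuration, size of a NULL array column returns -1, so this filter also drops rows where the column is null, not just rows with empty lists.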

2 Comments

What is sf here?
@derricw The pyspark.sql.functions module, imported above under the alias sf.
