
I am trying to filter based on the value of one column by searching in another column: given a value in one column, I want to verify that the same value is also present in the array stored in a different column.

I tried the following:

import pyspark.sql.functions

df = sc.parallelize([('v1', ['v1', 'v2', 'v3']), ('v4', ['v1', 'v2', 'v4'])]).toDF()
df.filter(pyspark.sql.functions.array_contains(df._2, df._1)).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/pyspark/sql/functions.py", line 1648, in array_contains
    return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1124, in __call__
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1088, in _build_args
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1075, in _get_args
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 512, in convert

TypeError: 'Column' object is not callable

What I am looking for is something similar to

df.filter(pyspark.sql.functions.array_contains(df._2, 'v4'))

but I don't want to use a static value; I want to use the value from column _1 instead.

1 Answer


You have to use an SQL expression for this:

df.filter("array_contains(_2, _1)").show()
+---+------------+
| _1|          _2|
+---+------------+
| v1|[v1, v2, v3]|
| v4|[v1, v2, v4]|
+---+------------+
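If you prefer the Python function API over a raw SQL string, the same predicate can be built with pyspark.sql.functions.expr. This is a minimal sketch assuming the DataFrame from the question, with the value in column _1 and the array in column _2:

import pyspark.sql.functions as F

# expr() parses the SQL fragment into a Column, so array_contains
# compares the array in _2 against the value in _1 on each row
df.filter(F.expr("array_contains(_2, _1)")).show()

Both forms are parsed into the same SQL predicate. In newer Spark releases the Python array_contains wrapper also accepts a Column as the value argument, so array_contains(df._2, df._1) may work directly there; the expression form above works on 2.x as well.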