
I am trying to find a substring across all columns of my Spark DataFrame using PySpark. I currently know how to search for a substring in a single column using filter and contains:

df.filter(df.col_name.contains('substring'))

How do I extend this statement, or use another approach, to search multiple columns for substring matches?

2 Answers


You can generalize the statement to filter across all columns in one go:

from pyspark.sql.functions import col, when

# Null out any cell that does not contain the substring, then drop
# rows that contain a null.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c)
                for c in df.columns]).na.drop()
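
Note that na.drop() defaults to how='any', so this keeps only rows where every column contains the substring. A variant that keeps rows where at least one column matches (with the caveat that non-matching cells come back as null) would be:

# Keep rows where at least one column matched; unmatched cells stay null.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c)
                for c in df.columns]).na.drop(how='all')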

OR

You can simply loop over the columns and apply the same filter:

# Note: chaining filters like this ANDs the conditions, so a row is kept
# only if every column contains the substring.
for col in df.columns:
    df = df.filter(df[col].contains("substring"))
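
If you instead want rows where any column matches, one option (a sketch, assuming all columns are string-typed) is to build a single OR'd condition and filter once:

from functools import reduce
from pyspark.sql.functions import col

# OR the per-column conditions together and filter in a single pass.
any_match = reduce(lambda a, b: a | b,
                   [col(c).contains("substring") for c in df.columns])
df = df.filter(any_match)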


You can search through each column in turn, filling a second DataFrame and unioning the results, like this:

columns = ["language", "else"]
data = [
    ("Java", "Python"),
    ("Python", "100000"),
    ("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()  # the source DataFrame is scanned once per column, so cache it
df.show()

# Start from an empty DataFrame with the same schema, then union in the
# matching rows from each column's filter.
schema = df.schema
df2 = spark.createDataFrame(data=[], schema=schema)

for col in df.columns:
    df2 = df2.unionByName(df.filter(df[col].like("%Python%")))

df2.show()
+--------+------+
|language|  else|
+--------+------+
|  Python|100000|
|    Java|Python|
+--------+------+

The result contains the first two rows, because each of them has the value 'Python' in at least one column.
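
One caveat: a row that matches in more than one column will appear in the union once per matching column. If that matters, you can deduplicate the result:

# A row matching in several columns appears once per match; deduplicate.
df2 = df2.dropDuplicates()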

1 Comment

This one works for me! df = df.filter(df[col].contains("substring")) gives no results, even when changed to df = df.filter(df[col].like("%substring%")). What difference does unionByName make?
