How to create a Boolean column if a PySpark column string is found in a list of strings?

Question

I have a Spark DataFrame with a column of strings. The strings name beverages but can also include amounts, volumes, and so on. The formatting is inconsistent, so a regular expression can help clean it up but cannot fully resolve it. As a workaround, I was hoping to check whether the column string is found in a list and generate a new Boolean column from the result, but I am not sure of the best way to do so.

I tried using case-when logic but that did not work.

I would prefer contains because it allows partial matches, whereas isin requires an exact match.

import pandas as pd

data = [
    [1, "SODA"],
    [2, "JUICE 1L"],
    [3, "WATER 64OZ"],
    [4, "HOT TEA"],
]

df = pd.DataFrame(data, columns=["ID", "Beverage"])

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]
sdf = spark.createDataFrame(df)

Does anyone know the best way to do this?

vilalabinot · Accepted Answer · 2022-08-12 18:41:57Z

1

Assume this is your drinks array (or list):

val drinks = Array("SODA", "WATER", "COFFEE", "TEA", "JUICE")

We can convert this to a regular expression so that we can pass it to the rlike API:

val regex = drinks.map(x => "(" + x + ")").mkString("|").toLowerCase()

which gives (soda)|(water)|(coffee)|(tea)|(juice).

Then, we can apply this to the dataset, like:

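// rlike is a Column method; needs import org.apache.spark.sql.functions.{col, lower}
val result = df.withColumn("is_within", lower(col("Beverage")).rlike(regex))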

PS: I convert both the column and the list to lowercase so that the comparison does not depend on case.
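Since the question uses PySpark, the same idea might look roughly like the following in Python. This is a sketch rather than the answerer's code: the helper names build_drink_regex and add_is_within are made up for illustration, sdf is assumed to be the question's Spark DataFrame, and "MILK" is an invented non-matching sample. The pattern is first checked locally with Python's re module, which should behave the same as Spark's regex engine for a simple alternation like this one.

```python
import re

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]


def build_drink_regex(drinks):
    """Join the drinks into a lowercase alternation such as (soda)|(water)."""
    return "|".join("(" + d.lower() + ")" for d in drinks)


def add_is_within(sdf, drinks, column="Beverage"):
    """Add a Boolean is_within column; sdf is assumed to be a Spark DataFrame."""
    # Imported here so the regex helper above can be tried without Spark.
    from pyspark.sql import functions as F

    regex = build_drink_regex(drinks)
    return sdf.withColumn("is_within", F.lower(F.col(column)).rlike(regex))


# Local sanity check of the pattern against the question's sample strings.
# "MILK" is a hypothetical extra sample that should not match.
regex = build_drink_regex(DRINK_LIST)
samples = ["SODA", "JUICE 1L", "WATER 64OZ", "HOT TEA", "MILK"]
matches = {s: re.search(regex, s.lower()) is not None for s in samples}
```

With a Spark session available, add_is_within(sdf, DRINK_LIST).show() would display the new column.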

I would rather use rlike than contains, since I am not sure we can get the same result with contains without checking each drink individually.
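For comparison, a contains-based version does seem possible if one check per drink is built and the checks are OR-ed together. The following is a minimal sketch under assumptions: contains_any and add_contains_flag are hypothetical helper names, sdf is assumed to be the question's Spark DataFrame, and the drink list is assumed to be uppercase.

```python
from functools import reduce
from operator import or_

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]


def contains_any(column, drinks):
    """OR together one column.contains(drink) check per drink."""
    return reduce(or_, (column.contains(d) for d in drinks))


def add_contains_flag(sdf, drinks, column="Beverage"):
    """Add a Boolean is_within column; sdf is assumed to be a Spark DataFrame."""
    # Imported here so contains_any can be exercised without Spark.
    from pyspark.sql import functions as F

    # Uppercase the column so it lines up with the uppercase drink list.
    return sdf.withColumn("is_within", contains_any(F.upper(F.col(column)), drinks))
```

Note that contains is a plain substring test, so a value such as "HOT STEA" would also be flagged because it contains "TEA".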

Good luck!


Ged · Accepted Answer · 2022-08-12 18:56:19Z

1

%python

import pandas as pd
import pyspark.sql.functions as py 

data = [[1, 'SODA'], [2, 'JUICE'], [3, 'WATER'], [4, 'HOT STEA']]

df = pd.DataFrame(data, columns=['ID', 'Beverage'])
sdf = spark.createDataFrame(df)
sdf.show()

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]

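# May need an upper- or lowercase conversion first. Note this uses the Python API, not the Scala one.
# The lookbehind (?<=^|\s) and lookahead (?=\s|$) require the drink to stand alone as a word.
pattern = r'(?<=^|\s)(' + '|'.join(DRINK_LIST) + r')(?=\s|$)'
sdf = sdf.withColumn('check', py.when(py.length(py.regexp_extract('Beverage', pattern, 0)) > 0, True).otherwise(False))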
sdf.show()

returns:

+---+--------+-----+
| ID|Beverage|check|
+---+--------+-----+
|  1|    SODA| true|
|  2|   JUICE| true|
|  3|   WATER| true|
|  4|HOT STEA|false|
+---+--------+-----+
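A shorter variant of the same whole-word idea could use \b word boundaries with rlike, which returns a Boolean directly. This is only a sketch, not the answerer's code: build_whole_word_regex and add_check are hypothetical names, sdf is assumed to be a Spark DataFrame, and the drink names are assumed to contain only letters so they need no regex escaping. The pattern is checked locally with Python's re module against the question's samples plus the "HOT STEA" row from this answer.

```python
import re

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]


def build_whole_word_regex(drinks):
    """Build a case-insensitive pattern that matches any drink as a whole word."""
    # Assumes drink names are plain letters, so no escaping is applied.
    return r"(?i)\b(" + "|".join(drinks) + r")\b"


def add_check(sdf, drinks, column="Beverage"):
    """Add a Boolean check column; sdf is assumed to be a Spark DataFrame."""
    # Imported here so the pattern helper can be tried without Spark.
    from pyspark.sql import functions as F

    return sdf.withColumn("check", F.col(column).rlike(build_whole_word_regex(drinks)))


# Local sanity check of the pattern.
pattern = build_whole_word_regex(DRINK_LIST)
samples = ["SODA", "JUICE 1L", "WATER 64OZ", "HOT TEA", "HOT STEA"]
results = {s: bool(re.search(pattern, s)) for s in samples}
```

Here "HOT TEA" matches because TEA stands alone as a word, while "HOT STEA" does not because the S before TEA prevents a word boundary.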
