0

I have a Spark Dataframe that has a column containing strings. These strings are referencing beverages, but can also include amounts / volumes / etc (there is no consistency so a regular expression can help clean this up, but can not resolve this). As a way to circumvent that I was hoping to use a filter to determine if the column string is found in a list and then generate a new column with a boolean, but am not sure the best way to do so.

I tried using case-when logic but that did not work.

I prefer contains because it can account for not exact matching which isin will require.

data = [
    [
        1,
        "SODA",
    ],
    [
        2,
        "JUICE 1L",
    ],
    [
        3,
        "WATER 64OZ",
    ],
    [
        4,
        "HOT TEA",
    ],
]

df = pd.DataFrame(data, columns=["ID", "Beverage"])

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]
sdf = spark.createDataFrame(df)

Does anyone know the best way to do this?

2 Answers 2

1

Assume this is your drinks array (or list):

val drinks = Array("SODA", "WATER", "COFFEE", "TEA", "JUICE")

We can convert this to a regex expression so we can apply it in rlike API:

val regex = drinks.map(x => "(" + x + ")").mkString("|").toLowerCase()

which gives (soda)|(water)|(coffee)|(tea)|(juice).

Then, we can apply this to the dataset, like:

df = df.withColumn("is_within", rlike(lower(beverage), regex))

PS: converting the column to lowercase but also your list, just to be sure that everything is compatible.

I would rather use rlike instead of contains, since I am not sure if we can come to the same result using the latter without doing individual checks.

Good luck!

Sign up to request clarification or add additional context in comments.

Comments

1
%python

import pandas as pd
import pyspark.sql.functions as py 

data = [[1, 'SODA'], [2, 'JUICE'], [3, 'WATER'], [4, 'HOT STEA']]

df = pd.DataFrame(data, columns=['ID', 'Beverage'])
sdf = spark.createDataFrame(df)
sdf.show()

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]

# May need upper or lowercase conversion. Different to scala api.
sdf = sdf.withColumn('check', py.when(py.length(py.regexp_extract('Beverage', '(?=^|\s)(' + '|'.join(DRINK_LIST) + ')(?=\s|$)', 0)) > 0, True).otherwise(False))    
sdf.show()
         

returns:

+---+--------+-----+
| ID|Beverage|check|
+---+--------+-----+
|  1|    SODA| true|
|  2|   JUICE| true|
|  3|   WATER| true|
|  4|HOT STEA|false|
+---+--------+-----+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.