I have a Spark DataFrame with a column of strings. The strings refer to beverages, but can also include amounts, volumes, etc. There is no consistency, so a regular expression can help clean this up but cannot fully resolve it. To work around that, I was hoping to use a filter to check whether the column string matches an entry in a list, and then generate a new boolean column from the result, but I am not sure of the best way to do so.
I tried using case-when logic, but that did not work.
I would prefer contains because it allows for inexact matches, whereas isin requires an exact match.
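import pandas as pd
from pyspark.sql import SparkSession
# Reuse the active Spark session, or create one if none exists.
spark = SparkSession.builder.getOrCreate()
# Sample data: an ID plus a free-text beverage description.
data = [
    [1, "SODA"],
    [2, "JUICE 1L"],
    [3, "WATER 64OZ"],
    [4, "HOT TEA"],
]
df = pd.DataFrame(data, columns=["ID", "Beverage"])
# Keywords to look for inside the Beverage column.
DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]
sdf = spark.createDataFrame(df)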
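To make the exact-versus-substring distinction concrete, here is a minimal plain-Python sketch (no Spark involved) run on the same sample values. The names `beverages`, `exact`, `partial`, and `pattern` are only for illustration, and the list comprehensions merely mimic what isin and contains would do per row; this is not the Spark API itself.

```python
import re

# Sample values and keyword list copied from the question.
beverages = ["SODA", "JUICE 1L", "WATER 64OZ", "HOT TEA"]
DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]

# Exact matching (isin-style): the whole string must equal a list entry.
exact = [b in DRINK_LIST for b in beverages]

# Substring matching (contains-style): any list entry may appear anywhere
# in the string. The keywords are escaped and joined into one alternation.
pattern = re.compile("|".join(re.escape(d) for d in DRINK_LIST))
partial = [bool(pattern.search(b)) for b in beverages]

print(exact)    # [True, False, False, False]
print(partial)  # [True, True, True, True]
```

In Spark, I assume the same alternation pattern could be passed to `Column.rlike`, or the individual `Column.contains` conditions could be OR-ed together, but I have not verified either here. Note that plain substring matching would also flag unrelated strings that happen to contain a keyword (for example, "STEAK" contains "TEA").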
Does anyone know the best way to do this?