
I'm working on an NLP model at the moment and am currently optimizing the pre-processing steps. Since I'm using a custom Python function, Polars cannot parallelize the operation.

I've tried a few things with Polars' "replace_all" and some ".when.then.otherwise" constructs, but have not found a solution yet.

In this case I am expanding contractions (e.g. "I'm" -> "I am").

I currently use this:

import re
import polars as pl

# These are only a few example contractions that I use.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not"
}

c_re = re.compile("(%s)" % "|".join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]

    return c_re.sub(replace, text)


df = pl.DataFrame({"Text": ["i'm i've, isn't"]})
df["Text"].map_elements(expandContractions)

Outputs

shape: (1, 1)
┌─────────────────────┐
│ Text                │
│ ---                 │
│ str                 │
╞═════════════════════╡
│ i am i have, is not │
└─────────────────────┘

But I would like to use the full performance benefits of Polars, because the datasets I process are quite large.


Performance test:

import functools
import re

import polars as pl

# This dict has 100+ key/value pairs in my test case.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not"
}

def base_case(sr: pl.Series) -> pl.Series:
    c_re = re.compile("(%s)" % "|".join(cList.keys()))
    def expandContractions(text, c_re=c_re):
        def replace(match):
            return cList[match.group(0)]

        return c_re.sub(replace, text)

    sr = sr.map_elements(expandContractions)
    return sr


def loop_case(sr: pl.Series) -> pl.Series:

    for old, new in cList.items():
        sr = sr.str.replace_all(old, new, literal=True)

    return sr



def iter_case(sr: pl.Series) -> pl.Series:
    sr = functools.reduce(
        lambda res, x: getattr(getattr(res, "str"), "replace_all")(
            x[0], x[1], literal=True
        ),
        cList.items(),
        sr,
    )
    return sr

They all return equal results. Here are the average times for 15 loops of ~10,000 samples with a sample length of ~500 characters:

Base case: 16.112362766265868
Loop case: 7.028670716285705
Iter case: 7.112465214729309

So it is more than double the speed using either of these methods, and that's mostly thanks to the Polars API call "replace_all". I ended up using the loop case, since that way I have one less module to import. See this question answered by jqurious.

  • Have you tried stackoverflow.com/a/74738355 Commented Jan 23, 2023 at 16:12
  • I had a look at it before but maybe I didn't get it then. But it is basically just looping over the dict, replacing each key with its value. Should work! I will test it tomorrow when I'm back at work. However, it would be nice if there is a polars-idiomatic way to do this as well! Commented Jan 23, 2023 at 17:08
  • The loop from that question is what I ended up using, mostly because I don't need to import any other modules. Commented Jan 24, 2023 at 7:53
  • There have been requests to simplify this operation - github.com/pola-rs/polars/issues/5815 is the most recent discussion I could find. Commented Jan 24, 2023 at 15:10
  • Ah I see, thanks for telling me. I linked to this question if anyone else stumbles into this problem from there. Commented Jan 24, 2023 at 15:38

2 Answers


.str.replace_many() has since been added to Polars (for non-regex replacements):

df.with_columns(
    pl.col.Text.str.replace_many(
        ["i'm", "i've", "isn't"],
        ["i am", "i have", "is not"]
    )
)
shape: (1, 1)
┌─────────────────────┐
│ Text                │
│ ---                 │
│ str                 │
╞═════════════════════╡
│ i am i have, is not │
└─────────────────────┘

It currently requires passing the "old" and "new" values separately, but it will also accept a dictionary in the future.




How about

(
    df['Text']
    .str.replace_all("i'm", "i am", literal=True)
    .str.replace_all("i've", "i have", literal=True)
    .str.replace_all("isn't", "is not", literal=True)
)

?


or:

functools.reduce(
    lambda res, x: getattr(
        getattr(res, "str"), "replace_all"
    )(x[0], x[1], literal=True),
    cList.items(),
    df["Text"],
)

Comments

Thanks, you're right, but I've only given a short snippet of all the contractions needed. There are probably 100+ cases, and few people would be proud of code like that. Sorry for the misunderstanding.
sure, I've given another suggestion
Thanks, I will try it out tomorrow when I'm back at work!
I updated the question with some performance tests. I ended up using another version of yours, since it doesn't require an import, and a for loop was marginally faster too.
interesting, thanks! you've done the right thing - "measure, don't guess"
