
I'm working on an NLP model at the moment and am currently optimizing the pre-processing steps. Since I'm using a custom Python function, Polars cannot parallelize the operation.

I've tried a few things with Polars' "replace_all" and some ".when.then.otherwise" constructs, but have not found a solution yet.

In this case I am expanding contractions (e.g. "I'm" -> "I am").

I currently use this:

import re
import polars as pl

# These are only a few example contractions that I use.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not"
}

c_re = re.compile("(%s)" % "|".join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]

    return c_re.sub(replace, text)


df = pl.DataFrame({"Text": ["i'm i've, isn't"]})
df["Text"].map_elements(expandContractions)

Outputs

shape: (1, 1)
┌─────────────────────┐
│ Text                │
│ ---                 │
│ str                 │
╞═════════════════════╡
│ i am i have, is not │
└─────────────────────┘

But I would like to use the full performance benefits of Polars, because the datasets I process are quite large.


Performance test:

import functools
import re

import polars as pl

# This dict has 100+ key/value pairs in my test case.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not"
}

def base_case(sr: pl.Series) -> pl.Series:
    c_re = re.compile("(%s)" % "|".join(cList.keys()))
    def expandContractions(text, c_re=c_re):
        def replace(match):
            return cList[match.group(0)]

        return c_re.sub(replace, text)

    sr = sr.map_elements(expandContractions)
    return sr


def loop_case(sr: pl.Series) -> pl.Series:

    for old, new in cList.items():
        sr = sr.str.replace_all(old, new, literal=True)

    return sr



def iter_case(sr: pl.Series) -> pl.Series:
    sr = functools.reduce(
        lambda res, x: getattr(getattr(res, "str"), "replace_all")(
            x[0], x[1], literal=True
        ),
        cList.items(),
        sr,
    )
    return sr

They all return equal results. Here are the average times for 15 loops of ~10,000 samples with a sample length of ~500 characters:

Base case: 16.112362766265868
Loop case: 7.028670716285705
Iter case: 7.112465214729309

So it is more than double the speed using either of these methods, and that's mostly thanks to the Polars API call "replace_all". I ended up using the loop case, since that way I have one less module to import. See this question answered by jqurious.

  • Have you tried stackoverflow.com/a/74738355 Commented Jan 23, 2023 at 16:12
  • I had a look at it before but maybe I didn't get it then. But it is basically just looping over the dict, replacing each key with its value. Should work! I will test it tomorrow when I'm back at work. However, it would be nice if there is a polars-idiomatic way to do this as well! Commented Jan 23, 2023 at 17:08
  • The loop from that question is what I ended up using, mostly because I don't need to import any other modules. Commented Jan 24, 2023 at 7:53
  • There have been requests to simplify this operation - github.com/pola-rs/polars/issues/5815 is the most recent discussion I could find. Commented Jan 24, 2023 at 15:10
  • Ah I see, thanks for telling me. I linked to this question if anyone else stumbles into this problem from there. Commented Jan 24, 2023 at 15:38

2 Answers


.str.replace_many() has since been added to Polars (for non-regex replacements):

df.with_columns(
    pl.col.Text.str.replace_many(
        ["i'm", "i've", "isn't"],
        ["i am", "i have", "is not"]
    )
)
shape: (1, 1)
┌─────────────────────┐
│ Text                │
│ ---                 │
│ str                 │
╞═════════════════════╡
│ i am i have, is not │
└─────────────────────┘

It currently requires passing the "old" and "new" values separately, but it will also accept a dictionary in the future.




How about

(
    df['Text']
    .str.replace_all("i'm", "i am", literal=True)
    .str.replace_all("i've", "i have", literal=True)
    .str.replace_all("isn't", "is not", literal=True)
)

?


or:

functools.reduce(
    lambda res, x: getattr(
        getattr(res, "str"), "replace_all"
    )(x[0], x[1], literal=True),
    cList.items(),
    df["Text"],
)

Comments

Thanks, you're right, but I've only given a short snippet of all the contractions needed. There are probably 100+ cases, and few people would be proud of code like that. Sorry for the misunderstanding.
sure, I've given another suggestion
Thanks, I will try it out tomorrow when I'm back at work!
I updated the question with some performance tests. I ended up using another version of yours, since it doesn't require an import, and a for loop was marginally faster too.
interesting, thanks! you've done the right thing - "measure, don't guess"
