
I am curious whether I am missing something in the Polars expression library that would let this be done more efficiently. I have a dataframe of protein sequences from which I would like to create k-long substrings (k-mers), like the kmerize function below does.

def kmerize(sequence, ksize):
    kmers = [sequence[i : (i + ksize)] for i in range(len(sequence) - ksize + 1)]
    return kmers
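For a short toy string, this pure-Python version (restated from the function above, with the same logic) produces one k-mer per possible start position:

```python
def kmerize(sequence, ksize):
    # one substring per start position 0 .. len(sequence) - ksize
    return [sequence[i : i + ksize] for i in range(len(sequence) - ksize + 1)]

# "ABCDEFG" with ksize=3 yields 5 k-mers
print(kmerize("ABCDEFG", 3))
# → ['ABC', 'BCD', 'CDE', 'DEF', 'EFG']
```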

Within Polars, I did a group_by on the sequence; within map_groups, the sequence was repeated length times and exploded, and a row index was added. This row index was used to slice the sequences into k-mers, which were then filtered to keep only k-mers of the correct size.

Here is a minimally reproducible example:

from io import StringIO

import polars as pl

s = """sequence_name,sequence,length
sp|O43236|SEPT4_HUMAN Septin-4 OS=Homo sapiens OX=9606 GN=SEPTIN4 PE=1 SV=1,MDRSLGWQGNSVPEDRTEAGIKRFLEDTTDDGELSKFVKDFSGNASCHPPEAKTWASRPQVPEPRPQAPDLYDDDLEFRPPSRPQSSDNQQYFCAPAPLSPSARPRSPWGKLDPYDSSEDDKEYVGFATLPNQVHRKSVKKGFDFTLMVAGESGLGKSTLVNSLFLTDLYRDRKLLGAEERIMQTVEITKHAVDIEEKGVRLRLTIVDTPGFGDAVNNTECWKPVAEYIDQQFEQYFRDESGLNRKNIQDNRVHCCLYFISPFGHGLRPLDVEFMKALHQRVNIVPILAKADTLTPPEVDHKKRKIREEIEHFGIKIYQFPDCDSDEDEDFKLQDQALKESIPFAVIGSNTVVEARGRRVRGRLYPWGIVEVENPGHCDFVKLRTMLVRTHMQDLKDVTRETHYENYRAQCIQSMTRLVVKERNRNKLTRESGTDFPIPAVPPGTDPETEKLIREKDEELRRMQEMLHKIQKQMKENY,478
sp|O43521|B2L11_HUMAN Bcl-2-like protein 11 OS=Homo sapiens OX=9606 GN=BCL2L11 PE=1 SV=1,MAKQPSDVSSECDREGRQLQPAERPPQLRPGAPTSLQTEPQGNPEGNHGGEGDSCPHGSPQGPLAPPASPGPFATRSPLFIFMRRSSLLSRSSSGYFSFDTDRSPAPMSCDKSTQTPSPPCQAFNHYLSAMASMRQAEPADMRPEIWIAQELRRIGDEFNAYYARRVFLNNYQAAEDHPRMVILRLLRYIVRLVWRMH,198
sp|O60238|BNI3L_HUMAN BCL2/adenovirus E1B 19 kDa protein-interacting protein 3-like OS=Homo sapiens OX=9606 GN=BNIP3L PE=1 SV=1,MSSHLVEPPPPLHNNNNNCEENEQSLPPPAGLNSSWVELPMNSSNGNDNGNGKNGGLEHVPSSSSIHNGDMEKILLDAQHESGQSSSRGSSHCDSPSPQEDGQIMFDVEMHTSRDHSSQSEEEVVEGEKEVEALKKSADWVSDWSSRPENIPPKEFHFRHPKRSVSLSMRKSGAMKKGGIFSAEFLKVFIPSLFLSHVLALGLGIYIGKRLSTPSASTY,219
sp|O95197|RTN3_HUMAN Reticulon-3 OS=Homo sapiens OX=9606 GN=RTN3 PE=1 SV=2,MAEPSAATQSHSISSSSFGAEPSAPGGGGSPGACPALGTKSCSSSCADSFVSSSSSQPVSLFSTSQEGLSSLCSDEPSSEIMTSSFLSSSEIHNTGLTILHGEKSHVLGSQPILAKEGKDHLDLLDMKKMEKPQGTSNNVSDSSVSLAAGVHCDRPSIPASFPEHPAFLSKKIGQVEEQIDKETKNPNGVSSREAKTALDADDRFTLLTAQKPPTEYSKVEGIYTYSLSPSKVSGDDVIEKDSPESPFEVIIDKAAFDKEFKDSYKESTDDFGSWSVHTDKESSEDISETNDKLFPLRNKEAGRYPMSALLSRQFSHTNAALEEVSRCVNDMHNFTNEILTWDLVPQVKQQTDKSSDCITKTTGLDMSEYNSEIPVVNLKTSTHQKTPVCSIDGSTPITKSTGDWAEASLQQENAITGKPVPDSLNSTKEFSIKGVQGNMQKQDDTLAELPGSPPEKCDSLGSGVATVKVVLPDDHLKDEMDWQSSALGEITEADSSGESDDTVIEDITADTSFENNKIQAEKPVSIPSAVVKTGEREIKEIPSCEREEKTSKNFEELVSDSELHQDQPDILGRSPASEAACSKVPDTNVSLEDVSEVAPEKPITTENPKLPSTVSPNVFNETEFSLNVTTSAYLESLHGKNVKHIDDSSPEDLIAAFTETRDKGIVDSERNAFKAISEKMTDFKTTPPVEVLHENESGGSEIKDIGSKYSEQSKETNGSEPLGVFPTQGTPVASLDLEQEQLTIKALKELGERQVEKSTSAQRDAELPSEEVLKQTFTFAPESWPQRSYDILERNVKNGSDLGISQKPITIRETTRVDAVSSLSKTELVKKHVLARLLTDFSVHDLIFWRDVKKTGFVFGTTLIMLLSLAAFSVISVVSYLILALLSVTISFRIYKSVIQAVQKSEEGHPFKAYLDVDITLSSEAFHNYMNAAMVHINRALKLIIRLFLVEDLVDSLKLAVFMWLMTYVGAVFNGITLLILAELLIFSVPIVYEKYKTQIDHYVGIARDQTKSIVEKIQAKLPGIAKKKAE,1032
sp|P10415|BCL2_HUMAN Apoptosis regulator Bcl-2 OS=Homo sapiens OX=9606 GN=BCL2 PE=1 SV=2,MAHAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDVGAAPPGAAPAPGIFSSQPGHTPHPAASRDPVARTSPLQTPAAPGAAAGPALSPVPPVVHLTLRQAGDDFSRRYRRDFAEMSSQLHLTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEYLNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK,239"""

ksize = 24

df = pl.scan_csv(StringIO(s))
df.group_by("sequence").map_groups(
    lambda group_df: group_df.with_columns(kmers=pl.col("sequence").repeat_by("length"))
    .explode("kmers")
    .with_row_index(),
    schema={"index": pl.UInt32, "sequence_name": pl.String, "sequence": pl.String, "length": pl.Int64, "kmers": pl.String},
).with_columns(pl.col("kmers").str.slice("index", ksize)).filter(
    pl.col("kmers").str.len_chars() == ksize
).rename(
    {"index": "start"}
).collect()

This code produces this dataframe:

[Image: k-mer dataframe produced by the code above]

Is there a more efficient way to do this in Polars? I will be using dataframes with ~250k sequences, each ~100-1000 letters long, so I'd like to do this as low-resource as possible.

Thank you and have a beautiful day!


2 Answers


I'm new to Polars, so unfortunately I don't know how to use the pull request 13747 updates either, but issue 10833 had a code snippet, and I also tried adapting your approach. I tried the 3 approaches shown below and got the following timings for a fake dataset of 25,000 sequences:

  • Took 56.90 seconds for the approach you show above
  • Took 12.97 seconds for my adaptation of your approach
  • Took 5.72 seconds for the map_elements_approach from the github issue

Here's the code with edit from @jqurious:

import polars as pl
import numpy as np
import time

np.random.seed(0)
num_seqs = 25_000
min_seq_len = 100
max_seq_len = 1_000
seq_lens = np.random.randint(min_seq_len,max_seq_len,num_seqs)
sequences = [''.join(np.random.choice(['A','C','G','T'],seq_len)) for seq_len in seq_lens]
data = {'sequence': sequences, 'length': seq_lens}

df = pl.DataFrame(data)

ksize = 24

def op_approach(df):
    start = time.time()
    kmer_df = df.group_by("sequence").map_groups(
        lambda group_df: group_df.with_columns(kmers=pl.col("sequence").repeat_by("length"))
        .explode("kmers")
        .with_row_index()
    ).with_columns(
        pl.col("kmers").str.slice("index", ksize)
    ).filter(pl.col("kmers").str.len_chars() == ksize)
    print(f"Took {time.time()-start:.2f} seconds for op_approach")
    
    return kmer_df

def kmer_index_approach(df):
    start = time.time()
    kmer_df = df.with_columns(
        pl.int_ranges(0,pl.col("length").sub(ksize)+1).alias("kmer_starts")
    ).explode("kmer_starts").with_columns(
        pl.col("sequence").str.slice("kmer_starts", ksize).alias("kmers")
    )
    print(f"Took {time.time()-start:.2f} seconds for kmer_index_approach")
    
    return kmer_df
    

def map_elements_approach(df):
    #Stolen nearly directly from https://github.com/pola-rs/polars/issues/10833#issuecomment-1703894870
    start = time.time()
    def create_cngram(message, ngram=3):
        if ngram <= 0:
            return []
        return [{"start":i, "kmers":message[i:i+ngram]} for i in range(len(message) - ngram + 1)]

    kmer_df = df.with_columns(
        pl.col("sequence").map_elements(
            lambda message: create_cngram(message=message, ngram=ksize),
            return_dtype = pl.List(pl.Struct({"start": pl.Int64, "kmers": pl.String})),
        ).alias("kmers")
    ).explode("kmers").unnest("kmers")
    print(f"Took {time.time()-start:.2f} seconds for map_elements_approach")
    
    return kmer_df

op_res = op_approach(df)
kmer_index_res = kmer_index_approach(df)
map_res = map_elements_approach(df)


assert op_res["kmers"].sort().equals(map_res["kmers"].sort())
assert op_res["kmers"].sort().equals(kmer_index_res["kmers"].sort())

The kmer_index_approach is inspired by your use of str.slice, which I think is cool, but it avoids any grouping: it first explodes a new column of start indices, which should require less memory than replicating the entire sequence before replacing it with a k-mer. It also avoids the filtering step that removes partial k-mers. The trade-off is an extra column, kmer_starts, which needs to be removed afterwards.

The map_elements_approach is based on the approach from the GitHub issue, where mmantyla uses map/apply to run a plain Python function on every element.
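To make the struct-return pattern concrete, here is what the list of dicts looks like for a toy string before explode/unnest turns each dict into a row (this restates the create_cngram helper from the snippet above):

```python
def create_cngram(message, ngram=3):
    if ngram <= 0:
        return []
    # one dict per k-mer; each dict becomes a struct, and explode + unnest
    # then yield one row per k-mer with columns "start" and "kmers"
    return [
        {"start": i, "kmers": message[i : i + ngram]}
        for i in range(len(message) - ngram + 1)
    ]

print(create_cngram("ABCDE", ngram=3))
# → [{'start': 0, 'kmers': 'ABC'}, {'start': 1, 'kmers': 'BCD'}, {'start': 2, 'kmers': 'CDE'}]
```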

I'm personally surprised that the map_elements approach is the fastest, and by a large margin; but again, I don't know whether there's a better approach based on the pull request you shared.


10 Comments

Thank you! I also need the start position of each ngram/k-mer because I need to know where each k-mer is in the protein sequence for downstream processing, and I don't see that in the map_elements approach. The original approach used polars.DataFrame.with_row_index(), but that method doesn't exist for polars.Series objects. Is there another way?
@OlgaBotvinnik If you return a dict i.e. return [{"start": i, "kmers": message[i:i+ngram]} for i ... it will be of type pl.List(pl.Struct({"start": pl.Int64, "kmers": pl.String})) which you can .unnest() after the explode. Not sure what is up with the timings, the kmer_index approach is the fastest on my machine.
@OlgaBotvinnik I've edited the code to include @jqurious's suggestion so that now the map_elements gives the start index of each kmer. I'd be interested in the timings you observe.
Thank you @jqurious and @rbierman! From the original version (without start position), I get these timings: 17.08 seconds for op_approach, 11.47 seconds for kmer_index_approach, 1.19 seconds for map_elements_approach. With start positions, I get: 18.04 seconds for op_approach, 11.62 seconds for kmer_index_approach, 3.82 seconds for map_elements_approach. On my machine (MacBook M4 Max), map_elements is fastest!
I made an issue for adding binary slicing: github.com/pola-rs/polars/issues/21514. As a temporary workaround, you can use a list of bytes: df.lazy().with_columns(pl.col.sequence.cast(pl.Binary).cast(pl.List(pl.UInt8))).with_columns(pl.int_ranges(0, pl.col("length").sub(ksize) + 1).alias("kmer_starts")).explode("kmer_starts").with_columns(pl.col("sequence").list.slice("kmer_starts", ksize).alias("kmers")).collect(new_streaming=True). On my machine this runs in ~1.23 seconds.

Results upfront

TL;DR: Performance improvement from 17.02 seconds with map_elements_approach to 7.21 seconds with this.

Preamble

Ideally there'd be a way to use .list.eval to do everything in polars without pre-exploding and, with the following function, inspired by this one, you can.

import polars as pl
from inspect import signature
from typing import Callable

def list_eval_ref(
    listcol: pl.Expr | str,
    op: Callable[..., pl.Expr],
    *ref_cols: str | pl.Expr,
):
    if len(ref_cols) == 0:
        ref_cols = tuple([x for x in signature(op).parameters.keys()][1:])

    args_to_op = [pl.element().struct[0].explode()] + [
        pl.element().struct[i + 1] for i in range(len(ref_cols))
    ]
    return pl.concat_list(pl.struct(listcol, *ref_cols)).list.eval(op(*args_to_op))

This function rolls all the columns you need to interact with into a struct with your actual list thereby making them available inside of .list.eval. If our list is generated from pl.int_ranges then we can replicate the for i in range(len(message) - ngram + 1) entirely in polars.

Function for this problem (using the one from above)

With this problem we'd use it like this

def eval_trick_approach(df):
    start = time.time()
    kmer_df = (
        df.with_columns(
            list_eval_ref(
                pl.int_ranges(0, pl.col("sequence").str.len_chars() - ksize + 1),
                lambda i, sequence: pl.struct(
                    i.alias("start"), sequence.str.slice(i, ksize).alias("kmers")
                ),
            ).alias("comb")
        )
        .explode("comb")
        .unnest("comb")
    )
    print(f"Took {time.time()-start:.2f} seconds for eval_trick_approach")
    return kmer_df

The first argument to list_eval_ref is either an Expr or a column name; in this case it's pl.int_ranges, which produces a list from 0 to n. The second argument is a function (here a lambda) whose first parameter will represent your initial list. The second (and later) parameters represent your other column(s). If those parameter names match the actual column names, then you don't need to specify them to list_eval_ref; instead, it will get them by inspecting the function's signature. If you don't have matching parameter/column names, then you have to tell list_eval_ref about those columns/expressions as its third (and later) arguments.

For example, above, it uses lambda i, sequence, where the first parameter's name doesn't matter (it will be the list) but sequence matches the column name from the df. If it were written as lambda i, anything_else: (...), then it would have to be followed by "sequence" as the third argument to list_eval_ref.

Note: the function used here must return a pl.Expr; it can't run arbitrary Python functions.

Performance

On my machine, with @rbierman's test setup, map_elements_approach took 17.02 seconds whereas eval_trick_approach took 7.21 seconds.

3 Comments

Strange, on my machine I get 13.19 seconds for eval_trick_approach and 6.17 seconds for map_elements_approach
That is really weird, I'm on Windows WSL Ubuntu. What about you?
MacOS Mac Air M2
