
I am curious whether I am missing something in the Polars expression library that would let this be done more efficiently. I have a dataframe of protein sequences from which I would like to create k-long substrings (k-mers), like the kmerize function below does.

def kmerize(sequence, ksize):
    kmers = [sequence[i : (i + ksize)] for i in range(len(sequence) - ksize + 1)]
    return kmers
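For a short toy string, this pure-Python version (restated from the function above, with the same logic) produces one k-mer per possible start position:

```python
def kmerize(sequence, ksize):
    # one substring per start position 0 .. len(sequence) - ksize
    return [sequence[i : i + ksize] for i in range(len(sequence) - ksize + 1)]

# "ABCDEFG" with ksize=3 yields 5 k-mers
print(kmerize("ABCDEFG", 3))
# → ['ABC', 'BCD', 'CDE', 'DEF', 'EFG']
```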

Within Polars, I did a group_by on the sequence; within map_groups, the sequence was repeated length times and exploded, and a row index was added. This row index was used to slice the sequences into k-mers, which were then filtered to keep only k-mers of the correct size.

Here is a minimally reproducible example:

from io import StringIO

import polars as pl

s = """sequence_name,sequence,length
sp|O43236|SEPT4_HUMAN Septin-4 OS=Homo sapiens OX=9606 GN=SEPTIN4 PE=1 SV=1,MDRSLGWQGNSVPEDRTEAGIKRFLEDTTDDGELSKFVKDFSGNASCHPPEAKTWASRPQVPEPRPQAPDLYDDDLEFRPPSRPQSSDNQQYFCAPAPLSPSARPRSPWGKLDPYDSSEDDKEYVGFATLPNQVHRKSVKKGFDFTLMVAGESGLGKSTLVNSLFLTDLYRDRKLLGAEERIMQTVEITKHAVDIEEKGVRLRLTIVDTPGFGDAVNNTECWKPVAEYIDQQFEQYFRDESGLNRKNIQDNRVHCCLYFISPFGHGLRPLDVEFMKALHQRVNIVPILAKADTLTPPEVDHKKRKIREEIEHFGIKIYQFPDCDSDEDEDFKLQDQALKESIPFAVIGSNTVVEARGRRVRGRLYPWGIVEVENPGHCDFVKLRTMLVRTHMQDLKDVTRETHYENYRAQCIQSMTRLVVKERNRNKLTRESGTDFPIPAVPPGTDPETEKLIREKDEELRRMQEMLHKIQKQMKENY,478
sp|O43521|B2L11_HUMAN Bcl-2-like protein 11 OS=Homo sapiens OX=9606 GN=BCL2L11 PE=1 SV=1,MAKQPSDVSSECDREGRQLQPAERPPQLRPGAPTSLQTEPQGNPEGNHGGEGDSCPHGSPQGPLAPPASPGPFATRSPLFIFMRRSSLLSRSSSGYFSFDTDRSPAPMSCDKSTQTPSPPCQAFNHYLSAMASMRQAEPADMRPEIWIAQELRRIGDEFNAYYARRVFLNNYQAAEDHPRMVILRLLRYIVRLVWRMH,198
sp|O60238|BNI3L_HUMAN BCL2/adenovirus E1B 19 kDa protein-interacting protein 3-like OS=Homo sapiens OX=9606 GN=BNIP3L PE=1 SV=1,MSSHLVEPPPPLHNNNNNCEENEQSLPPPAGLNSSWVELPMNSSNGNDNGNGKNGGLEHVPSSSSIHNGDMEKILLDAQHESGQSSSRGSSHCDSPSPQEDGQIMFDVEMHTSRDHSSQSEEEVVEGEKEVEALKKSADWVSDWSSRPENIPPKEFHFRHPKRSVSLSMRKSGAMKKGGIFSAEFLKVFIPSLFLSHVLALGLGIYIGKRLSTPSASTY,219
sp|O95197|RTN3_HUMAN Reticulon-3 OS=Homo sapiens OX=9606 GN=RTN3 PE=1 SV=2,MAEPSAATQSHSISSSSFGAEPSAPGGGGSPGACPALGTKSCSSSCADSFVSSSSSQPVSLFSTSQEGLSSLCSDEPSSEIMTSSFLSSSEIHNTGLTILHGEKSHVLGSQPILAKEGKDHLDLLDMKKMEKPQGTSNNVSDSSVSLAAGVHCDRPSIPASFPEHPAFLSKKIGQVEEQIDKETKNPNGVSSREAKTALDADDRFTLLTAQKPPTEYSKVEGIYTYSLSPSKVSGDDVIEKDSPESPFEVIIDKAAFDKEFKDSYKESTDDFGSWSVHTDKESSEDISETNDKLFPLRNKEAGRYPMSALLSRQFSHTNAALEEVSRCVNDMHNFTNEILTWDLVPQVKQQTDKSSDCITKTTGLDMSEYNSEIPVVNLKTSTHQKTPVCSIDGSTPITKSTGDWAEASLQQENAITGKPVPDSLNSTKEFSIKGVQGNMQKQDDTLAELPGSPPEKCDSLGSGVATVKVVLPDDHLKDEMDWQSSALGEITEADSSGESDDTVIEDITADTSFENNKIQAEKPVSIPSAVVKTGEREIKEIPSCEREEKTSKNFEELVSDSELHQDQPDILGRSPASEAACSKVPDTNVSLEDVSEVAPEKPITTENPKLPSTVSPNVFNETEFSLNVTTSAYLESLHGKNVKHIDDSSPEDLIAAFTETRDKGIVDSERNAFKAISEKMTDFKTTPPVEVLHENESGGSEIKDIGSKYSEQSKETNGSEPLGVFPTQGTPVASLDLEQEQLTIKALKELGERQVEKSTSAQRDAELPSEEVLKQTFTFAPESWPQRSYDILERNVKNGSDLGISQKPITIRETTRVDAVSSLSKTELVKKHVLARLLTDFSVHDLIFWRDVKKTGFVFGTTLIMLLSLAAFSVISVVSYLILALLSVTISFRIYKSVIQAVQKSEEGHPFKAYLDVDITLSSEAFHNYMNAAMVHINRALKLIIRLFLVEDLVDSLKLAVFMWLMTYVGAVFNGITLLILAELLIFSVPIVYEKYKTQIDHYVGIARDQTKSIVEKIQAKLPGIAKKKAE,1032
sp|P10415|BCL2_HUMAN Apoptosis regulator Bcl-2 OS=Homo sapiens OX=9606 GN=BCL2 PE=1 SV=2,MAHAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDVGAAPPGAAPAPGIFSSQPGHTPHPAASRDPVARTSPLQTPAAPGAAAGPALSPVPPVVHLTLRQAGDDFSRRYRRDFAEMSSQLHLTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEYLNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK,239"""

ksize = 24

df = pl.scan_csv(StringIO(s))
df.group_by("sequence").map_groups(
    lambda group_df: group_df.with_columns(kmers=pl.col("sequence").repeat_by("length"))
    .explode("kmers")
    .with_row_index(),
    schema={"index": pl.UInt32, "sequence_name": pl.String, "sequence": pl.String, "length": pl.Int64, "kmers": pl.String},
).with_columns(pl.col("kmers").str.slice("index", ksize)).filter(
    pl.col("kmers").str.len_chars() == ksize
).rename(
    {"index": "start"}
).collect()

This code produces this dataframe:

[Image: k-mer dataframe produced by the code above]

Is there a more efficient way to do this in Polars? I will be using dataframes with ~250k sequences, each ~100-1000 letters long, so I'd like to do this as low-resource as possible.

Thank you and have a beautiful day!


2 Answers


I'm new to Polars, so unfortunately I don't know how to use the pull request 13747 updates either, but issue 10833 had a code snippet, and I also tried adapting your approach. I tried the 3 approaches shown below and got the following timings for a fake dataset of 25,000 sequences:

  • Took 56.90 seconds for the approach you show above
  • Took 12.97 seconds for my adaptation of your approach
  • Took 5.72 seconds for the map_elements_approach from the github issue

Here's the code with edit from @jqurious:

import polars as pl
import numpy as np
import time

np.random.seed(0)
num_seqs = 25_000
min_seq_len = 100
max_seq_len = 1_000
seq_lens = np.random.randint(min_seq_len,max_seq_len,num_seqs)
sequences = [''.join(np.random.choice(['A','C','G','T'],seq_len)) for seq_len in seq_lens]
data = {'sequence': sequences, 'length': seq_lens}

df = pl.DataFrame(data)

ksize = 24

def op_approach(df):
    start = time.time()
    kmer_df = df.group_by("sequence").map_groups(
        lambda group_df: group_df.with_columns(kmers=pl.col("sequence").repeat_by("length"))
        .explode("kmers")
        .with_row_index()
    ).with_columns(
        pl.col("kmers").str.slice("index", ksize)
    ).filter(pl.col("kmers").str.len_chars() == ksize)
    print(f"Took {time.time()-start:.2f} seconds for op_approach")
    
    return kmer_df

def kmer_index_approach(df):
    start = time.time()
    kmer_df = df.with_columns(
        pl.int_ranges(0,pl.col("length").sub(ksize)+1).alias("kmer_starts")
    ).explode("kmer_starts").with_columns(
        pl.col("sequence").str.slice("kmer_starts", ksize).alias("kmers")
    )
    print(f"Took {time.time()-start:.2f} seconds for kmer_index_approach")
    
    return kmer_df
    

def map_elements_approach(df):
    #Stolen nearly directly from https://github.com/pola-rs/polars/issues/10833#issuecomment-1703894870
    start = time.time()
    def create_cngram(message, ngram=3):
        if ngram <= 0:
            return []
        return [{"start":i, "kmers":message[i:i+ngram]} for i in range(len(message) - ngram + 1)]

    kmer_df = df.with_columns(
        pl.col("sequence").map_elements(
            lambda message: create_cngram(message=message, ngram=ksize),
            return_dtype = pl.List(pl.Struct({"start": pl.Int64, "kmers": pl.String})),
        ).alias("kmers")
    ).explode("kmers").unnest("kmers")
    print(f"Took {time.time()-start:.2f} seconds for map_elements_approach")
    
    return kmer_df

op_res = op_approach(df)
kmer_index_res = kmer_index_approach(df)
map_res = map_elements_approach(df)


assert op_res["kmers"].sort().equals(map_res["kmers"].sort())
assert op_res["kmers"].sort().equals(kmer_index_res["kmers"].sort())

The kmer_index_approach is inspired by your use of str.slice, which I think is cool, but it avoids any grouping: it first explodes a new column of start indices, which should require less memory than replicating the entire sequence before replacing it with a k-mer. It also avoids the filtering step that removes partial k-mers. The trade-off is an extra column, kmer_starts, which needs to be removed afterwards.

The map_elements_approach is based on the approach from the GitHub issue, where mmantyla uses map/apply to run a plain Python function on every element.
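To make the struct-return pattern concrete, here is what the list of dicts looks like for a toy string before explode/unnest turns each dict into a row (this restates the create_cngram helper from the snippet above):

```python
def create_cngram(message, ngram=3):
    if ngram <= 0:
        return []
    # one dict per k-mer; each dict becomes a struct, and explode + unnest
    # then yield one row per k-mer with columns "start" and "kmers"
    return [
        {"start": i, "kmers": message[i : i + ngram]}
        for i in range(len(message) - ngram + 1)
    ]

print(create_cngram("ABCDE", ngram=3))
# → [{'start': 0, 'kmers': 'ABC'}, {'start': 1, 'kmers': 'BCD'}, {'start': 2, 'kmers': 'CDE'}]
```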

I'm personally surprised that the map_elements approach is the fastest, and by a large margin; but again, I don't know whether there's a better approach based on the pull request you shared.


10 Comments

Thank you! I also need the start position of each ngram/k-mer because I need to know where each k-mer is in the protein sequence for downstream processing, and I don't see that in the map_elements approach. The original approach used polars.DataFrame.with_row_index(), but that method doesn't exist for polars.Series objects. Is there another way?
@OlgaBotvinnik If you return a dict i.e. return [{"start": i, "kmers": message[i:i+ngram]} for i ... it will be of type pl.List(pl.Struct({"start": pl.Int64, "kmers": pl.String})) which you can .unnest() after the explode. Not sure what is up with the timings, the kmer_index approach is the fastest on my machine.
@OlgaBotvinnik I've edited the code to include @jqurious's suggestion so that now the map_elements gives the start index of each kmer. I'd be interested in the timings you observe.
Thank you @jqurious and @rbierman! From the original version (without start position), I get these timings: 17.08 seconds for op_approach, 11.47 seconds for kmer_index_approach, 1.19 seconds for map_elements_approach. With start positions, I get: 18.04 seconds for op_approach, 11.62 seconds for kmer_index_approach, 3.82 seconds for map_elements_approach. On my machine (MacBook M4 Max), map_elements is fastest!
I made an issue for adding binary slicing: github.com/pola-rs/polars/issues/21514. As a temporary workaround, you can use a list of bytes: df.lazy().with_columns(pl.col.sequence.cast(pl.Binary).cast(pl.List(pl.UInt8))).with_columns(pl.int_ranges(0, pl.col("length").sub(ksize) + 1).alias("kmer_starts")).explode("kmer_starts").with_columns(pl.col("sequence").list.slice("kmer_starts", ksize).alias("kmers")).collect(new_streaming=True). On my machine this runs in ~1.23 seconds.

Results upfront

TL;DR: Performance improvement from 17.02 seconds with map_elements_approach to 7.21 seconds with this.

Preamble

Ideally there'd be a way to use .list.eval to do everything in polars without pre-exploding and, with the following function, inspired by this one, you can.

import polars as pl
from inspect import signature
from typing import Callable

def list_eval_ref(
    listcol: pl.Expr | str,
    op: Callable[..., pl.Expr],
    *ref_cols: str | pl.Expr,
):
    if len(ref_cols) == 0:
        ref_cols = tuple([x for x in signature(op).parameters.keys()][1:])

    args_to_op = [pl.element().struct[0].explode()] + [
        pl.element().struct[i + 1] for i in range(len(ref_cols))
    ]
    return pl.concat_list(pl.struct(listcol, *ref_cols)).list.eval(op(*args_to_op))

This function rolls all the columns you need to interact with into a struct with your actual list thereby making them available inside of .list.eval. If our list is generated from pl.int_ranges then we can replicate the for i in range(len(message) - ngram + 1) entirely in polars.

Function for this problem (using the one from above)

With this problem we'd use it like this

def eval_trick_approach(df):
    start = time.time()
    kmer_df = (
        df.with_columns(
            list_eval_ref(
                pl.int_ranges(0, pl.col("sequence").str.len_chars() - ksize + 1),
                lambda i, sequence: pl.struct(
                    i.alias("start"), sequence.str.slice(i, ksize).alias("kmers")
                ),
            ).alias("comb")
        )
        .explode("comb")
        .unnest("comb")
    )
    print(f"Took {time.time()-start:.2f} seconds for eval_trick_approach")
    return kmer_df

The first argument to list_eval_ref is either an Expr or a column name; in this case it's pl.int_ranges, which produces a list from 0 to n. The second argument is a function (here a lambda) whose first parameter will represent your initial list. The second (and later) parameters represent your other column(s). If those parameter names match the actual column names, then you don't need to specify them to list_eval_ref; instead, it will get them by inspecting the function's signature. If you don't have matching parameter/column names, then you have to tell list_eval_ref about those columns/expressions as its third (and later) arguments.

For example, above, it uses lambda i, sequence, where the first parameter's name doesn't matter (it will be the list) but sequence matches the column name from the df. If it were written as lambda i, anything_else: (...), then it would have to be followed by "sequence" as the third argument to list_eval_ref.

Note: the function used here must return a pl.Expr; it can't run arbitrary Python functions.

Performance

On my machine, with @rbierman's test setup, map_elements_approach took 17.02 seconds whereas eval_trick_approach took 7.21 seconds.

3 Comments

Strange, on my machine I get 13.19 seconds for eval_trick_approach and 6.17 seconds for map_elements_approach
That is really weird, I'm on Windows WSL Ubuntu. What about you?
MacOS Mac Air M2
