I am wondering whether I am missing something in the Polars expression API that would make this more efficient. I have a dataframe of protein sequences, and I would like to split each sequence into all of its overlapping substrings of length k (k-mers), as the kmerize function below does.
def kmerize(sequence, ksize):
    kmers = [sequence[i : (i + ksize)] for i in range(len(sequence) - ksize + 1)]
    return kmers
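For example, on a short made-up sequence (not one of the proteins below) with k = 3, this is the result I expect. The function is repeated here only so that the snippet runs on its own:

```python
def kmerize(sequence, ksize):
    # Slide a window of width ksize across the sequence, one position at a time
    return [sequence[i : i + ksize] for i in range(len(sequence) - ksize + 1)]


toy_sequence = "MDRSLG"  # made-up toy input, for illustration only
toy_kmers = kmerize(toy_sequence, 3)
print(toy_kmers)  # ['MDR', 'DRS', 'RSL', 'SLG']
```

A sequence of length n yields n - k + 1 k-mers, and a sequence shorter than k yields none.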
In Polars, I used group_by on the sequence column. Inside map_groups, each sequence is repeated as many times as its length and exploded, and then a row index is added. That row index serves as the start position for slicing each copy into a k-mer, and a final filter keeps only the k-mers that are exactly k letters long.
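To make those steps concrete, here is a plain-Python sketch of the same repeat, index, slice, and filter logic on a toy sequence. The helper name and the toy input are made up for illustration; this is not the Polars code itself, only a rough mirror of what it does for a single sequence:

```python
def kmerize_by_repeat_and_slice(sequence, ksize):
    # Hypothetical helper: mirrors the Polars steps for a single sequence
    # 1. Repeat the sequence once per letter (repeat_by + explode)
    repeated = [sequence] * len(sequence)
    # 2. Pair every copy with its row index (with_row_index)
    indexed = list(enumerate(repeated))
    # 3. Slice each copy starting at its own index (str.slice)
    sliced = [(start, seq[start : start + ksize]) for start, seq in indexed]
    # 4. Drop the trailing slices that ran past the end (filter on length)
    return [(start, kmer) for start, kmer in sliced if len(kmer) == ksize]


rows = kmerize_by_repeat_and_slice("MDRSLG", 3)
print(rows)  # [(0, 'MDR'), (1, 'DRS'), (2, 'RSL'), (3, 'SLG')]
```

Note that this builds one full copy of the sequence per letter, and the last k - 1 copies are sliced only to be thrown away by the filter.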
Here is a minimal reproducible example:
from io import StringIO
import polars as pl
s = """sequence_name,sequence,length
sp|O43236|SEPT4_HUMAN Septin-4 OS=Homo sapiens OX=9606 GN=SEPTIN4 PE=1 SV=1,MDRSLGWQGNSVPEDRTEAGIKRFLEDTTDDGELSKFVKDFSGNASCHPPEAKTWASRPQVPEPRPQAPDLYDDDLEFRPPSRPQSSDNQQYFCAPAPLSPSARPRSPWGKLDPYDSSEDDKEYVGFATLPNQVHRKSVKKGFDFTLMVAGESGLGKSTLVNSLFLTDLYRDRKLLGAEERIMQTVEITKHAVDIEEKGVRLRLTIVDTPGFGDAVNNTECWKPVAEYIDQQFEQYFRDESGLNRKNIQDNRVHCCLYFISPFGHGLRPLDVEFMKALHQRVNIVPILAKADTLTPPEVDHKKRKIREEIEHFGIKIYQFPDCDSDEDEDFKLQDQALKESIPFAVIGSNTVVEARGRRVRGRLYPWGIVEVENPGHCDFVKLRTMLVRTHMQDLKDVTRETHYENYRAQCIQSMTRLVVKERNRNKLTRESGTDFPIPAVPPGTDPETEKLIREKDEELRRMQEMLHKIQKQMKENY,478
sp|O43521|B2L11_HUMAN Bcl-2-like protein 11 OS=Homo sapiens OX=9606 GN=BCL2L11 PE=1 SV=1,MAKQPSDVSSECDREGRQLQPAERPPQLRPGAPTSLQTEPQGNPEGNHGGEGDSCPHGSPQGPLAPPASPGPFATRSPLFIFMRRSSLLSRSSSGYFSFDTDRSPAPMSCDKSTQTPSPPCQAFNHYLSAMASMRQAEPADMRPEIWIAQELRRIGDEFNAYYARRVFLNNYQAAEDHPRMVILRLLRYIVRLVWRMH,198
sp|O60238|BNI3L_HUMAN BCL2/adenovirus E1B 19 kDa protein-interacting protein 3-like OS=Homo sapiens OX=9606 GN=BNIP3L PE=1 SV=1,MSSHLVEPPPPLHNNNNNCEENEQSLPPPAGLNSSWVELPMNSSNGNDNGNGKNGGLEHVPSSSSIHNGDMEKILLDAQHESGQSSSRGSSHCDSPSPQEDGQIMFDVEMHTSRDHSSQSEEEVVEGEKEVEALKKSADWVSDWSSRPENIPPKEFHFRHPKRSVSLSMRKSGAMKKGGIFSAEFLKVFIPSLFLSHVLALGLGIYIGKRLSTPSASTY,219
sp|O95197|RTN3_HUMAN Reticulon-3 OS=Homo sapiens OX=9606 GN=RTN3 PE=1 SV=2,MAEPSAATQSHSISSSSFGAEPSAPGGGGSPGACPALGTKSCSSSCADSFVSSSSSQPVSLFSTSQEGLSSLCSDEPSSEIMTSSFLSSSEIHNTGLTILHGEKSHVLGSQPILAKEGKDHLDLLDMKKMEKPQGTSNNVSDSSVSLAAGVHCDRPSIPASFPEHPAFLSKKIGQVEEQIDKETKNPNGVSSREAKTALDADDRFTLLTAQKPPTEYSKVEGIYTYSLSPSKVSGDDVIEKDSPESPFEVIIDKAAFDKEFKDSYKESTDDFGSWSVHTDKESSEDISETNDKLFPLRNKEAGRYPMSALLSRQFSHTNAALEEVSRCVNDMHNFTNEILTWDLVPQVKQQTDKSSDCITKTTGLDMSEYNSEIPVVNLKTSTHQKTPVCSIDGSTPITKSTGDWAEASLQQENAITGKPVPDSLNSTKEFSIKGVQGNMQKQDDTLAELPGSPPEKCDSLGSGVATVKVVLPDDHLKDEMDWQSSALGEITEADSSGESDDTVIEDITADTSFENNKIQAEKPVSIPSAVVKTGEREIKEIPSCEREEKTSKNFEELVSDSELHQDQPDILGRSPASEAACSKVPDTNVSLEDVSEVAPEKPITTENPKLPSTVSPNVFNETEFSLNVTTSAYLESLHGKNVKHIDDSSPEDLIAAFTETRDKGIVDSERNAFKAISEKMTDFKTTPPVEVLHENESGGSEIKDIGSKYSEQSKETNGSEPLGVFPTQGTPVASLDLEQEQLTIKALKELGERQVEKSTSAQRDAELPSEEVLKQTFTFAPESWPQRSYDILERNVKNGSDLGISQKPITIRETTRVDAVSSLSKTELVKKHVLARLLTDFSVHDLIFWRDVKKTGFVFGTTLIMLLSLAAFSVISVVSYLILALLSVTISFRIYKSVIQAVQKSEEGHPFKAYLDVDITLSSEAFHNYMNAAMVHINRALKLIIRLFLVEDLVDSLKLAVFMWLMTYVGAVFNGITLLILAELLIFSVPIVYEKYKTQIDHYVGIARDQTKSIVEKIQAKLPGIAKKKAE,1032
sp|P10415|BCL2_HUMAN Apoptosis regulator Bcl-2 OS=Homo sapiens OX=9606 GN=BCL2 PE=1 SV=2,MAHAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDVGAAPPGAAPAPGIFSSQPGHTPHPAASRDPVARTSPLQTPAAPGAAAGPALSPVPPVVHLTLRQAGDDFSRRYRRDFAEMSSQLHLTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEYLNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK,239"""
ksize = 24
df = pl.scan_csv(StringIO(s))
df.group_by("sequence").map_groups(
    lambda group_df: group_df.with_columns(kmers=pl.col("sequence").repeat_by("length"))
    .explode("kmers")
    .with_row_index(),
    schema={"index": pl.UInt32, "sequence_name": pl.String, "sequence": pl.String, "length": pl.Int64, "kmers": pl.String},
).with_columns(pl.col("kmers").str.slice("index", ksize)).filter(
    pl.col("kmers").str.len_chars() == ksize
).rename(
    {"index": "start"}
).collect()
This code produces this dataframe:
Is there a more efficient way to do this in Polars? I will be using dataframes with ~250k sequences, each ~100-1000 letters long, so I'd like to do this as low-resource as possible.
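For a rough sense of scale, here is a back-of-the-envelope count of the output rows, assuming every sequence sits at one end of that length range and k = 24 as in the example above. The helper name is mine, and the real total will fall somewhere between the two bounds:

```python
def count_kmers(sequence_length, ksize):
    # A sequence of length n has n - k + 1 k-mers (none if it is shorter than k)
    return max(sequence_length - ksize + 1, 0)


n_sequences = 250_000
ksize = 24

# Lower bound: every sequence is 100 letters long
low_rows = n_sequences * count_kmers(100, ksize)
# Upper bound: every sequence is 1000 letters long
high_rows = n_sequences * count_kmers(1000, ksize)

print(low_rows)   # 19250000
print(high_rows)  # 244250000
```

So the result alone is on the order of tens to hundreds of millions of rows, before counting the intermediate repeated copies.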
Thank you and have a beautiful day!
