4

I'm trying to create a column of microsatellite motifs in a pandas dataframe. I have one column that gives the length of the motif and another that has the whole microsatellite.

Here's an example of the columns of interest.

     motif_len    sequence
0    3            ATTATTATTATT
1    4            ATCTATCTATCT
2    3            ATCATCATCATC

I would like to slice the values in sequence using the values in motif_len to give a single repeat(motif) of each microsatellite. I'd then like to add all these motifs as a third column in the data frame to give something like this.

     motif_len    sequence        motif
0    3            ATTATTATTATT    ATT
1    4            ATCTATCTATCT    ATCT
2    3            ATCATCATCATC    ATC

I've tried a few things with no luck.

>>df['motif'] = df.sequence.str[:df.motif_len]
>>df['motif'] = df.sequence.str[:df.motif_len.values]

Both make the motif column but all the values are NaN.

I think I understand why these don't work. I'm passing a series/array as the upper index in the slice rather than the a value from the mot_len column.

I also tried to create a series by iterating through each Any ideas?

1 Answer 1

4

You can call apply on the df pass axis=1 to apply row-wise and use the column values to slice the str:

In [5]:
df['motif'] = df.apply(lambda x: x['sequence'][:x['motif_len']], axis=1)
df

Out[5]:
   motif_len      sequence motif
0          3  ATTATTATTATT   ATT
1          4  ATCTATCTATCT  ATCT
2          3  ATCATCATCATC   ATC
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you! Provides the exact result I was looking for.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.