For example, given the two columns

target        read
AATGGCATC     AATGGCATG
AATGATATA     AAGGATATA
AATGATGTA     CATGATGTA

I want to add a column listing the difference as (target character, read character, position):

target        read       differences
AATGGCATC     AATGGCATG  (C,G,8)
AATGATATA     AAGGATATA  (T,G,2)
AATGATGTA     CATGATGTA  (A,C,0)

2 Answers

Let's split each string into its individual characters (dropping the empty strings the split leaves behind) and build a stacked dataframe. From there we can number each character's position with a cumulative count, drop every character that is the same in both columns, and finally build our tuple from what is left.

The key functions here are stack, str.split, explode and duplicated.

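import numpy as np

# one row per character, indexed by (row label, column name);
# the split can leave empty strings, which are turned into NaN and dropped
s = (
    df.stack()
    .str.split("")
    .explode()
    .to_frame("words")
    .replace("", np.nan, regex=True)
    .dropna()
)

# position of each character within its own string
s['enum'] = s.groupby(level=[0,1]).cumcount()

# keep only the (row, character, position) combinations that occur once,
# i.e. the characters that differ between target and read
df["diff"] = (
    s.reset_index(0)[
        ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)
    ]
    .groupby("level_0")
    .agg(words=("words", ",".join), pos=("enum", "first"))
    .agg(tuple, axis=1)
)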

print(df)

     target       read      diff
0  AATGGCATC  AATGGCATG  (C,G, 8)
1  AATGATATA  AAGGATATA  (T,G, 2)
2  AATGATGTA  CATGATGTA  (A,C, 0)

print(s.reset_index(0)[
          ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)])

        level_0 words  enum
target        0     C     8
read          0     G     8
target        1     T     2
read          1     G     2
target        2     A     0
read          2     C     0
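
For comparison, the same position-by-position check can be sketched in plain Python, without pandas. This is only an illustrative alternative, not the method used above: the function name `mismatches` is made up for this example, positions are 0-based, and it assumes both strings have the same length (`zip` silently ignores any extra characters in the longer one).

```python
def mismatches(target, read):
    """Return (target_char, read_char, position) for every differing position."""
    return [
        (t, r, pos)
        for pos, (t, r) in enumerate(zip(target, read))
        if t != r
    ]


# the three rows from the question
pairs = [
    ("AATGGCATC", "AATGGCATG"),
    ("AATGATATA", "AAGGATATA"),
    ("AATGATGTA", "CATGATGTA"),
]
for target, read in pairs:
    print(mismatches(target, read))
# [('C', 'G', 8)]
# [('T', 'G', 2)]
# [('A', 'C', 0)]
```

Applied row-wise, something like `df.apply(lambda row: mismatches(row["target"], row["read"]), axis=1)` would give one list per row, which also covers rows with more than one mismatch.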

2 Comments

Nice answer @Datanovice.
@ShubhamSharma thanks, I just realised I made a mistake and have fixed it. Strange that the OP accepted the answer!

I think this simple function might help you (Keep in mind that this is not a vectorised way of doing it):

import pandas as pd
import difflib as dl

# create a dataframe
# pass the columns as argument to the function below
# df refers to the data frame

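def differences(a,b):
    result=[]  # renamed so the list does not shadow the function name
    for i in range(len(a)):
        # ndiff yields entries such as '  A', '- C', '+ G', one per character
        l=list(dl.ndiff(a[i].strip(),b[i].strip()))
        # characters that appear in only one of the two strings
        temp=[x[2] for x in l if x[0]!=' ' ]
        for x in l:
            if x[0]=='-' or x[0]=='+':
                # index of the entry within the diff output
                temp.append(l.index(x))
        # for a single substitution: (char in a, char in b, index of first change)
        result.append(tuple(temp[:3]))
    return result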

df['differences']=differences(df['target'],df['read'])
print(df)
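
To make the `x[0]` and `x[2]` indexing above easier to follow, here is a small sketch of what `difflib.ndiff` yields when it is given two strings and compares them character by character. Each entry is a two-character prefix (two spaces for unchanged, `- ` for only in the first string, `+ ` for only in the second) followed by the character itself. The variable names are illustrative only.

```python
import difflib

target = "AATGGCATC"
read = "AATGGCATG"

# one entry per character, e.g. '  A' (unchanged), '- C' (removed), '+ G' (added)
diff = list(difflib.ndiff(target, read))

# entries whose prefix marks a removal or an addition
changed = [entry for entry in diff if entry[0] in "-+"]
print(changed)  # ['- C', '+ G']

# index of the first changed entry within the diff output
first_change = next(i for i, entry in enumerate(diff) if entry[0] != " ")
print(first_change)  # 8
```

The index of an entry in the diff output matches the position in the original string only up to the first change, since every `-`/`+` pair takes two slots in the list. That is worth keeping in mind for rows with more than one difference.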
