Compare two strings in a pandas DataFrame and show the difference

Question

For example, for the two columns

target        read
AATGGCATC     AATGGCATG
AATGATATA     AAGGATATA
AATGATGTA     CATGATGTA

I want to add a column showing the differences:

target        read       differences
AATGGCATC     AATGGCATG  (C,G,8)
AATGATATA     AAGGATATA  (T,G,2)
AATGATGTA     CATGATGTA  (A,C,0)

Umar.H · Accepted Answer · 2020-06-23 14:43:56Z

2

Let's split each string into its characters (dropping the empty strings the split leaves at both ends) and build a stacked DataFrame. There we can number each character with a cumulative count, discard every character that appears at the same position in both columns, and finally build our tuple.

The key functions here are explode, str.split, stack and duplicated.
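The code in this answer assumes a DataFrame named df that already holds the question's two columns. A minimal sketch of that setup, with the sample values copied from the question, plus a look at the per-character long form the approach relies on. Note the assumptions: it uses .apply(list) rather than .str.split("") to break each string into characters, and the name chars is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "target": ["AATGGCATC", "AATGATATA", "AATGATGTA"],
        "read": ["AATGGCATG", "AAGGATATA", "CATGATGTA"],
    }
)

# Long form: one row per (row label, column name, character).
# .apply(list) is a stand-in for the answer's .str.split("") step.
chars = df.stack().apply(list).explode().to_frame("words")

# 0-based position of each character within its own string.
chars["enum"] = chars.groupby(level=[0, 1]).cumcount()

# The first nine rows are the characters of row 0, column "target".
print(chars.head(9))
```

Each string contributes nine rows here, so comparing the (row, character, position) triples across the two columns is enough to spot the mismatches.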

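import numpy as np

# df holds the string columns "target" and "read"
s = (
    df.stack()            # one entry per (row label, column name)
    .str.split("")        # split into characters, leaving "" at both ends
    .explode()
    .to_frame("words")
    .replace("", np.nan)  # turn the empty edge strings into NaN ...
    .dropna()             # ... and drop them
)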

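# Number the characters within each string (0-based position).
s['enum'] = s.groupby(level=[0,1]).cumcount()

# Keep only the (row, character, position) combinations that are not shared
# by both columns, then collapse each row into a ("target,read", position) tuple.
df["diff"] = (
    s.reset_index(0)[
        ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)
    ]
    .groupby("level_0")
    .agg(words=("words", ",".join), pos=("enum", "first"))
    .agg(tuple, axis=1)
)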

print(df)

     target       read      diff
0  AATGGCATC  AATGGCATG  (C,G, 8)
1  AATGATATA  AAGGATATA  (T,G, 2)
2  AATGATGTA  CATGATGTA  (A,C, 0)

print(s.reset_index(0)[
          ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)])

        level_0 words  enum
target        0     C     8
read          0     G     8
target        1     T     2
read          1     G     2
target        2     A     0
read          2     C     0
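For equal-length strings such as the ones in the question, a similar result can be sketched without any reshaping, by walking both strings in parallel. This is a plain-Python alternative, not the method used above. The helper name char_diffs is hypothetical, and it returns every mismatch in a row as a list rather than a single tuple.

```python
import pandas as pd


def char_diffs(target, read):
    """Return (target_char, read_char, position) for each mismatch.

    Assumes both strings have the same length; zip silently ignores
    any extra characters in the longer string.
    """
    return [
        (t, r, pos)
        for pos, (t, r) in enumerate(zip(target, read))
        if t != r
    ]


# Sample data copied from the question.
df = pd.DataFrame(
    {
        "target": ["AATGGCATC", "AATGATATA", "AATGATGTA"],
        "read": ["AATGGCATG", "AAGGATATA", "CATGATGTA"],
    }
)

# One list of (target_char, read_char, position) tuples per row.
df["diff"] = [char_diffs(t, r) for t, r in zip(df["target"], df["read"])]
print(df)
```

Like the second answer, this loops in Python rather than using vectorised operations, which is usually acceptable for short strings but may be slow on very large frames.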

edited Jun 23, 2020 at 14:43

answered Jun 23, 2020 at 10:22


2 Comments

Shubham Sharma Over a year ago

Nice answer @Datanovice.

Umar.H Over a year ago

@ShubhamSharma thanks, I just realised I made a mistake and have fixed it. Strange, as the OP accepted the answer!

Sahaj Adlakha · Accepted Answer · 2020-06-23 10:57:45Z

1

I think this simple function might help you (keep in mind that it is not a vectorised approach):

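import difflib as dl

import pandas as pd

# df is an existing DataFrame with the string columns 'target' and 'read';
# pass those two columns to the function below.

def differences(a, b):
    """Return one tuple per row: the changed characters, then the diff index."""
    result = []
    # zip pairs the values directly, so this works with any DataFrame index
    for target, read in zip(a, b):
        diff = list(dl.ndiff(target.strip(), read.strip()))
        # ndiff entries look like '- C' or '+ G'; x[2] is the character
        temp = [x[2] for x in diff if x[0] != ' ']
        # enumerate gives the position of each changed entry in the diff list
        # (list.index would return the first equal entry instead)
        for pos, x in enumerate(diff):
            if x[0] in '-+':
                temp.append(pos)
        # keep (removed char, added char, first position)
        result.append(tuple(temp[:3]))
    return result

df['differences'] = differences(df['target'], df['read'])
print(df)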
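One caveat with the function above: the number it stores is an index into the ndiff output, which matches the position in the string only up to the first change. A hedged sketch of a variant that reads offsets directly from difflib.SequenceMatcher.get_opcodes() is shown below. The helper name diff_positions is hypothetical, and the returned start is an offset into the target string.

```python
import difflib


def diff_positions(target, read):
    """Return (target_part, read_part, start) for each non-matching block.

    start is the 0-based offset of the block within target.
    """
    # autojunk=False disables the heuristic that treats very frequent
    # elements as junk in long sequences.
    matcher = difflib.SequenceMatcher(None, target, read, autojunk=False)
    return [
        (target[i1:i2], read[j1:j2], i1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Sample strings copied from the question.
print(diff_positions("AATGGCATC", "AATGGCATG"))
print(diff_positions("AATGATATA", "AAGGATATA"))
print(diff_positions("AATGATGTA", "CATGATGTA"))
```

Because opcodes describe replaced, inserted and deleted blocks, this version also copes with strings of different lengths, at the cost of returning substrings rather than single characters when several adjacent characters differ.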

