Let's split each string into characters (removing the empty strings the split produces), stack everything into a long dataframe, number each character with a cumulative count, drop the positions where both strings agree, and finally build our tuple from what remains.
The key functions here are stack, str.split, explode, groupby.cumcount and duplicated.
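For a runnable setup, here is a minimal reconstruction of the input frame, assuming it looks like the printed output further down (the sequences are taken from that output):

```python
import numpy as np
import pandas as pd

# Assumed input, reconstructed from the printed result below.
df = pd.DataFrame(
    {
        "target": ["AATGGCATC", "AATGATATA", "AATGATGTA"],
        "read": ["AATGGCATG", "AAGGATATA", "CATGATGTA"],
    }
)
```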
s = (
    df.stack()                            # long Series indexed by (row, column)
    .str.split("")                        # -> lists like ['', 'A', 'A', ..., '']
    .explode()                            # one character per row
    .to_frame("words")
    .replace("", np.nan, regex=True)      # the empty strings from the split...
    .dropna()                             # ...are dropped here
)
s["enum"] = s.groupby(level=[0, 1]).cumcount()  # character position within each string
df["diff"] = (
    s.reset_index(0)[
        # keep=False drops every (row, char, position) that appears in both
        # target and read, leaving only the mismatching characters
        ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)
    ]
    .groupby("level_0")
    .agg(words=("words", ",".join), pos=("enum", "first"))
    .agg(tuple, axis=1)
)
print(df)
      target       read      diff
0  AATGGCATC  AATGGCATG  (C,G, 8)
1  AATGATATA  AAGGATATA  (T,G, 2)
2  AATGATGTA  CATGATGTA  (A,C, 0)
print(
    s.reset_index(0)[
        ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)
    ]
)
        level_0 words  enum
target        0     C     8
read          0     G     8
target        1     T     2
read          1     G     2
target        2     A     0
read          2     C     0
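For comparison, if each target/read pair is known to differ in exactly one position, the same tuples can also be produced with a plain Python scan. This zip-based shortcut is my own sketch, not part of the pipeline above, and unlike the groupby version it only reports the first mismatch:

```python
import pandas as pd

# Same assumed input as above, reconstructed from the printed result.
df = pd.DataFrame(
    {
        "target": ["AATGGCATC", "AATGATATA", "AATGATGTA"],
        "read": ["AATGGCATG", "AAGGATATA", "CATGATGTA"],
    }
)

# Walk both strings in lockstep and return the first mismatching pair
# of characters together with its position; assumes each pair of
# sequences differs in at least one place.
df["diff"] = [
    next(
        (f"{t_ch},{r_ch}", i)
        for i, (t_ch, r_ch) in enumerate(zip(t, r))
        if t_ch != r_ch
    )
    for t, r in zip(df["target"], df["read"])
]
```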