My code consists of four lists: splitinputString1, splitinputString2, splitinputString3, and mainlistsplit. mainlistsplit is much longer because it contains every possible 4-letter string over the letters A, C, T, and G. The other three lists each hold the overlapping 4-letter substrings of a predetermined 10-letter input string.

My goal is to find the 4-letter strings in mainlistsplit that occur in all three input strings at the same time. I also have to allow for a mismatch of at most one letter, for example ACTG in mainlistsplit and ACTC in one of the input strings.

I have tried writing is_close_match(), but I am sure I am missing something small in my code; I am just not sure what it is.

My question is: how should I go about comparing these lists of strings, finding the strings that match with at most one mismatch, and returning and printing them?

import itertools

# Creates 3 lists, one with each of the input strings
lst = ['A', 'C', 'T', 'G', 'A', 'C', 'G', 'C', 'A', 'G']
lst2 = ['T', 'C', 'A', 'C', 'A', 'A', 'C', 'G', 'G', 'G']
lst3 = ['G', 'A', 'G', 'T', 'C', 'C', 'A', 'G', 'T', 'T']

mainlist = ['A', 'C', 'T', 'G']
# Build every possible length-4 string over the letters in mainlist
mainlistsplit = [''.join(i) for i in itertools.product(mainlist, repeat=4)]


# lists for the input strings when they are split
splitinputString1 = []
splitinputString2 = []
splitinputString3 = []

sequence_size = 4

# Takes the first 4 values of my lst, lst2, lst3, appends it to my split input strings, then increases the sequence by 1
for i in range(len(lst) - sequence_size + 1):
    sequence = ''.join(lst[i: i + 4])
    splitinputString1.append(sequence)

for i in range(len(lst2) - sequence_size + 1):
    sequence = ''.join(lst2[i: i + 4])
    splitinputString2.append(sequence)

for i in range(len(lst3) - sequence_size + 1):
    sequence = ''.join(lst3[i: i + 4])
    splitinputString3.append(sequence)

found = []


def is_close_match(mainlistsplit, s2):
    mismatches = 0
    for i in range(0, len(mainlistsplit)):
        if mainlistsplit[i] != s2[i]:
            mismatches += 1
        else:
            found = ''.join(s2)

    if mismatches > 1:
        return False
    else:
        return True
  • What's the question? Commented Aug 29, 2019 at 3:28
  • How to ask a good question Commented Aug 29, 2019 at 3:37
  • @drum Last lines before code Commented Aug 29, 2019 at 3:45

2 Answers

If I've got the question right, you could check if two strings are close with something like this:

def is_close_match(string1, string2):
  # 'string1' and 'string2' are assumed to have same length.
  return [c1 == c2 for c1, c2 in zip(string1, string2)].count(False) <= 1

It counts the positions where the characters are not equal and returns True when there is at most one such position.

# 1 difference
print(is_close_match('ACTG', 'ACTC'))
# True

# no differences
print(is_close_match('ACTG', 'ACTG'))
# True

# 2 differences
print(is_close_match('ACTG', 'AGTC'))
# False
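
If you later need a different tolerance, the same idea generalises. The following is only a sketch: the names hamming_distance and within_mismatches and the max_mismatches parameter are made up for illustration and do not appear in the question.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def within_mismatches(a, b, max_mismatches=1):
    """True when a and b differ in at most max_mismatches positions."""
    return hamming_distance(a, b) <= max_mismatches


print(within_mismatches('ACTG', 'ACTC'))                    # one difference
print(within_mismatches('ACTG', 'AGTC'))                    # two differences
print(within_mismatches('ACTG', 'AGTC', max_mismatches=2))  # looser tolerance
```

Raising an error on unequal lengths is a design choice here; zip would otherwise silently ignore the extra characters.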

Then you can use is_close_match to filter your input lists and check that every filtered list has at least one element:

allLists = (
  splitinputString1,
  splitinputString2,
  splitinputString3,
)

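for code in mainlistsplit:
  # Build real lists: in Python 3 a bare filter object is always truthy,
  # so all(matches) would pass even when nothing matched.
  matches = [[x for x in inputList if is_close_match(x, code)]
             for inputList in allLists]
  if all(matches):
    print('Found {}: {}'.format(code, matches))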
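
For reference, the whole pipeline can be put together as one short, self-contained sketch. The helper names kmers and common_close_codes are made up for this illustration and are not from the question. The sketch also assumes the input sequences are plain strings rather than lists of characters, and it stores each filtered result as a list so that an empty result counts as "no match".

```python
from itertools import product


def kmers(seq, k=4):
    """Return every overlapping length-k window of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def is_close_match(a, b, max_mismatches=1):
    """True when equal-length strings differ in at most max_mismatches positions."""
    return sum(x != y for x, y in zip(a, b)) <= max_mismatches


def common_close_codes(sequences, k=4, alphabet="ACGT"):
    """Map each code to its close windows, keeping codes that hit every sequence."""
    windows = [kmers(s, k) for s in sequences]
    result = {}
    for code in map("".join, product(alphabet, repeat=k)):
        hits = [[w for w in ws if is_close_match(code, w)] for ws in windows]
        if all(hits):  # every sequence contributed at least one window
            result[code] = hits
    return result


# The three input strings from the question, written as plain strings.
sequences = ["ACTGACGCAG", "TCACAACGGG", "GAGTCCAGTT"]
found = common_close_codes(sequences)
for code, hits in sorted(found.items()):
    print(code, hits)
```

With the three sequences from the question, TCAG is one of the reported codes: it is one letter away from GCAG, TCAC and CCAG in the first, second and third input respectively.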

Check this out.

import itertools
import difflib

# Creates 3 lists, one with each of the input strings
lst = ['A', 'C', 'T', 'G', 'A', 'C', 'G', 'C', 'A', 'G']
lst2 = ['T', 'C', 'A', 'C', 'A', 'A', 'C', 'G', 'G', 'G']
lst3 = ['G', 'A', 'G', 'T', 'C', 'C', 'A', 'G', 'T', 'T']

mainlist = ['A', 'C', 'T', 'G']
# Build every possible length-4 string over the letters in mainlist
mainlistsplit = [''.join(i) for i in itertools.product(mainlist, repeat=4)]


# lists for the input strings when they are split
splitinputString1 = []
splitinputString2 = []
splitinputString3 = []

sequence_size = 4

# Takes the first 4 values of my lst, lst2, lst3, appends it to my split input strings, then increases the sequence by 1
for i in range(len(lst) - sequence_size + 1):
    sequence = ''.join(lst[i: i + 4])
    splitinputString1.append(sequence)

for i in range(len(lst2) - sequence_size + 1):
    sequence = ''.join(lst2[i: i + 4])
    splitinputString2.append(sequence)

for i in range(len(lst3) - sequence_size + 1):
    sequence = ''.join(lst3[i: i + 4])
    splitinputString3.append(sequence)


def is_close_match(mainlistitem, lists):
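    """
    Collect the items of the main list that match, fully or partially,
    an item in any of the sub lists.
    :param mainlistitem: list of candidate 4-letter strings
    :param lists: list of lists of 4-letter substrings, one per input string
    :return: set of matching strings from mainlistitem
    """
    found = []
    partial_matched = []

    # Getting the partially matched items for each 4-letter string, using a
    # difflib similarity ratio of at least 0.75. Note that difflib aligns
    # matching blocks rather than comparing position by position, and
    # get_close_matches returns at most 3 candidates per query by default.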
    for group in lists:
        partial_matched.extend(list(map(lambda x: difflib.get_close_matches(x, mainlistitem, cutoff=0.75), group)))
    found.extend(list(itertools.chain.from_iterable(partial_matched)))

    # Getting fully matched items from the 4 letter main string list.
    found.extend([i for group in lists for i in mainlistitem if i in group])
    return set(found)  # removing the duplicate matches in both cases


matching_list = is_close_match(mainlistsplit, [splitinputString1, splitinputString2, splitinputString3])
print(matching_list)
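
One caveat is worth checking before relying on difflib for this task: its similarity ratio is not the same thing as a count of position-by-position mismatches. The small sketch below illustrates the difference on a single pair; positional_mismatches is a helper name made up for this example.

```python
import difflib


def positional_mismatches(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


a, b = "ACTG", "CTGA"

# difflib finds the shared block "CTG", so the ratio is 2 * 3 / 8.
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(ratio)

# Compared position by position, every letter differs.
print(positional_mismatches(a, b))

# The pair therefore passes a 0.75 cutoff despite four positional mismatches.
print(difflib.get_close_matches(a, [b], cutoff=0.75))
```

If "at most 1 mismatch" is meant position by position, as in the ACTG versus ACTC example from the question, a direct character comparison is likely the safer choice.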
