The title of this question probably isn't sufficient to describe the problem I'm trying to solve, so hopefully my example gets the point across. I am hoping a Python RegEx is the right tool for the job:
First, we're looking for any one of these strings:
- CATGTG
- CATTTG
- CACGTG
Second, the pattern is:
- string
- 6-7 letters
- string
Example:
- match: CATGTGXXXXXXCACGTG
- no match: CATGTGXXXCACGTG (because 3 letters between)
Third, when a match is found, begin the next search at the trailing string of the previous match, inclusive, so that string can also serve as the start of the next match. Report the start index of each match.
Example:
input (spaces for readability): XXX CATGTG XXXXXX CATTTG XXXXXXX CACGTG XXX
workflow (spaces for readability):
- found match: CATGTG XXXXXX CATTTG
- it starts at 3
- resuming search at C in CATTTG
- found match: CATTTG XXXXXXX CACGTG
- it starts at 15
- and so on...
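To make the workflow above concrete, here is a minimal sketch of the behaviour I'm after, written as an explicit loop rather than a single pattern. The names MOTIF, PAIR and find_chained are made up for illustration, and I'm assuming that "letters" can be any character (hence the dot) and that restarting the search at the start of the second string is what "inclusive" should mean.

```python
import re

# Any one of the three target strings (non-capturing group).
MOTIF = r"(?:CATGTG|CATTTG|CACGTG)"

# First string, then 6-7 characters of any kind (assumption: "letters"
# means any character), then a second string, captured so we know
# where it starts.
PAIR = re.compile(MOTIF + r".{6,7}(" + MOTIF + r")")


def find_chained(text):
    """Return (start, matched_text) pairs, resuming each search at the
    first letter of the second string of the previous match."""
    results = []
    pos = 0
    while True:
        m = PAIR.search(text, pos)
        if m is None:
            break
        results.append((m.start(), m.group()))
        # Resume at the second string so it can start the next match.
        pos = m.start(1)
    return results


print(find_chained("XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX"))
# [(3, 'CATGTGXXXXXXCATTTG'), (15, 'CATTTGXXXXXXXCACGTG')]
```

This does what I want for the example input, but it restarts the search by hand instead of letting one pattern report the overlapping matches.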
After a few hours of tinkering, my sorry attempt did not yield what I expected:
import re

regex = re.compile("CATGTG|CATTTG|CACGTG(?=.{6,7})CATGTG|CATTTG|CACGTG")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
    print(m.start(), m.group())
3 CATGTG
15 CATTTG (incorrect)
You're a genius if you can figure this out with a RegEx. Thanks :D