The title of this question probably isn't sufficient to describe the problem I'm trying to solve, so hopefully my example gets the point across. I am hoping a Python RegEx is the right tool for the job:

First, we're looking for any one of these strings:

  • CATGTG
  • CATTTG
  • CACGTG

Second, the pattern is:

  • string
  • 6-7 letters
  • string

Example

  • match: CATGTGXXXXXXCACGTG
  • no match: CATGTGXXXCACGTG (because 3 letters between)

Third, when a match is found, begin the next search at the second string of the previous match, so that it can also serve as the first string of the next match. Report the start index of each match.

Example:

  • input (spaces for readability): XXX CATGTG XXXXXX CATTTG XXXXXXX CACGTG XXX

  • workflow (spaces for readability):

    • found match: CATGTG XXXXXX CATTTG
    • it starts at 3

    • resuming search at C in CATTTG

    • found match: CATTTG XXXXXXX CACGTG

    • it starts at 15

and so on...

After a few hours of tinkering, my sorry attempt did not yield what I expected:

regex = re.compile("CATGTG|CATTTG|CACGTG(?=.{6,7})CATGTG|CATTTG|CACGTG")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
   print(m.start(), m.group())

3 CATGTG
15 CATTTG (incorrect)

You're a genius if you can figure this out with a RegEx. Thanks :D

1
  • Can you post what you've tried and the desired output? Do you want a yes/no for the test strings etc. Commented Feb 17, 2017 at 19:33

2 Answers

2

You can use this kind of pattern:

import re

s='XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX'

regex = re.compile(r'(?=(((?:CATGTG|CATTTG|CACGTG).{6,7}?)(?:CATGTG|CATTTG|CACGTG)))\2')

for m in regex.finditer(s):
    print(m.start(), m.group(1))

The idea is to put the whole pattern inside a lookahead, so that it is tested without consuming any characters, and then to use a backreference to consume only the characters that should not be tested again.

The first capture group contains the whole sequence; the second contains the characters up to the next start position (the first string plus the letters between).
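
As a minimal sketch of how the two groups behave, the same pattern can be wrapped in a small helper. The names MOTIF, CHAINED and find_chained are made up for illustration and are not part of the re module.

```python
import re

# Illustrative names; only re.compile and finditer come from the standard library.
MOTIF = r'(?:CATGTG|CATTTG|CACGTG)'

# Group 1 (inside the lookahead): motif + gap + motif, the text to report.
# Group 2: motif + gap, the part that \2 consumes after the lookahead succeeds.
CHAINED = re.compile(r'(?=((' + MOTIF + r'.{6,7}?)' + MOTIF + r'))\2')

def find_chained(text):
    """Return (start, matched_text) pairs; each search resumes at the second motif."""
    return [(m.start(), m.group(1)) for m in CHAINED.finditer(text)]

results = find_chained('XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX')
for start, matched in results:
    print(start, matched)
# 3 CATGTGXXXXXXCATTTG
# 15 CATTTGXXXXXXXCACGTG
```

Because \2 stops right before the second string, finditer starts its next attempt at that string, which is what lets CATTTG end one match and begin the next.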

Note that you can factor out the shared prefix by changing (?:CATGTG|CATTTG|CACGTG) to CA(?:TGTG|TTTG|CGTG), which matches the same three strings.
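
A quick way to convince yourself that the factored form is a drop-in replacement is to run both variants over a few inputs and compare the results. This is only a sanity check on sample strings taken from the question, not a proof; build and hits are hypothetical helper names.

```python
import re

LONG = r'(?:CATGTG|CATTTG|CACGTG)'
SHORT = r'CA(?:TGTG|TTTG|CGTG)'

def build(motif):
    # Same lookahead-plus-backreference shape, parameterised on the motif sub-pattern.
    return re.compile(r'(?=((' + motif + r'.{6,7}?)' + motif + r'))\2')

def hits(pattern, text):
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]

samples = [
    'XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX',
    'CATGTGXXXXXXCACGTG',
    'CATGTGXXXCACGTG',
    'ATTCATGTG123456CATTTGCCG',
]

# True when both variants report identical matches on every sample.
same = all(hits(build(LONG), s) == hits(build(SHORT), s) for s in samples)
print(same)
```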


1 Comment

@WiktorStribiżew: no, with finditer I just add it.
0

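The main issue is that | has the lowest precedence in a regex, so the alternatives must be enclosed in parentheses. Without them, your pattern is read as five separate alternatives (CATGTG, CATTTG, CACGTG(?=.{6,7})CATGTG, CATTTG, CACGTG), which is why it matches each string on its own.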

Assuming from your example that you want only the first matching string, try the following:

import re

regex = re.compile("(CATGTG|CATTTG|CACGTG).{6,7}(?:CATGTG|CATTTG|CACGTG)")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
    print(m.start(), m.group(1))

Note the .group(1), which returns only what was matched by the first set of parentheses, as opposed to .group(), which returns the whole match.
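
To make the difference concrete, here is a small sketch using the same grouped pattern (the variable names are illustrative). It also runs the pattern on the three-string input from the question: because this pattern consumes the second string, finditer cannot reuse it as the start of another match, so only the first start index is reported there.

```python
import re

simple = re.compile(r'(CATGTG|CATTTG|CACGTG).{6,7}(?:CATGTG|CATTTG|CACGTG)')

found = simple.search('ATTCATGTG123456CATTTGCCG')
first_motif = found.group(1)   # text captured by the first set of parentheses
whole = found.group()          # the entire match
print(found.start(), first_motif, whole)
# 3 CATGTG CATGTG123456CATTTG

# The middle CATTTG is consumed by the first match, so the second
# pair (CATTTG ... CACGTG) is not found by this pattern.
chained_input = 'XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX'
starts = [hit.start() for hit in simple.finditer(chained_input)]
print(starts)
# [3]
```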
