1

So I'm trying to create a regex that detects any combination of the characters 'ACTG' and accepts it as valid, while anything else, including a combination of 'ACTG' and some other characters, is invalid.

Ultimately, I'll take it out of the while loop; that's just for testing purposes. Right now I believe that as long as the input starts with A, C, T, or G, it says it's valid.

Is there a function in regex that would be better suited than match?

import re
while (True):
    DnaString = str(input('enter your polynucleotide chain code hooblah'))
    if (re.match('([ACTG]+[ACTG]*)', DnaString, flags=0)):
        #if re.search('^ACTG', DnaString) != -1: 
            print ("valid chain.")
    else: #(re.search('^[ACTG]+[ACTG]*$', DnaString) == -1):
        print("invalid chain, please check your input.")

    if (DnaString.find("end") != -1):
        print("ohokaybye.")
        break
1
  • Does your code do what you want it to do? Are you having trouble using match()? Commented Mar 8, 2016 at 4:00

4 Answers 4

3

Why not just

if all(c in 'ACGT' for c in DnaString):
    # Do success
    print("valid chain.")
else:
    # Do failure
    print("invalid chain, please check your input.")

2 Comments

so much simpler! thank you. all is still a reg ex function?
No, all is a Python built-in, not a regex function. It returns True if every value in the iterable (in this case, the generator (c in 'ACGT' for c in DnaString)) evaluates to true; otherwise it returns False.
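
One detail worth knowing about the all approach: all returns True for an empty iterable, so an empty input string would be reported as valid. A minimal sketch that rejects empty input, assuming that is the desired behaviour (the function name only_bases and the alphabet parameter are made up for illustration):

```python
def only_bases(sequence: str, alphabet: str = 'ACGT') -> bool:
    """Return True if sequence is non-empty and uses only characters in alphabet."""
    # bool(sequence) is False for '', which all() alone would accept
    return bool(sequence) and all(c in alphabet for c in sequence)


print(only_bases('GATTACA'))   # True
print(only_bases('GATTACAX'))  # False
print(only_bases(''))          # False
```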
2

Your problem is that re.match only anchors the pattern at the start of the string: it checks that the input begins with one or more ACTG characters, without specifying that nothing else is permitted after them. If you change your regex to ^[ACTG]+$ then it will work as expected. The ^ and $ characters are anchors which match the start and end of the string, respectively (in Python, $ also matches just before a single trailing newline).

So the regex above matches a string which contains one or more of the four characters and doesn't allow any other characters either before or after them.
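
As for a function better suited than match: Python 3.4 and later provide re.fullmatch, which only succeeds when the whole string matches, so no anchors are needed. A minimal sketch (the names DNA_PATTERN and is_valid_dna are made up for illustration):

```python
import re

# Same character class as in the answer; fullmatch supplies the anchoring
DNA_PATTERN = re.compile(r'[ACTG]+')


def is_valid_dna(sequence: str) -> bool:
    """Return True only if the entire string consists of A, C, T, G."""
    return DNA_PATTERN.fullmatch(sequence) is not None


print(is_valid_dna('ACTGGTCA'))  # True
print(is_valid_dna('ACTGx'))     # False
print(is_valid_dna('ACTG\n'))    # False
```

Unlike ^[ACTG]+$ with re.match, fullmatch also rejects a string that ends with a newline, which may matter if the input is read from a file rather than from input().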

3 Comments

i had tried that in the beginning! let me try again. Thanks!
edit: tried it, now works! not sure what changed in the middle. now i have : (re.match('^([ACTG]+[ACTG]*)$', DnaString, flags=0))
That works, although the [ACTG]* isn't doing anything. The [ACTG]+ is a greedy match for one or more of the four characters, which means that it matches as many as possible. After it has matched as many as possible, the [ACTG]* tries to match zero or more characters, finds zero (since they've already been matched by the +) and finishes. So ^([ACTG]+)$ would work just as well.
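
A quick way to convince yourself that the trailing [ACTG]* is redundant is to run both patterns over a handful of inputs and compare the results. This is only a sanity check on a few hand-picked sample strings, not a proof:

```python
import re

with_star = re.compile(r'^([ACTG]+[ACTG]*)$')
without_star = re.compile(r'^([ACTG]+)$')

# Hand-picked samples: valid, empty, and invalid in different positions
samples = ['ACTG', 'A', '', 'ACTGX', 'xACTG', 'GATTACA']

for s in samples:
    a = bool(with_star.match(s))
    b = bool(without_star.match(s))
    # Both patterns should agree on every sample
    print(repr(s), a, b)
```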
1

If you allow a match to internally repeat acceptable characters, then this might be what you want:

'[A|C|T|G]{4}'

1 Comment

Thanks! this seems to do the same as what i have now. If it's in the first four indexes, then it works, but if another character is used later, it doesn't detect that it's invalid
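
The behaviour described in that comment follows from two properties of the pattern: inside a character class the | is a literal character rather than alternation, and {4} asks for exactly four characters while match() ignores whatever comes after them. A short sketch showing both effects:

```python
import re

pattern = re.compile(r'[A|C|T|G]{4}')

# Inside [...] the | is a literal character, so pipes are accepted
print(bool(pattern.match('A|C|')))     # True
# Only the first four characters are checked; the rest is ignored
print(bool(pattern.match('ACTGxyz')))  # True
# Fewer than four characters fails, even though the input is valid DNA
print(bool(pattern.match('ACT')))      # False
```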
0

Without regular expressions:

  • Expanding on the original code-only answer.
  • Use the membership operator, in
    • Each test evaluates to True or False
  • Use a generator expression, e.g. (base in bases for base in sequence)
    • Creates a generator of True or False values
  • Use the built-in Python function all
    • Checks the generator; if the values are all True, True is returned, otherwise False.
bases = 'acgt'
sequence = (input('Input DNA sequence: ')).lower()

if all(base in bases for base in sequence):
    print('Input is correct')
else:
    print('Only allowed characters are A, T, C, G')

Output:

Input DNA sequence:  atcgggggcccccttttaaaa
Input is correct

Input DNA sequence:  atcgggggcccccttttaaaaf
Only allowed characters are A, T, C, G

Write a function:

  • Considering the length of a DNA sequence, realistically, no one is going to type one in.
def check_sequence(sequence: str):
    sequence = sequence.lower()
    bases = 'acgt'
    if all(base in bases for base in sequence):
        print('Input is correct')
    else:
        print('Only allowed characters are A, T, C, G')

my_sequence = 'gcaatgcAttGtgaaagagccGcTaCaacctaaacGctgcacgtcacctagagtgtCttgcgggTgaggccctctcgAacagattacagtaccgttatc'

check_sequence(my_sequence)

>>> Input is correct

Function that returns the sequence if it's not correct:

  • Use zip to combine iterables
from typing import Optional

def check_sequence(sequence: str) -> Optional[list]:
    # Split the lowercased input into individual characters
    sequence = list(sequence.lower())
    bases = 'acgt'
    # One True/False flag per character
    matches = [base in bases for base in sequence]

    # Pair each character with its flag
    sequence_check = list(zip(sequence, matches))

    if all(matches):
        print('Input is correct')
        return None
    else:
        print('Only allowed characters are A, T, C, G')
        return sequence_check

Usage:

my_sequence = 'GcaatGcatfftgtgaaagAg'

verified_sequence = check_sequence(my_sequence)

print(verified_sequence)

# Output:
[('g', True),
 ('c', True),
 ('a', True),
 ('a', True),
 ('t', True),
 ('g', True),
 ('c', True),
 ('a', True),
 ('t', True),
 ('f', False),
 ('f', False),
 ('t', True),
 ('g', True),
 ('t', True),
 ('g', True),
 ('a', True),
 ('a', True),
 ('a', True),
 ('g', True),
 ('a', True),
 ('g', True)]
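
For long sequences the full list of pairs can be hard to scan. A possible variation, sketched here with a made-up function name invalid_positions, returns only the offending characters together with their zero-based index:

```python
def invalid_positions(sequence: str, bases: str = 'acgt') -> list:
    """Return (index, character) pairs for every character not in bases."""
    # enumerate supplies the zero-based position of each character
    return [(i, ch) for i, ch in enumerate(sequence.lower()) if ch not in bases]


print(invalid_positions('GcaatGcatfftgtgaaagAg'))  # [(9, 'f'), (10, 'f')]
```

An empty list means the sequence contains only allowed bases.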
