
I have a GenBank file (.gbk) from which I want to extract certain genes. My problem is the following: in order to process the file, the header of each locus must be in a specific format, and it is not in my file. I want to parse the file and replace the headers as follows:

LOCUS       NODE_1_length_393688_cov_17.8554393688 bp   DNA linear
BCT22-MAY-2017
DEFINITION  Escherichia coli strain strain.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
....
>>Gene data here
....

LOCUS       NODE_2_length_278889_cov_17.85545278889 bp   DNA linear
BCT22-MAY-2017
DEFINITION  Escherichia coli strain strain.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
....
>>Gene data here
....

LOCUS       NODE_3_length_340008_cov_17.855432340008 bp   DNA linear
BCT22-MAY-2017
DEFINITION  Escherichia coli strain strain.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
....
>>Gene data here
....

The string beginning with NODE is too long for the file format convention and needs to be shortened so that it looks like this:

LOCUS       NODE_1_393688 bp   DNA linear
....
LOCUS       NODE_2_278889 bp   DNA linear
....
LOCUS       NODE_3_340008 bp   DNA linear

The part that needs to be cut out is not necessarily of the same length, so a fixed approach that removes everything between certain positions of the string is not feasible. I have tried different approaches using re.compile() and re.sub() but have not been successful so far.

Any help would be highly appreciated. Thank you for your time!

1 Answer

When you read the LOCUS line, you can split it into fields and normalize the "node" field, as follows:

import operator

def normalize_name(name):
    parts = name.split("_")
    return "_".join(operator.itemgetter(0, 1, 3)(parts))

It splits the name into parts on "_", giving a list. Then operator.itemgetter(0, 1, 3), applied to parts, extracts the items at indices 0, 1 and 3, skipping index 2 (the "length" label) and the trailing coverage part.

For instance:

for name in [
    "NODE_1_length_393688_cov_17.8554393688",
    "NODE_2_length_278889_cov_17.85545278889",
    "NODE_3_length_340008_cov_17.855432340008"
    ]:
    print(normalize_name(name))

You get:

NODE_1_393688
NODE_2_278889
NODE_3_340008
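
If you prefer the re.sub() approach mentioned in the question, an equivalent substitution might look like this (a rough sketch, assuming the names always follow the NODE_<n>_length_<len>_cov_<cov> pattern; normalize_name_re is just an illustrative name):

import re

def normalize_name_re(name):
    # Keep the NODE index and the length; drop the "length" label and the
    # whole "_cov_..." suffix.
    return re.sub(r"_length_(\d+)_cov_[\d.]+$", r"_\1", name)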

Demo

import operator
import textwrap


get_parts = operator.itemgetter(0, 1, 3)


def normalize_name(name):
    parts = name.split("_")
    return "_".join(get_parts(parts))


def normalize_header(header):
    # Split the LOCUS line on whitespace, shorten the name field
    # and rebuild the line with fixed-width columns.
    fields = header.split()
    fields[1] = normalize_name(fields[1])
    return "{0:<11} {1} {2:<4} {3} {4}".format(*fields)


content = textwrap.dedent("""\
LOCUS       NODE_1_length_393688_cov_17.8554393688 bp   DNA linear
BCT22-MAY-2017
DEFINITION  Escherichia coli strain strain.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
....
>>Gene data here
....
""")

for line in content.splitlines():
    if line.startswith("LOCUS"):
        line = normalize_header(line)
    print(line)

1 Comment

OK thanks, that sounds good. I just need to put this into a loop that parses multiple files in different subdirectories and amends every line beginning with 'LOCUS' in each file. To do that, instead of filling the text into content = textwrap.dedent(...), I need to read in the whole file itself. Can textwrap.dedent do that?
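
textwrap.dedent is only used above to embed the sample data in the script; when reading from a file you can drop it and read the text directly. A rough sketch of how that loop might look (assuming the answer's normalize_header is available, that the files end in .gbk, and that overwriting each file in place is acceptable; top_dir is a placeholder path):

import os

top_dir = "path/to/genomes"  # placeholder for the directory tree to scan

for root, dirs, files in os.walk(top_dir):
    for filename in files:
        if not filename.endswith(".gbk"):
            continue
        path = os.path.join(root, filename)
        with open(path) as handle:
            lines = handle.read().splitlines()
        # Rewrite only the LOCUS lines, keep everything else untouched.
        fixed = [normalize_header(line) if line.startswith("LOCUS") else line
                 for line in lines]
        with open(path, "w") as handle:
            handle.write("\n".join(fixed) + "\n")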
