Separating a text document by specific lines of text using Python

Question

I'm writing a Python function that takes a chunk of text, read from a text file with f.read(), and splits it into sections. The text contains dividers, and I want to split it at exactly those locations. Below is an example of the text file in question.

@model:2.4.0=Skeleton "Skeleton"
@compartments
 Cell=1.0 "Cell"
@species
 Cell:[A]=100.0 "A"
 Cell:[B]=1.0 "B"
 Cell:[C]=0.0 "C"
 Cell:[D]=0.0 "D"
@parameters
kcat=4000
km = 146
v2_k = 88
@reactions
@r=v1 "v1"
 A -> C : B
 Cell * kcat * B * A / (km + A) 
@r=v2 "v2"
 C -> C+D
 Cell * v2_k * C

My desired output is a Python dictionary with the divider names as keys and, as values, all the content between each divider and the next. For example, the '@model' entry of the sections dictionary should be:

sections['@model']=:2.4.0=Skeleton "Skeleton"

Current Code

def split_sections(SBshorthand_file):
    '''
    Takes a SBshorthand file and returns a dictionary of each of the sections. 
    Keys of the dictionary are the dividers.
    Values of dictionary are the content between dividers. 
    '''
    SBfile=parse_SBshorthand_read(SBshorthand_file) #simple parsing function. uses f.read()
    dividers=["@model", "@units", "@compartments", "@species", "@parameters", "@rules", "@reactions", "@events"]
    sections={}
    for i in  dividers:
        pattern=re.compile(i)
        if re.findall(pattern,SBfile) == []:
            pass
#            print 'Section \'{}\' not present in {}'.format(i,SBshorthand_file)
        else:
            SBfile2=re.sub(pattern,'\n'+i,SBfile)
            print SBfile2

This, however, does not do what I want. Does anybody have any ideas on how to fix my code? Thanks.

-----------------Edit--------------------

Please note that the '@reactions' section contains a number of reactions, all of which start with @r. They all need to be grouped under the '@reactions' key.

vks · Accepted Answer · 2015-10-19 04:38:43Z

1

import re

x="""@model:2.4.0=Skeleton "Skeleton"
@compartments
Cell=1.0 "Cell"
@species
Cell:[A]=100.0 "A"
Cell:[B]=1.0 "B"
Cell:[C]=0.0 "C"
Cell:[D]=0.0 "D"
@parameters
kcat=4000
km = 146
v2_k = 88
@reactions
@r=v1 "v1"
A -> C : B
Cell * kcat * B * A / (km + A)
@r=v2 "v2"
C -> C+D
Cell * v2_k * C"""


# print(...) with parentheses runs on both Python 2 and Python 3
print(dict(re.findall(r"(?:^|(?<=\n))(@\w+)([\s\S]*?)(?=\n@(?!r\b)\w+|$)",x)))

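You can use re.findall directly: each match yields a (divider, content) pair, and dict() turns those pairs into the dictionary you want. The negative lookahead (?!r\b) keeps lines starting with @r from being treated as dividers, so every reaction stays under @reactions.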
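The pattern in this answer is dense, so the sketch below spells it out with re.VERBOSE and wraps it in a function. It is only an illustration, assuming Python 3: the names SECTION_RE and sample are made up here, split_sections takes the file contents as a string rather than a file name, and the .strip() call on each value is an addition that the answer itself does not make.

```python
import re

# Verbose form of the pattern from the answer; with re.VERBOSE, whitespace
# and "#" comments inside the pattern are ignored by the regex engine.
SECTION_RE = re.compile(
    r"""
    (?:^|(?<=\n))      # start of the text, or just after a newline
    (@\w+)             # group 1: the divider, e.g. @model or @species
    ([\s\S]*?)         # group 2: everything after it, as little as possible
    (?=                # stop just before ...
        \n@(?!r\b)\w+  #   a new line starting with a divider other than @r
        | $            #   or the end of the text
    )
    """,
    re.VERBOSE,
)


def split_sections(text):
    """Return {divider: content}, with surrounding whitespace stripped."""
    return {key: value.strip() for key, value in SECTION_RE.findall(text)}


# Shortened version of the file from the question (hypothetical sample).
sample = '''@model:2.4.0=Skeleton "Skeleton"
@compartments
 Cell=1.0 "Cell"
@parameters
kcat=4000
@reactions
@r=v1 "v1"
 A -> C : B
@r=v2 "v2"
 C -> C+D'''

sections = split_sections(sample)
print(sections["@model"])
print(sections["@reactions"])
```

Because (?!r\b) only rejects an r that is followed by a word boundary, a divider such as @rules is still recognised as a section of its own, while @r=v1 and @r=v2 remain inside the @reactions value.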

edited Oct 19, 2015 at 4:38

answered Oct 15, 2015 at 12:00


1 Comment

CiaranWelsh Over a year ago

Sorry, I accidentally edited your post instead of the question, and it won't let me change it back. Your code works, but would you happen to know how to exclude the '@r' tag (as I've explained in the edit)? Thanks.

ergonaut · Accepted Answer · 2015-10-15 12:02:48Z

1

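You can use capture groups as follows:

re.findall(r"(?s)(@\w+)[\s:]\s*(.*?)(?=\n@|$)", text)

where capture group 1 matches the key, capture group 2 matches the value, and text is the contents of the file. This pattern treats every line that starts with @ as a divider, so the @r entries are not all grouped under @reactions.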
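To make the role of the two capture groups concrete, here is a small sketch, assuming Python 3. The text and the pattern are deliberately simplified and are not the ones from this answer: the pattern only captures the first line after each divider. The point is the shape of the result that re.findall returns.

```python
import re

# Hypothetical two-section input, much smaller than the real file.
text = "@species\n Cell:[A]=100.0\n@parameters\nkcat=4000"

# With two capture groups, re.findall returns a list of 2-tuples:
# (text matched by group 1, text matched by group 2).
pairs = re.findall(r"(@\w+)\n(.*)", text)
print(pairs)

# dict() accepts any iterable of (key, value) pairs.
sections = dict(pairs)
print(sections["@parameters"])
```

The same list-of-tuples shape is what makes it possible to pass the result of re.findall straight to dict().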

answered Oct 15, 2015 at 12:02
