Python: regex findall for subcategories?

Question

Following this question, I was thinking of adding one more level of hierarchy to the string. For example, this is my string:

sometext
somemore    text here

some  other text

              course: course1

some details
TestName: test1
some other details
Id              Name                marks
____________________________________________________
1               student1            65
2               student2            75
3               MyName              69
4               student4            43

some details
TestName: test3
some other details
Id              Name                marks
____________________________________________________
1               student1            23
3               MyName              63
4               student4            64


              course: course2

some details
TestName: test2
some other details
Id              Name                marks
____________________________________________________
1               student1            84
2               student3            73

some details
TestName: test5
some other details
Id              Name                marks
____________________________________________________
1               MyName              84
2               student2            73


              course: course4

some details
TestName: test1
some other details
Id              Name                marks
____________________________________________________
1               student1            58
2               student3            89

some details
TestName: test2
some other details
Id              Name                marks
____________________________________________________
1               student1            97
3               MyName              60
8               student6            82

I want to get the details of MyName, with output like (course1,test1,69),(course1,test3,63),(course2,test5,84),(course4,test2,60) or something similar.

I was unable to do it in a single step, and hence came up with this:

import re
eachcourse = re.split(r'course: \w+',string1)
courselist = re.findall(r'course: (\w+)',string1)
li =[]
for i,course in enumerate(courselist):
    match = re.findall(r".*?TestName: (\w+)(?:(?!\bTestName\b).)*MyName\s+(\d+).*?",eachcourse[i+1],re.DOTALL)
    li.append((course,match))
print(li)

which gives me

[('course1', [('test1', '69'), ('test3', '63')]), ('course2', [('test5', '84')]), ('course4', [('test2', '60')])]

Is there a better and cleaner way?

Thanks.

vks · Accepted Answer · 2015-06-04 10:07:34Z

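import re

# x is the input string; split it into (course name, course body) pairs
courses = re.findall(r"\bcourse: (\w+)(.*?)(?=(?:\bcourse:|$))", x, flags=re.DOTALL)

# for each course, collect the (test, marks) pairs recorded for MyName
print([[name] + re.findall(r"TestName: (\w+)(?:(?!\bTestName\b).)*MyName\s*(\d+)", body, flags=re.DOTALL) for name, body in courses])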

You can try this. Though the format is not exactly the same, it is usable.
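If the flat (course, test, marks) shape from the question is needed, the nested result can be unrolled with a second loop. The following is only a sketch that reuses the two patterns from this answer. The function name `flat_results` and the shortened `sample` string are made up for illustration, and the sample assumes the real input follows the layout shown in the question.

```python
import re

# Shortened sample in the question's layout (an assumption, not the real data).
sample = """
              course: course1

some details
TestName: test1
some other details
Id              Name                marks
____________________________________________________
1               student1            65
3               MyName              69

some details
TestName: test3
Id              Name                marks
____________________________________________________
3               MyName              63

              course: course2

TestName: test2
Id              Name                marks
____________________________________________________
1               student1            84
"""

def flat_results(text):
    """Return a flat list of (course, test, marks) tuples for MyName."""
    # Step 1: cut the text into (course name, course body) pairs.
    courses = re.findall(r"\bcourse: (\w+)(.*?)(?=(?:\bcourse:|$))", text, flags=re.DOTALL)
    # Step 2: inside each body, pair every test with MyName's marks, if present.
    test_re = re.compile(r"TestName: (\w+)(?:(?!\bTestName\b).)*MyName\s*(\d+)", flags=re.DOTALL)
    return [(course, test, marks)
            for course, body in courses
            for test, marks in test_re.findall(body)]

print(flat_results(sample))
# [('course1', 'test1', '69'), ('course1', 'test3', '63')]
```

Courses in which MyName never appears (course2 in the sample) simply contribute no tuples.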


5 Comments

Deepa Over a year ago

Wonderful! Thanks a lot! Just one more question: is this approach preferable for very large strings, say 25 pages of data? I notice that the time it takes to return a result depends on the length of the string, of course, and also on the number of occurrences of MyName in it. It varies from 0.05 s to 50 s: for example, 18 occurrences in a 25-page string take 0.05 s, while 1 occurrence takes 50.2 s. I just need advice on whether this is the best possible way.

vks Over a year ago

@Deepa This should work, but regex generally does not give good performance. The best method could be to parse it as CSV or with some other parser :)
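One way to read the "some other parser" suggestion is a hand-written line scanner. This is a swapped-in technique, not code from the answer: it walks the text once, remembers the current course and test, and records a row when the name matches. The names `scan_marks`, `COURSE_RE`, `TEST_RE`, `ROW_RE` and the `demo` string are hypothetical, and the sketch assumes table rows always have the form "Id Name marks" as in the question.

```python
import re

# Small per-line patterns; each one only ever looks at a single line.
COURSE_RE = re.compile(r"^\s*course:\s*(\w+)")
TEST_RE = re.compile(r"^\s*TestName:\s*(\w+)")
ROW_RE = re.compile(r"^\s*\d+\s+(\w+)\s+(\d+)\s*$")  # Id, Name, marks

def scan_marks(text, name="MyName"):
    """Single pass over the lines; returns (course, test, marks) tuples."""
    results = []
    course = test = None
    for line in text.splitlines():
        m = COURSE_RE.match(line)
        if m:
            course, test = m.group(1), None  # a new course resets the test
            continue
        m = TEST_RE.match(line)
        if m:
            test = m.group(1)
            continue
        m = ROW_RE.match(line)
        if m and m.group(1) == name and course and test:
            results.append((course, test, int(m.group(2))))
    return results

demo = "\n".join([
    "              course: course1",
    "some details",
    "TestName: test1",
    "Id              Name                marks",
    "____________________________________________________",
    "1               student1            65",
    "3               MyName              69",
    "",
    "TestName: test3",
    "3               MyName              63",
    "              course: course2",
    "TestName: test2",
    "1               student1            84",
])
print(scan_marks(demo))
# [('course1', 'test1', 69), ('course1', 'test3', 63)]
```

Because every line is examined a fixed number of times, the running time grows with the length of the input rather than with how often the name occurs, which may help with the kind of variation reported in the comment above.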

Deepa Over a year ago

Oh OK, thanks! Just one more clarification, please. Suppose I need to retrieve the details of, say, two students: then I need to repeat this for the second name, right?

vks Over a year ago

@Deepa Yes, right. You can store the name in a variable and build the regex on the fly.
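Building the pattern from a variable might look like the sketch below. The helper name `build_test_pattern` and the `body` string are made up for illustration. Two details differ from the answer's pattern and are assumptions of this sketch: the name is passed through `re.escape` so that characters such as "." are matched literally, and a `\b` is placed before it, which assumes the name starts with a letter, digit, or underscore.

```python
import re

def build_test_pattern(name):
    """Compile the per-test pattern for an arbitrary student name."""
    # re.escape keeps regex metacharacters in the name from being interpreted.
    return re.compile(
        r"TestName: (\w+)(?:(?!\bTestName\b).)*?\b" + re.escape(name) + r"\s+(\d+)",
        flags=re.DOTALL,
    )

# One course body in the question's layout (shortened, hypothetical data).
body = """TestName: test1
1               student1            65
3               MyName              69
TestName: test3
1               student1            23
3               MyName              63
"""

for student in ("MyName", "student1"):
    print(student, build_test_pattern(student).findall(body))
# MyName [('test1', '69'), ('test3', '63')]
# student1 [('test1', '65'), ('test3', '23')]
```

Each name still needs its own pass over the text with this approach, as the comment says.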

Deepa Over a year ago

I am really grateful for all your help. Thanks a lot! :)
