Is there a better way to parse a file in Python?

Question

I am looking for a better way to parse a huge file. Here is an example of the file:

sample.txt

'abcdefghi'
'xyzwfg'
'lmnop'

I want to check that 'abc' and 'xyz' each occur at least once in the file.

I was able to find them, but I am looking for a better way. Here is my code:

datafile = file('sample.txt')
abc = 0
xyz = 0
found = True

for line in datafile:
    if 'abc' in line:
        abc += 1
        break
for line in datafile:
    if 'xyz' in line:
        xyz += 1
        break

if (abc + xyz) >= 2:
    print 'found'
else:
    print 'fail'

I am running a loop twice. So is there a better way to parse the file?

Do you care about the total number of occurrences found? Your use of a counter instead of a true/false flag suggests yes, but the use of break suggests no.
– John Gordon, Commented Feb 15, 2016 at 21:04

Steven Rumbalski · Accepted Answer · 2016-02-15 21:05:06Z

2

Your current code will produce incorrect results if 'xyz' occurs before 'abc', because the second loop resumes reading where the first one stopped. To fix this, test for both in the same loop.

with open('sample.txt') as datafile:
    abc_found = False
    xyz_found = False

    for line in datafile:
        if 'abc' in line:
            abc_found = True
        if 'xyz' in line:
            xyz_found = True
        if abc_found and xyz_found: 
            break # stop looking if both found
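
The same single-pass idea can be wrapped in a reusable function. The following is only a sketch, assuming Python 3; the name contains_all and the temporary sample file are hypothetical and not part of the original answer. It deliberately places 'xyz' before 'abc' to show that the order of the lines no longer matters.

```python
import os
import tempfile


def contains_all(path, needles):
    """Return True if every string in needles occurs on some line of the file."""
    remaining = set(needles)
    with open(path) as datafile:
        for line in datafile:
            # Drop every needle that this line satisfies.
            remaining = {n for n in remaining if n not in line}
            if not remaining:
                return True  # stop reading as soon as everything is found
    return not remaining


# Build a small sample file in which 'xyz' comes before 'abc'.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("'xyzwfg'\n'abcdefghi'\n'lmnop'\n")
    sample_path = tmp.name

found_both = contains_all(sample_path, ['abc', 'xyz'])
found_missing = contains_all(sample_path, ['abc', 'qrs'])
os.remove(sample_path)

print('found' if found_both else 'fail')
```

Tracking a set of remaining strings instead of one flag per string keeps the loop unchanged if more search strings are added later.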

answered Feb 15, 2016 at 21:05

Steven Rumbalski


1 Comment

Steven Rumbalski Over a year ago

@FredrikRosenqvist: He does not seek to 0 or close and reopen the file, so the second loop continues reading the file where the first one left off.
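
The behaviour described in this comment can be reproduced in a few lines. This is an illustrative sketch, assuming Python 3, with io.StringIO standing in for the opened file; the variable names are made up for the demonstration.

```python
import io

# io.StringIO stands in for the opened file; 'xyz' appears before 'abc'.
datafile = io.StringIO("'xyzwfg'\n'abcdefghi'\n'lmnop'\n")

abc = 0
for line in datafile:
    if 'abc' in line:
        abc += 1
        break  # stops after line 2; lines 1-2 are now consumed

xyz_without_rewind = 0
for line in datafile:  # resumes at line 3, so 'xyz' on line 1 is never seen
    if 'xyz' in line:
        xyz_without_rewind += 1
        break

datafile.seek(0)  # rewind to the start before scanning again
xyz_after_rewind = 0
for line in datafile:
    if 'xyz' in line:
        xyz_after_rewind += 1
        break

print(abc, xyz_without_rewind, xyz_after_rewind)
```

Rewinding makes the two-loop version correct, but it still reads the file twice, which is why testing both strings in one loop is preferable.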

Nathaniel Ford · Accepted Answer · 2016-02-15 21:02:47Z

0

"Better" is subjective and there are no metrics provided to define it. However, a simple optimization is the following:

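abc = 0
xyz = 0
with open('sample.txt') as datafile:
    for line in datafile:
        if 'abc' in line:
            abc += 1  # count every line containing 'abc'
        if 'xyz' in line:
            xyz += 1  # count every line containing 'xyz'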

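If the concern is that the file is very large, read it one line at a time instead of loading it all into memory. Iterating over the file object, as above, already does that; an explicit readline() loop is equivalent: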

abc = 0
xyz = 0
with open('myTextFile.txt', 'r') as f:  # the file is closed automatically
    line = f.readline()
    while line:  # readline() returns '' at end of file
        if 'abc' in line:
            abc += 1
        if 'xyz' in line:
            xyz += 1
        line = f.readline()

This counts the number of lines in which 'abc' and 'xyz' occur, respectively. If the idea is instead to quit as soon as a single matching line has been found for each string, then adding a break is appropriate.
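
The difference between counting every matching line and stopping early can be made explicit with a flag. This is a sketch only, assuming Python 3; the function count_matching_lines and its stop_when_all_found parameter are hypothetical names, and io.StringIO stands in for a real file.

```python
import io


def count_matching_lines(lines, needles, stop_when_all_found=False):
    """Count, per needle, the lines that contain it, in a single pass."""
    counts = {needle: 0 for needle in needles}
    for line in lines:
        for needle in needles:
            if needle in line:
                counts[needle] += 1
        # Optional early exit once every needle has been seen at least once.
        if stop_when_all_found and all(counts.values()):
            break
    return counts


sample = "'abcdefghi'\n'xyzwfg'\n'lmnop abc'\n"

# Full pass: 'abc' is on lines 1 and 3, 'xyz' on line 2.
full_counts = count_matching_lines(io.StringIO(sample), ['abc', 'xyz'])

# Early exit: reading stops after line 2, so the third line is never counted.
early_counts = count_matching_lines(
    io.StringIO(sample), ['abc', 'xyz'], stop_when_all_found=True
)

print(full_counts, early_counts)
```

With the flag set, the counts only say whether each string was seen before the loop stopped, so they should not be read as totals.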

edited Feb 15, 2016 at 21:02

answered Feb 15, 2016 at 21:01

Nathaniel Ford


3 Comments

John Gordon Over a year ago

This is exactly what I was going to answer. However, note that this will process the whole file, whereas the original code stops looking after one occurrence is found.

Nathaniel Ford Over a year ago

True... though it's unclear from the original question if that's intentional. If so, why use +=?

Steven Rumbalski Over a year ago

The idiomatic way to read a file line by line is for line in f:. No need for the awkward while loop and explicit calls to f.readline().
