I am looking for a better way to parse a huge file. Here is an example of the file:

sample.txt

'abcdefghi'
'xyzwfg'
'lmnop'

I want to check that both 'abc' and 'xyz' occur in the file at least once.

I was able to find them, but I am looking for a better way. Here is my code:

datafile = file('sample.txt')
abc = 0
xyz = 0
found = True

for line in datafile:
        if 'abc' in line:
            abc += 1
            break    
for line in datafile:
        if 'xyz' in line:
            xyz += 1
            break

if (abc + xyz) >= 2:
    print 'found'
else:
    print 'fail'

I am running the loop twice. Is there a better way to parse the file?

    Do you care about the total number of occurrences found? Your use of a counter instead of a true/false flag suggests yes, but the use of break suggests no. Commented Feb 15, 2016 at 21:04

2 Answers

Your current code will produce incorrect results if 'xyz' occurs before 'abc': the second loop resumes where the first one stopped, so it never sees the earlier lines. To fix this, test for both in the same loop.

with open('sample.txt') as datafile:
    abc_found = False
    xyz_found = False

    for line in datafile:
        if 'abc' in line:
            abc_found = True
        if 'xyz' in line:
            xyz_found = True
        if abc_found and xyz_found: 
            break # stop looking if both found
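
The same single-pass idea can be wrapped in a reusable function. The following is a minimal sketch, assuming Python 3; the name contains_all and its parameters are hypothetical and not part of the original answer. It accepts any iterable of lines, so an open file object works just as well as the in-memory list used here.

```python
def contains_all(lines, needles):
    """Return True if every substring in `needles` occurs in some line.

    `lines` is any iterable of strings, e.g. an open file object.
    Reading stops as soon as all substrings have been seen.
    """
    remaining = set(needles)
    for line in lines:
        # Drop every substring that this line contains.
        remaining = {n for n in remaining if n not in line}
        if not remaining:
            return True  # everything found, stop reading
    return not remaining  # True only if `needles` was empty


# 'xyz' appears before 'abc' here, which defeats the two-loop version.
sample = ["'xyzwfg'\n", "'abcdefghi'\n", "'lmnop'\n"]
print('found' if contains_all(sample, ['abc', 'xyz']) else 'fail')
```

With a real file, the call would look like contains_all(datafile, ['abc', 'xyz']) inside the with open('sample.txt') as datafile: block.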

1 Comment

@FredrikRosenqvist: He does not seek to 0 or close and reopen the file, so the second loop continues reading the file where the first one left off.
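
The behaviour described in this comment is easy to reproduce. Here is a small sketch, assuming Python 3 and using io.StringIO as a stand-in for a real file (the variable names are illustrative only). A file object is its own iterator, so a second for loop resumes after the line where the first one hit break.

```python
import io

# In-memory stand-in for a file in which 'xyz' comes before 'abc'.
datafile = io.StringIO("xyzwfg\nabcdefghi\nlmnop\n")

abc_found = False
for line in datafile:
    if 'abc' in line:
        abc_found = True
        break  # the iterator is now positioned after the second line

# This loop does not restart at the top; it only sees the remaining lines.
seen_by_second_loop = [line.strip() for line in datafile]
xyz_found = any('xyz' in line for line in seen_by_second_loop)

print(abc_found, xyz_found, seen_by_second_loop)  # True False ['lmnop']
```

Calling datafile.seek(0) between the two loops would rewind the file, at the cost of a second pass over it.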

"Better" is subjective and there are no metrics provided to define it. However, a simple optimization is the following:

for line in datafile:
    if 'abc' in line:
        abc += 1
    if 'xyz' in line:
        xyz += 1

If the actual problem is that the file is very large, make sure you read only one line at a time. Iterating over the file object already does this; an explicit readline() loop is equivalent:

abc = 0
xyz = 0
with open('myTextFile.txt', 'r') as f:  # the file is closed automatically
    line = f.readline()
    while line:  # readline() returns '' at end of file
        if 'abc' in line:
            abc += 1
        if 'xyz' in line:
            xyz += 1
        line = f.readline()

This counts the number of lines in which 'abc' and 'xyz' occur, respectively. If the idea is to quit as soon as you find a single matching line, then including the break is appropriate.
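
To make the counting behaviour concrete, here is a hedged sketch, assuming Python 3. The helper name count_matching_lines is hypothetical. It iterates directly over the lines, which for a file object yields one line at a time without loading the whole file into memory.

```python
def count_matching_lines(lines, needles):
    """Count, for each substring, the number of lines that contain it."""
    counts = {needle: 0 for needle in needles}
    for line in lines:
        for needle in needles:
            if needle in line:
                counts[needle] += 1
    return counts


# The last line contains both substrings, so each count is 2.
sample = ["'abcdefghi'\n", "'xyzwfg'\n", "'lmnop'\n", "'abcxyz'\n"]
counts = count_matching_lines(sample, ['abc', 'xyz'])
print(counts)  # {'abc': 2, 'xyz': 2}
```

Unlike a version with break, this always reads to the end of the input, which is the price of getting full counts.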

3 Comments

This is exactly what I was going to answer. However note that this will process the whole file, where the original code stops looking after one occurrence is found.
True... though it's unclear from the original question if that's intentional. If so, why use +=?
The idiomatic way to read a file line by line is for line in f:. No need for the awkward while loop and explicit calls to f.readline().
