Large TXT file Parsing Problem in python

Question

Been trying to figure this one out all day. I have a large text file (546 MB) that I am trying to parse in Python, pulling out the text between an opening tag and a closing tag, and I keep running into memory problems. With the help of the good folks on this board, this is what I have so far.

answer = ''
output_file = open('/Users/Desktop/Poetrylist.txt','w')

with open('/Users/Desktop/2e.txt','r') as open_file:
    for each_line in open_file:
        if each_line.find('<A>'):
            start_position = each_line.find('<A>')
            start_position = start_position + 3
            end_position = each_line[start_position:].find('</W>')

            answer = each_line[start_position:end_position] + '\n'
            output_file.write(answer)

output_file.close()

I am getting this error message:

Traceback (most recent call last):
  File "C:\Users\Adam\Desktop\OEDsearch3.py", line 9, in <module>
    end_position = each_line[start_position:].find('</W>')
MemoryError

I have little to no programming experience and I am trying to figure this out for a poetry project I am working on. Any help is greatly appreciated.

Sorry the code did not come out right. My apologies, I am feeling useless.
– English Grad, Commented Jul 22, 2011 at 19:27
Is this an XML file? If so, consider using a library: ElementTree, lxml, BeautifulSoup, etc.
– jterrace, Commented Jul 22, 2011 at 19:31
Sure. There are no end of line markers which I think is part of the problem. Here is a sample from the file: <E><HG><HL><LF>A</LF><SF>A</SF><MF>A</MF></HL> <MPR><i>e&mac.</i><su>i</su></MPR><IPR><IPH>e&shti.</IPH></IPR>, </HG><S0><S2><S4><S6><DEF>the first letter of the Roman Alphabet, and of its various subsequent modifications (as were its prototypes Alpha of the Greek, and Aleph of the Ph&oe.nician and old Hebrew); representing originally in English, as in Latin, the `low-back-wide' vowel, formed with the widest opening of jaws, pharynx, and lips. The plural has been written <CF>aes</CF>
– English Grad, Commented Jul 22, 2011 at 19:32

Ned Batchelder · Accepted Answer · 2011-07-22 20:05:26Z

4

Your logic is wrong because .find() returns -1 if the string is not found, and -1 is a true-ish value, so your code will think every line has <A> in it.
You don't need to make a new substring to find the '</W>', because .find() also has an optional start argument.
Neither of these explains why you are running out of memory. Do you have an unusually small-memory machine?
Are you sure you're showing us all the code?
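
A small sketch of the first two points, in case it helps. The sample strings are made up for illustration and are not taken from the file. Passing a start index to .find() also keeps the returned position relative to the whole string, which the slice-based version in the question does not.

```python
# Hypothetical sample text; the real file's contents will differ.
line = "intro <A>first</W> middle <A>second</W>"

# find() returns -1 when the substring is missing, and -1 is truthy.
missing = "no tags here".find("<A>")
print(missing, bool(missing))

# It returns 0 when the substring is at the very start, and 0 is falsy,
# so `if text.find(...)` gets both cases backwards.
at_start = "<A>x</W>".find("<A>")
print(at_start, bool(at_start))

# The `in` operator expresses the membership test directly.
has_tag = "<A>" in line

# find() accepts an optional start index, so no slice is needed and the
# result is an index into the original string.
start = line.find("<A>") + len("<A>")
end = line.find("</W>", start)
first_match = line[start:end]
print(first_match)
```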

EDITED: OK, now I think your file only has one line in it.

Try changing your code like this:

with open('/Users/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Desktop/2e.txt','r') as open_file:
        # Read everything at once, since the file has no line breaks to iterate over.
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            start_position = the_whole_file.find('<A>', start_position)
            if start_position < 0:
                break    # no more opening tags
            start_position += 3    # skip past '<A>'
            end_position = the_whole_file.find('</W>', start_position)
            if end_position < 0:
                break    # opening tag without a closing tag; stop here
            output_file.write(the_whole_file[start_position:end_position])
            output_file.write("\n")
            start_position = end_position + 4    # skip past '</W>'
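
If holding the whole file in memory ever becomes a concern, the same search can be run over fixed-size chunks instead. This is only a minimal sketch, assuming the tags are `<A>` and `</W>` as in the question. The names `extract_between` and `read_chunks`, the chunk size, and the sample text are all made up for illustration.

```python
import io


def extract_between(chunks, open_tag="<A>", close_tag="</W>"):
    """Yield the text between open_tag and close_tag from an iterable of string chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find(open_tag)
            if start < 0:
                # Keep only a short tail that could hold the start of a split tag.
                keep = len(open_tag) - 1
                buffer = buffer[-keep:] if keep else ""
                break
            end = buffer.find(close_tag, start + len(open_tag))
            if end < 0:
                # Opening tag seen but not closed yet: wait for more data.
                # (An opening tag that is never closed makes the buffer grow.)
                buffer = buffer[start:]
                break
            yield buffer[start + len(open_tag):end]
            buffer = buffer[end + len(close_tag):]


def read_chunks(file_obj, size=1024 * 1024):
    """Yield successive blocks of up to `size` characters from an open text file."""
    while True:
        chunk = file_obj.read(size)
        if not chunk:
            return
        yield chunk


# Tiny in-memory stand-in for the real file, read 4 characters at a time
# so that tags are deliberately split across chunk boundaries.
sample = io.StringIO("x<A>one</W>y<A>two</W>z")
results = list(extract_between(read_chunks(sample, size=4)))
print(results)
```

With a real file, the `io.StringIO` object would be replaced by the file opened with `open(...)`, and each yielded string written to the output file followed by a newline.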

edited Jul 22, 2011 at 20:05

answered Jul 22, 2011 at 19:34

Ned Batchelder


9 Comments

English Grad Over a year ago

Actually no, the opposite is true. My machine has 16 GB of RAM, as I use it primarily as a music workstation. I have had 60 real-time audio tracks playing with multiple effects on each track.

English Grad Over a year ago

Ned that is all the code. What you wrote worked!! Can I just put this in a loop and run it over the whole file, because it returned only one value. Thanks for the help, I literally started programming one week ago. My experience is the python manual.

Ned Batchelder Over a year ago

Hmm, mysterious. As you can see, this does have a loop over all the lines in the file. It should do just what your original code did: pull out stuff between <A> and </W> in every line in the file.

English Grad Over a year ago

Yeah, that is what it looks like to me too, but it pulls out only one. There are thousands of tags in the file.

Ned Batchelder Over a year ago

Now I like @TokenMacGuy's theory: you are getting the entire file as one "line".


SingleNegationElimination · Accepted Answer · 2011-07-22 19:37:59Z

2

I think you might be running into a problem with line endings. iter(open_file) is supposed to return each line separately, but it might incorrectly guess at the line terminator, which varies from OS to OS. You can get Python to treat any line ending for any OS as a line ending for the purposes of readlines/iter by adding a "U" to the flags to open. Try this:

with open('/Users/Desktop/2e.txt','rU') as open_file:
#                                   ^

with the rest all the same. (comment added for emphasis).
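
A note for anyone reading this on Python 3: the "U" flag was deprecated and then removed in Python 3.11, because text mode already does universal newline handling by default (`newline=None`). A small sketch of that behavior, using a temporary file with made-up contents rather than the real data:

```python
import os
import tempfile

# Write a hypothetical sample with old Mac (\r), Windows (\r\n) and Unix (\n) endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\rsecond\r\nthird\n")
    path = tmp.name

# newline=None is the default in text mode; it is spelled out here for emphasis.
with open(path, "r", newline=None) as open_file:
    lines = list(open_file)

os.remove(path)
print(lines)
```

Every line comes back ending in "\n" regardless of which terminator was in the file. None of this helps if the file contains no line terminators at all, which is what the sample in the question's comments suggests.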

answered Jul 22, 2011 at 19:37

SingleNegationElimination


1 Comment

English Grad Over a year ago

Thanks Token, it still runs the same way. Produces one line of output and then terminates. I put a specific tag in the code which occurs in the file 46 times so I had something to test it with and it will only give me one line. I have no idea what is wrong

pomel · Accepted Answer · 2011-07-22 19:35:04Z

1

Are you sure you want to use

if each_line.find('<A>'):

find() returns -1 if the substring is not found, so the condition is true even when there is no match.

answered Jul 22, 2011 at 19:35

pomel

