
I've been trying to figure this one out all day. I have a large text file (546 MB) that I am trying to parse in Python. I want to pull out the text between the open tag and the close tag, but I keep getting memory problems. With the help of the good folks on this board, this is what I have so far.

answer = ''
output_file = open('/Users/Desktop/Poetrylist.txt','w')

with open('/Users/Desktop/2e.txt','r') as open_file:
    for each_line in open_file:
        if each_line.find('<A>'):
            start_position = each_line.find('<A>')
            start_position = start_position + 3
            end_position = each_line[start_position:].find('</W>')

            answer = each_line[start_position:end_position] + '\n'
            output_file.write(answer)

output_file.close()

I am getting this error message:

Traceback (most recent call last):
  File "C:\Users\Adam\Desktop\OEDsearch3.py", line 9, in <module>
    end_position = each_line[start_position:].find('</W>')
MemoryError

I have little to no programming experience and I am trying to figure this out for a poetry project I am working on. Any help is greatly appreciated.

  • Sorry the code did not come out right. My apologies, I am feeling useless. Commented Jul 22, 2011 at 19:27
  • Can you give us a few example lines of txt? Commented Jul 22, 2011 at 19:29
  • Is this an XML file? If so, consider using a library. ElementTree, lxml, BeautifulSoup, etc. Commented Jul 22, 2011 at 19:31
  • Sure. There are no end-of-line markers, which I think is part of the problem. Here is a sample from the file: <E><HG><HL><LF>A</LF><SF>A</SF><MF>A</MF></HL> <MPR><i>e&mac.</i><su>i</su></MPR><IPR><IPH>e&shti.</IPH></IPR>, </HG><S0><S2><S4><S6><DEF>the first letter of the Roman Alphabet, and of its various subsequent modifications (as were its prototypes Alpha of the Greek, and Aleph of the Ph&oe.nician and old Hebrew); representing originally in English, as in Latin, the `low-back-wide' vowel, formed with the widest opening of jaws, pharynx, and lips. The plural has been written <CF>aes</CF> Commented Jul 22, 2011 at 19:32

3 Answers

4
  1. Your logic is wrong because .find() returns -1 if the string is not found, and -1 is a true-ish value, so your code will think every line has <A> in it.

  2. You don't need to make a new substring to find the '</W>', because .find() also has an optional start argument.

  3. Neither of these explains why you are running out of memory. Do you have an unusually small-memory machine?

  4. Are you sure you're showing us all the code?
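Points 1 and 2 can be sketched in a few lines. This is only an illustration, written for Python 3; the helper name extract_between is made up for the example and is not part of the original code.

```python
line = "no tags here"

# find() returns -1 when the substring is missing, and -1 counts as true.
print(bool(line.find("<A>")))            # True

# A match at index 0 counts as false, so the test is wrong in both directions.
print(bool("<A>text</W>".find("<A>")))   # False


def extract_between(text, open_tag, close_tag):
    """Return the text between the first open_tag and the next close_tag, or None."""
    start = text.find(open_tag)
    if start == -1:              # compare against -1 explicitly
        return None
    start += len(open_tag)
    # The second argument to find() is the start index, so no substring copy is needed.
    end = text.find(close_tag, start)
    if end == -1:
        return None
    return text[start:end]
```

Because the end index comes from a search on the original string rather than on a slice of it, it can be used directly in text[start:end].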

EDITED: OK, now I think your file only has one line in it.

Try changing your code like this:

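with open('/Users/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Desktop/2e.txt','r') as open_file:
        # Read everything at once, since the file has no line breaks to split on
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            # Search for the next opening tag, starting where the last match ended
            start_position = the_whole_file.find('<A>', start_position)
            if start_position < 0:
                break
            start_position += 3  # skip past '<A>'
            end_position = the_whole_file.find('</W>', start_position)
            if end_position < 0:
                break  # opening tag with no closing tag: stop instead of looping forever
            output_file.write(the_whole_file[start_position:end_position])
            output_file.write("\n")
            start_position = end_position + 4  # skip past '</W>'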
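If holding all 546 MB in memory at once turns out to be a problem, the same search can be done over fixed-size chunks. The following is a rough sketch under that assumption, written for Python 3; the names iter_tagged, read_chunks and extract_to_file, and the 1 MB chunk size, are invented for the example.

```python
def iter_tagged(chunks, open_tag="<A>", close_tag="</W>"):
    """Yield the text between open_tag and close_tag from an iterable of string chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        pos = 0
        while True:
            start = buffer.find(open_tag, pos)
            if start == -1:
                # No open tag left: keep only a short tail, in case an open tag
                # is split across two chunks.
                keep_from = max(pos, len(buffer) - (len(open_tag) - 1))
                buffer = buffer[keep_from:]
                break
            end = buffer.find(close_tag, start + len(open_tag))
            if end == -1:
                # Open tag seen but its close tag is not read yet: keep from the open tag.
                buffer = buffer[start:]
                break
            yield buffer[start + len(open_tag):end]
            pos = end + len(close_tag)


def read_chunks(path, chunk_size=1024 * 1024):
    """Yield the file at path as a sequence of text chunks."""
    with open(path, "r") as open_file:
        while True:
            chunk = open_file.read(chunk_size)
            if not chunk:
                break
            yield chunk


def extract_to_file(in_path, out_path):
    """Write each tagged span found in in_path to out_path, one per line."""
    with open(out_path, "w") as output_file:
        for text in iter_tagged(read_chunks(in_path)):
            output_file.write(text + "\n")
```

Calling extract_to_file('/Users/Desktop/2e.txt', '/Users/Desktop/Poetrylist.txt') would then do the same job as the loop above. Note that an opening tag that never gets a closing tag makes the buffer grow until the end of the file, so this only saves memory when the tags are properly paired.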

9 Comments

Actually no, the opposite is true. My machine has 16 GB of RAM, as I use it primarily as a music workstation. I have had 60 real-time audio tracks playing with multiple effects on each track.
Ned, that is all the code. What you wrote worked! Can I just put this in a loop and run it over the whole file? It returned only one value. Thanks for the help, I literally started programming one week ago. My experience is the Python manual.
Hmm, mysterious. As you can see, this does have a loop over all the lines in the file. It should do just what your original code did: pull out stuff between <A> and </W> in every line in the file.
Yeah, that is what it looks like to me too, but it pulls out only one. There are thousands of tags in the file.
Now I like @TokenMacGuy's theory: you are getting the entire file as one "line".
2

I think you might be running into a problem with line endings. iter(open_file) is supposed to return each line separately, but it might guess the line terminator incorrectly, since it varies from OS to OS. You can get Python to treat any OS's line ending as a line ending for the purposes of readlines/iter by adding a "U" to the flags passed to open. Try this:

with open('/Users/Desktop/2e.txt','rU') as open_file:
#                                   ^

with the rest all the same. (comment added for emphasis).
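One way to check whether line endings are the issue at all is to count them. This is a small diagnostic sketch, not part of the fix itself; the function name count_newline_bytes and the chunk size are made up for the example.

```python
def count_newline_bytes(path, chunk_size=1024 * 1024):
    """Return (lf_count, cr_count) for the file at path, reading it in chunks."""
    lf = 0
    cr = 0
    # Binary mode, so Python does not translate any line endings before we count them.
    with open(path, "rb") as raw_file:
        while True:
            chunk = raw_file.read(chunk_size)
            if not chunk:
                break
            lf += chunk.count(b"\n")
            cr += chunk.count(b"\r")
    return lf, cr
```

If both counts come back as 0, the file has no line breaks of any kind, and iterating over it will hand back the whole file as a single "line" whatever mode it is opened in.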

1 Comment

Thanks Token, it still runs the same way. It produces one line of output and then terminates. I put a specific tag in the code which occurs in the file 46 times, so I had something to test it with, and it will only give me one line. I have no idea what is wrong.
1

Are you sure you want to use

if each_line.find('<A>'):

find() returns -1 if the substring is not found, so the condition will be true even when there are no matches.
