
This is the code I have to count word frequencies:

import collections
import codecs
import io
from collections import Counter
with io.open('Combine.txt', 'r', encoding='utf8') as infh:
    words =infh.read().split()
    with open('Counts2.txt', 'wb') as f:
        for word, count in Counter(words).most_common(100000000):
            f.write(u'{} {}\n'.format(word, count).encode('utf-8')) 

When I try to read a big file (4 GB), I get this error:

Traceback (most recent call last):
  File "counter.py", line 7, in <module>
    words =infh.read().split()
  File "/usr/lib/python2.7/codecs.py", line 296, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError

I am using Ubuntu 12.04 with 8 GB RAM and an Intel Core i7. How do I fix this error?


2 Answers


This is the pythonic way to process a file line-by-line:

with open(...) as fh:
    for line in fh:
        pass

This takes care of opening and closing the file, even if an exception is raised in the inner block. It also treats the file object fh as an iterable, which automatically uses buffered I/O and manages memory, so you don't have to worry about large files.
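Applied to the word count from the question, that looks roughly like the sketch below. This is a minimal sketch, not a drop-in fix: it assumes the words in Combine.txt are spread over many lines rather than sitting on one huge line.

# Count word frequencies line-by-line (Python 2), so the whole file is
# never loaded into memory at once.
import io
from collections import Counter

counts = Counter()
with io.open('Combine.txt', 'r', encoding='utf8') as infh:
    for line in infh:                  # one line in memory at a time
        counts.update(line.split())    # add this line's words to the running totals

with open('Counts2.txt', 'wb') as f:
    for word, count in counts.most_common():
        f.write(u'{} {}\n'.format(word, count).encode('utf-8'))

The 4 GB file is never held in memory; only the Counter of unique words is, which is normally far smaller.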


3 Comments

What if all the words are on a single line?
It should be trivial to either: a) convert it to one word per line via your shell, or b) read the file in chunks (i.e. manually manage memory) and process accordingly; see the sketch after these comments.
@MichaelFoukarakis the error is at usr/lib/python2.7/codecs.py, line 296, in decode: (result, consumed) = self._buffer_decode(data, self.errors, final) MemoryError
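For option (b) above, reading the file in fixed-size chunks could look roughly like this sketch. The 1 MB chunk size is an arbitrary choice and the file name is taken from the question; a word cut in half at a chunk boundary is carried over to the next chunk.

# Count word frequencies by reading decoded text in chunks (Python 2).
import io
from collections import Counter

counts = Counter()
leftover = u''
with io.open('Combine.txt', 'r', encoding='utf8') as infh:
    while True:
        chunk = infh.read(1024 * 1024)   # roughly 1 MB of decoded text at a time
        if not chunk:
            break
        chunk = leftover + chunk
        words = chunk.split()
        if not chunk[-1].isspace():
            # The chunk ends mid-word; keep the partial word for the next round.
            leftover = words.pop()
        else:
            leftover = u''
        counts.update(words)
if leftover:
    counts[leftover] += 1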

How about readline() instead of read()?

http://docs.python.org/2/tutorial/inputoutput.html
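A minimal sketch of that approach, assuming Python 2 and the file from the question; process() is just a hypothetical placeholder for the per-line work:

import io

with io.open('Combine.txt', 'r', encoding='utf8') as infh:
    while True:
        line = infh.readline()
        if not line:       # an empty string means end of file
            break
        process(line)      # hypothetical placeholder for whatever is done per line

Only one line is held in memory at a time, so this avoids the MemoryError as long as the file is not a single enormous line.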

