3

I have written a program that works for small files, but it doesn't work for files of around 1 GB. Is there any way to handle a big file? Here is the code. Thanks in advance.

fh=open('reg.fa','r')
c=fh.readlines()
fh.close() 
s=''  
for i in range(0,(len(c))):  
    s=s+c[i]  
    lines=s.split('\n')
    for line in s:
            s=s.replace('\n','')
s=s.replace('\n','')          
print s 
2
  • You should probably add more explanation. If reg.fa is too big for memory, then I suspect s would also be too large. While it is easy enough to iterate over some units in Python, you are still going to be constrained by memory. I don't think you want to read a line at a time and write it back out; that would take a while. I think you will need to write to a new file, because as you append to your string you will be messing with the file pointer. Commented May 6, 2009 at 19:44
  • You also don't need to specify range(0, len(c)); range(len(c)) does the same thing. Until you get comfortable with the various iterators, you can always write something like for i in range(len(c)): Commented May 6, 2009 at 19:55

6 Answers

17

The readlines method reads in the entire file. You don't want to do that for a file that is large in relation to your physical memory size.

The fix is to read the file in small chunks, and process those individually. You can, for example, do something like this:

for line in f.xreadlines():
    ... do something with the line

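xreadlines does not return a list of lines but an iterator, which hands back one line at a time as the for loop asks for it. (It exists only in Python 2, where it has long been deprecated in favour of iterating over the file object itself.) An even simpler way of doing that is: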

for line in f:
    ... do something with the line

Depending on what you do, processing the file line by line may be easy or hard. I didn't really get what your sample code is trying to do, but it looks like it should be possible to do it one line at a time.
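
As a minimal sketch of the line-by-line approach, assuming the goal is what the question's code appears to do (join all lines of the file into one), the loop below writes each line out as soon as it is read, so only one line is held in memory at a time. The name join_lines and its two stream arguments are illustrative and not part of the original code; the snippet is written for Python 3.

```python
def join_lines(src, dst):
    """Copy src to dst with the line terminators removed, one line at a time."""
    for line in src:  # a file object yields its lines lazily
        dst.write(line.rstrip('\n'))  # drop only the trailing newline


# Usage with the file name from the question (needs "import sys"):
#     with open('reg.fa') as fh:
#         join_lines(fh, sys.stdout)
```

Because nothing is accumulated in a string, memory use stays roughly constant no matter how large the input file is.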


7

The script is not working because it reads all lines of the file in advance, making it necessary to keep the whole file in memory. The easiest way to iterate over all lines in a file is:

for line in open("test.txt", "r"):
    # do something with the "line"


5

With readlines() you read the whole file at once, so you use 1 GB of memory. Instead, try:

f = open(...)
while 1:
    line = f.readline()
    if not line:
        break
    line = line.rstrip()
    # ... do something with line
f.close()

If all you need is to remove \n, then don't do it line by line; do it on chunks of text instead:

import sys

f = open('query.txt', 'r')
while 1:
    part = f.read(1024)  # read at most 1024 characters at a time
    if not part:         # an empty string means end of file
        break
    part = part.replace('\n', '')
    sys.stdout.write(part)
f.close()
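
The same chunked idea can be sketched as a small generator, which keeps the read loop separate from the processing. The names read_chunks and strip_newlines are hypothetical, and the 64 KiB default is only an assumed starting point (a comment on this answer suggests at least that much). The snippet is written for Python 3.

```python
def read_chunks(fh, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size characters from fh."""
    while True:
        part = fh.read(chunk_size)
        if not part:  # an empty result means end of file
            break
        yield part


def strip_newlines(src, dst, chunk_size=64 * 1024):
    """Copy src to dst, dropping every newline character."""
    for part in read_chunks(src, chunk_size):
        dst.write(part.replace('\n', ''))
```

Removing a single character such as '\n' is safe across chunk boundaries. A multi-character pattern (for example '\r\n' in a file opened in binary mode) could be split between two chunks and would need extra handling.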

2 Comments

1024 is a dumb, low buffer size. You should increase it to at least 64 KiB. Also, it's stupid of Python not to use a generator in the readlines method.
The readlines method was added before Python had generators, and changing it later would have caused existing programs to break. That's the curse of evolving languages.
2

Your program is very redundant. It looks like everything it does can be done with these lines:

import sys
for line in open('reg.fa'):
    sys.stdout.write(line.rstrip())

That is enough. This program gives the same result as the original code in the question, but it is much simpler and clearer. It can also handle files of any size.

1 Comment

Doesn't give exactly the same result: this strips all trailing whitespace on lines (not just the line terminator), and doesn't print a final newline.
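
To illustrate the difference that comment points out, here is a small sketch comparing rstrip() with rstrip('\n'). The sample string is made up for the example.

```python
line = 'ACGT \t\n'

# rstrip() with no argument removes all trailing whitespace.
stripped_all = line.rstrip()          # 'ACGT'

# rstrip('\n') removes only trailing newline characters.
stripped_newline = line.rstrip('\n')  # 'ACGT \t'
```

So to keep trailing spaces and tabs, line.rstrip('\n') is the closer match, and writing one '\n' after the loop would restore the final newline that print adds in the original code.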
0

From your code it is clear that you want the file's content as a single line. From a coding point of view, it is bad to store the whole file content in one string buffer and only then process it. The code also contains too many local variables.

You could have used the following chunk of code:

f = open(file_name, mode)
for line in f:
    """
    Do the processing
    """


0
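Use wb+ mode only if the file has not been created yet: it creates the file and opens it for both writing and reading. Note that wb+ truncates an existing file, so to read a file that already exists, open it with 'rb' instead:

import sys

f = open('f_name.txt', 'rb')
while 1:
    part = f.read(1024)
    if not part:
        break
    part = part.replace('\n', '')
    sys.stdout.write(part)
f.close()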
