
I want to read each line of a text file in Python (around 1 billion lines), take some words from each line, and write them to another file. I have used:

with open('') as f:
    for line in f:
        process_line(line)

This process is taking a lot of time. How can I process the whole file in about 2 hours?

  • What does process_line actually do? Please show us the code. Commented Oct 31, 2018 at 16:47
  • It's not exactly process_line. Each line consists of "word,word1,word2"; I'm splitting these three words (.split(",")) and writing them to 3 separate files using f.write(). Commented Oct 31, 2018 at 16:51
  • If processing each line is independent, this problem can be modelled as divide and conquer: first split the large file into smaller files using the Linux split command, then run the same program on the split files, preferably in parallel (see the sketch after these comments). Commented Oct 31, 2018 at 18:46
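
This is roughly what that divide-and-conquer suggestion could look like in Python. It is a sketch only: the chunk_?? input names (as produced by something like split -l 100000000 input.txt chunk_), the .colN output names, and the three-words-per-line format are assumptions taken from the comments above, not the asker's actual code.

# Sketch of the "split the file, then process the chunks in parallel" idea.
# Assumes the input was pre-split with e.g. `split -l 100000000 input.txt chunk_`
# and that every line looks like "word,word1,word2"; file names are hypothetical.
import glob
from multiprocessing import Pool

def process_chunk(path):
    # Split each line of this chunk into its three columns and write them
    # to three chunk-local output files.
    with open(path) as f, \
         open(path + '.col1', 'w') as o1, \
         open(path + '.col2', 'w') as o2, \
         open(path + '.col3', 'w') as o3:
        for line in f:
            a, b, c = line.rstrip('\n').split(',')
            o1.write(a + '\n')
            o2.write(b + '\n')
            o3.write(c + '\n')
    return path

if __name__ == '__main__':
    with Pool() as pool:
        # One worker per CPU core; each processes a whole chunk independently.
        pool.map(process_chunk, glob.glob('chunk_??'))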

2 Answers


The performance bottleneck of your script most likely comes from the fact that it writes to 3 files at the same time, which causes massive fragmentation between the files and hence a lot of overhead.

So instead of writing to the 3 files as you read through the lines, you can buffer up a million lines (which should take less than 1 GB of memory) and then write the 3 million words to the output files one file at a time, which produces much less file fragmentation:

def write_words(words, *files):
    # words is a list of [word, word1, word2] rows; write column i to files[i],
    # finishing one output file completely before moving on to the next.
    for i, file in enumerate(files):
        for word in words:
            file.write(word[i] + '\n')

words = []
with open('input.txt', 'r') as f, open('words1.txt', 'w') as out1, \
     open('words2.txt', 'w') as out2, open('words3.txt', 'w') as out3:
    for count, line in enumerate(f, 1):
        words.append(line.rstrip().split(','))
        if count % 1000000 == 0:
            # Flush the buffer every million lines, one output file at a time.
            write_words(words, out1, out2, out3)
            words = []
    # Write whatever is left in the buffer after the last full million.
    write_words(words, out1, out2, out3)

7 Comments

Still, 2 hours sounds massively too long, even for a billion lines. OP's machine may be from the 1970s though.
I am using the latest MacBook Pro (8 GB RAM + 512 GB storage).
But still, a quick and dirty calculation: you are materializing a list of 10_000_000 lists (each with three items). Even ignoring the actual strings, and assuming a list of 3 items is about 88 bytes (again, ignoring the actual strings, which are not trivial), I predict 10_000_000*8*88*1e-9 == 7.04 gigabytes... take this down by an order of magnitude.
@juanpa.arrivillaga Good point. I've updated the answer to buffer only a million lines at a time then.
Note, more realistically, I'm still predicting about 2 GB, given an average string size of about 10 characters, so about 60 bytes per string object: 1_000_000*8*(88 + 60*3)*1e-9. But this should be a reasonable buffer size.
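
For reference, the shallow per-object sizes that the estimate above relies on can be checked directly. This is just a quick sketch; the exact numbers depend on the CPython version and platform, and sys.getsizeof does not follow references, so list and string sizes are reported separately.

# Check the shallow object sizes used in the memory estimate above.
import sys

row = "word,word1,word2".rstrip().split(',')
print(sys.getsizeof(row))                # the 3-element list object itself
print([sys.getsizeof(s) for s in row])   # each of the three string objects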

Read about generators in Python. Your code should look like this:

def read_file(file_obj):
    # Yield one line at a time so the whole file never has to sit in memory.
    while True:
        data = file_obj.readline()
        if not data:
            break
        yield data
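
A hypothetical way to use that generator for the splitting task described in the question's comments. The input/output file names and the three-words-per-line format are assumptions, not part of this answer.

# Hypothetical usage of read_file(): stream the input one line at a time and
# fan the three comma-separated words out to three output files.
with open('input.txt') as f, \
     open('words1.txt', 'w') as out1, \
     open('words2.txt', 'w') as out2, \
     open('words3.txt', 'w') as out3:
    for line in read_file(f):
        word, word1, word2 = line.rstrip('\n').split(',')
        out1.write(word + '\n')
        out2.write(word1 + '\n')
        out3.write(word2 + '\n')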

Comments
