
I want to read each line of a text file in Python (around 1 billion lines), take some words from each line, and write them to another file. I have used:

with open('') as f:
    for line in f:
        process_line(line)

This process is taking a lot of time. How can I process the whole file in about 2 hours?

  • What does process_line actually do? Please show us the code. Commented Oct 31, 2018 at 16:47
  • It's not exactly process_line. Each line consists of "word,word1,word2"; I'm splitting these three words (.split(",")) and writing them to 3 separate files using f.write(). Commented Oct 31, 2018 at 16:51
  • If processing each line is independent, this problem can be modelled as divide and conquer: first split the large file into smaller files using the Linux split command, then run the same program on the split files, preferably in parallel (see the sketch after these comments). Commented Oct 31, 2018 at 18:46
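
This is roughly what that divide-and-conquer suggestion could look like in Python. It is a sketch only: the chunk_?? input names (as produced by something like split -l 100000000 input.txt chunk_), the .colN output names, and the three-words-per-line format are assumptions taken from the comments above, not the asker's actual code.

# Sketch of the "split the file, then process the chunks in parallel" idea.
# Assumes the input was pre-split with e.g. `split -l 100000000 input.txt chunk_`
# and that every line looks like "word,word1,word2"; file names are hypothetical.
import glob
from multiprocessing import Pool

def process_chunk(path):
    # Split each line of this chunk into its three columns and write them
    # to three chunk-local output files.
    with open(path) as f, \
         open(path + '.col1', 'w') as o1, \
         open(path + '.col2', 'w') as o2, \
         open(path + '.col3', 'w') as o3:
        for line in f:
            a, b, c = line.rstrip('\n').split(',')
            o1.write(a + '\n')
            o2.write(b + '\n')
            o3.write(c + '\n')
    return path

if __name__ == '__main__':
    with Pool() as pool:
        # One worker per CPU core; each processes a whole chunk independently.
        pool.map(process_chunk, glob.glob('chunk_??'))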

2 Answers


The performance bottleneck of your script most likely comes from the fact that it writes to 3 files at the same time, which causes massive fragmentation between the files and hence a lot of overhead.

So instead of writing to the 3 files as you read through the lines, you can buffer up a million lines (which should take less than 1 GB of memory) and then write the 3 million words to the output files one file at a time, which produces much less file fragmentation:

def write_words(words, *files):
    # words is a list of [word, word1, word2] rows; write column i to files[i],
    # finishing one output file completely before moving on to the next.
    for i, file in enumerate(files):
        for word in words:
            file.write(word[i] + '\n')

words = []
with open('input.txt', 'r') as f, open('words1.txt', 'w') as out1, \
     open('words2.txt', 'w') as out2, open('words3.txt', 'w') as out3:
    for count, line in enumerate(f, 1):
        words.append(line.rstrip().split(','))
        if count % 1000000 == 0:
            # Flush the buffer every million lines, one output file at a time.
            write_words(words, out1, out2, out3)
            words = []
    # Write whatever is left in the buffer after the last full million.
    write_words(words, out1, out2, out3)

7 Comments

Still, 2 hours sounds massively too long, even for a billion lines. OP's machine may be from the 1970s though.
I am using the latest MacBook Pro (8 GB RAM + 512 GB storage).
But still, a quick and dirty calculation: you are materializing a list of 10_000_000 lists (each with three items). Even ignoring the actual strings, and assuming a list of 3 items is about 88 bytes (again, ignoring the actual strings, which are not trivial), I predict 10_000_000*8*88*1e-9 == 7.04 gigabytes... take this down by an order of magnitude.
@juanpa.arrivillaga Good point. I've updated the answer to buffer only a million lines at a time then.
Note, more realistically, I'm still predicting about 2 GB, given an average string size of about 10 characters, so about 60 bytes per string object: 1_000_000*8*(88 + 60*3)*1e-9. But this should be a reasonable buffer size.
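
For reference, the shallow per-object sizes that the estimate above relies on can be checked directly. This is just a quick sketch; the exact numbers depend on the CPython version and platform, and sys.getsizeof does not follow references, so list and string sizes are reported separately.

# Check the shallow object sizes used in the memory estimate above.
import sys

row = "word,word1,word2".rstrip().split(',')
print(sys.getsizeof(row))                # the 3-element list object itself
print([sys.getsizeof(s) for s in row])   # each of the three string objects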

Read about generators in Python. Your code should look like this:

def read_file(file_obj):
    # Yield one line at a time so the whole file never has to sit in memory.
    while True:
        data = file_obj.readline()
        if not data:
            break
        yield data
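
A hypothetical way to use that generator for the splitting task described in the question's comments. The input/output file names and the three-words-per-line format are assumptions, not part of this answer.

# Hypothetical usage of read_file(): stream the input one line at a time and
# fan the three comma-separated words out to three output files.
with open('input.txt') as f, \
     open('words1.txt', 'w') as out1, \
     open('words2.txt', 'w') as out2, \
     open('words3.txt', 'w') as out3:
    for line in read_file(f):
        word, word1, word2 = line.rstrip('\n').split(',')
        out1.write(word + '\n')
        out2.write(word1 + '\n')
        out3.write(word2 + '\n')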

Comments
