
I have been trying to read a large file and, after processing the data from the input file, write the results to another file at the same time. Each input file is huge, around 4-8 GB. Is there a way to parallelise the process to save time?

The original program is:

with open(infile,"r") as filein:
with open(writefile,"w") as filewrite: 
    with open(errorfile,"w") as fileerror:
        line=filein.readline()
        count=0
        filewrite.write("Time,Request,IP,MAC\n")
        while line:
            count+=1
            line=filein.readline()
            #print "{}: {}".format(count,line.strip()) testing content
            if requestp.search(line):
                filewrite.write(line.strip()[:15]+",")
                filewrite.write(requestp.search(line).group()+",")
                if IP.search(line):
                    filewrite.write(IP.search(line).group())
                filewrite.write(",")
                if MACp.search(line):
                    filewrite.write(MACp.search(line).group())
                filewrite.write("\n")
            else:
                fileerror.write(line)

But this takes too much time to process a single file, and I have hundreds of such files. I've tried using ipyparallel to parallelise the code but have not met with success yet. Is there a way to do this?

  • Split your input file in chunks, send each chunk to a distinct process, and merge the results. Basically, use the map/reduce pattern (a sketch of this appears after these comments). Commented Jun 8, 2018 at 11:08
  • IMHO you should not try to parallelize sequential I/O. At most you could split the work into three stages: reading, processing, and writing, so the processing runs while the I/O operations are blocking (see the pipeline sketch after these comments). Commented Jun 8, 2018 at 11:44
  • @brunodesthuilliers can the files be split inside Python itself? Commented Jun 8, 2018 at 12:17
  • @SergeBallesta how would I do that? I couldn't find how to split the reading, processing and writing. Commented Jun 8, 2018 at 12:21
  • 1
    It might help if you actually said what you are trying to do, if you showed sample lines of input and corresponding output, if you stated your OS... I suspect it would go considerably faster with awk, if your OS has that, and with GNU Parallel if your OS has that, and if you used a different disk for input and output, if you have that. Commented Jun 9, 2018 at 15:56
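
A minimal sketch of the chunk/map-reduce approach suggested in the first comment. The regex patterns and file names below are hypothetical stand-ins, since the real ones are not shown in the question:

import re
from multiprocessing import Pool

# Hypothetical stand-ins for the question's precompiled patterns, which are not shown.
requestp = re.compile(r"GET|POST|PUT|DELETE")
IP = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
MACp = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")

def process_chunk(lines):
    """Map step: turn a chunk of raw lines into (csv_rows, error_lines)."""
    rows, errors = [], []
    for line in lines:
        request = requestp.search(line)
        if not request:
            errors.append(line)
            continue
        ip = IP.search(line)
        mac = MACp.search(line)
        rows.append("{},{},{},{}\n".format(
            line.strip()[:15],
            request.group(),
            ip.group() if ip else "",
            mac.group() if mac else ""))
    return rows, errors

def read_chunks(fileobj, size=100000):
    """Yield lists of `size` lines so each worker gets a reasonably large batch."""
    chunk = []
    for line in fileobj:
        chunk.append(line)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

if __name__ == "__main__":
    infile, writefile, errorfile = "input.log", "out.csv", "errors.log"  # hypothetical names
    with open(infile) as filein, \
         open(writefile, "w") as filewrite, \
         open(errorfile, "w") as fileerror:
        filewrite.write("Time,Request,IP,MAC\n")
        with Pool() as pool:
            # Reduce step: merge each worker's results back into the two output files.
            for rows, errors in pool.imap(process_chunk, read_chunks(filein)):
                filewrite.writelines(rows)
                fileerror.writelines(errors)

Since there are hundreds of independent files, an even simpler variant is to keep the original per-file loop unchanged and let the pool map a process-one-file function over the list of file names, one worker per file.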
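
A minimal sketch of the reader/processor/writer split suggested in the second comment, using threads and bounded queues; the pattern and file name definitions are the same hypothetical stand-ins as in the previous sketch. Because the regex work is CPU-bound and subject to the GIL, this mainly hides the time spent blocking on disk I/O rather than using more cores:

import re
import queue
import threading

# Same hypothetical stand-ins for the patterns and file names as in the previous sketch.
requestp = re.compile(r"GET|POST|PUT|DELETE")
IP = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
MACp = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")
infile, writefile, errorfile = "input.log", "out.csv", "errors.log"

def reader(path, out_q):
    """Stage 1: read lines and hand them to the processing stage."""
    with open(path) as f:
        for line in f:
            out_q.put(line)
    out_q.put(None)  # sentinel: no more input

def processor(in_q, ok_q, err_q):
    """Stage 2: turn raw lines into CSV rows while the reader is blocked on I/O."""
    while True:
        line = in_q.get()
        if line is None:
            ok_q.put(None)
            err_q.put(None)
            break
        request = requestp.search(line)
        if request:
            ip = IP.search(line)
            mac = MACp.search(line)
            ok_q.put("{},{},{},{}\n".format(
                line.strip()[:15], request.group(),
                ip.group() if ip else "", mac.group() if mac else ""))
        else:
            err_q.put(line)

def writer(path, in_q, header=None):
    """Stage 3: drain a queue to disk until the sentinel arrives."""
    with open(path, "w") as f:
        if header:
            f.write(header)
        while True:
            item = in_q.get()
            if item is None:
                break
            f.write(item)

# Bounded queues provide back-pressure so the reader cannot run far ahead of the writers.
lines_q, rows_q, errors_q = queue.Queue(10000), queue.Queue(10000), queue.Queue(10000)
threads = [
    threading.Thread(target=reader, args=(infile, lines_q)),
    threading.Thread(target=processor, args=(lines_q, rows_q, errors_q)),
    threading.Thread(target=writer, args=(writefile, rows_q, "Time,Request,IP,MAC\n")),
    threading.Thread(target=writer, args=(errorfile, errors_q)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()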
