
I have been trying to read a large file and, after processing the data from the input file, write the results to another file at the same time. Each input file is huge, around 4-8 GB. Is there a way to parallelise the process to save time?

The original program is:

with open(infile,"r") as filein:
with open(writefile,"w") as filewrite: 
    with open(errorfile,"w") as fileerror:
        line=filein.readline()
        count=0
        filewrite.write("Time,Request,IP,MAC\n")
        while line:
            count+=1
            line=filein.readline()
            #print "{}: {}".format(count,line.strip()) testing content
            if requestp.search(line):
                filewrite.write(line.strip()[:15]+",")
                filewrite.write(requestp.search(line).group()+",")
                if IP.search(line):
                    filewrite.write(IP.search(line).group())
                filewrite.write(",")
                if MACp.search(line):
                    filewrite.write(MACp.search(line).group())
                filewrite.write("\n")
            else:
                fileerror.write(line)

But this takes too much time to process a single file, and I have hundreds of such files. I've tried using ipyparallel to parallelise the code but have not met with success yet. Is there a way to do this?

  • Split your input file in chunks, send each chunk to a distinct process, and merge the results. Basically, use the map/reduce pattern (a sketch of this appears after these comments). Commented Jun 8, 2018 at 11:08
  • IMHO you should not try to parallelize sequential I/O. At most you could split the work into three stages: reading, processing, and writing, so the processing runs while the I/O operations are blocking (see the pipeline sketch after these comments). Commented Jun 8, 2018 at 11:44
  • @brunodesthuilliers can the files be split inside Python itself? Commented Jun 8, 2018 at 12:17
  • @SergeBallesta how would I do that? I couldn't find how to split the reading, processing and writing. Commented Jun 8, 2018 at 12:21
  • 1
    It might help if you actually said what you are trying to do, if you showed sample lines of input and corresponding output, if you stated your OS... I suspect it would go considerably faster with awk, if your OS has that, and with GNU Parallel if your OS has that, and if you used a different disk for input and output, if you have that. Commented Jun 9, 2018 at 15:56
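
A minimal sketch of the chunk/map-reduce approach suggested in the first comment. The regex patterns and file names below are hypothetical stand-ins, since the real ones are not shown in the question:

import re
from multiprocessing import Pool

# Hypothetical stand-ins for the question's precompiled patterns, which are not shown.
requestp = re.compile(r"GET|POST|PUT|DELETE")
IP = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
MACp = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")

def process_chunk(lines):
    """Map step: turn a chunk of raw lines into (csv_rows, error_lines)."""
    rows, errors = [], []
    for line in lines:
        request = requestp.search(line)
        if not request:
            errors.append(line)
            continue
        ip = IP.search(line)
        mac = MACp.search(line)
        rows.append("{},{},{},{}\n".format(
            line.strip()[:15],
            request.group(),
            ip.group() if ip else "",
            mac.group() if mac else ""))
    return rows, errors

def read_chunks(fileobj, size=100000):
    """Yield lists of `size` lines so each worker gets a reasonably large batch."""
    chunk = []
    for line in fileobj:
        chunk.append(line)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

if __name__ == "__main__":
    infile, writefile, errorfile = "input.log", "out.csv", "errors.log"  # hypothetical names
    with open(infile) as filein, \
         open(writefile, "w") as filewrite, \
         open(errorfile, "w") as fileerror:
        filewrite.write("Time,Request,IP,MAC\n")
        with Pool() as pool:
            # Reduce step: merge each worker's results back into the two output files.
            for rows, errors in pool.imap(process_chunk, read_chunks(filein)):
                filewrite.writelines(rows)
                fileerror.writelines(errors)

Since there are hundreds of independent files, an even simpler variant is to keep the original per-file loop unchanged and let the pool map a process-one-file function over the list of file names, one worker per file.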
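
A minimal sketch of the reader/processor/writer split suggested in the second comment, using threads and bounded queues; the pattern and file name definitions are the same hypothetical stand-ins as in the previous sketch. Because the regex work is CPU-bound and subject to the GIL, this mainly hides the time spent blocking on disk I/O rather than using more cores:

import re
import queue
import threading

# Same hypothetical stand-ins for the patterns and file names as in the previous sketch.
requestp = re.compile(r"GET|POST|PUT|DELETE")
IP = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
MACp = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")
infile, writefile, errorfile = "input.log", "out.csv", "errors.log"

def reader(path, out_q):
    """Stage 1: read lines and hand them to the processing stage."""
    with open(path) as f:
        for line in f:
            out_q.put(line)
    out_q.put(None)  # sentinel: no more input

def processor(in_q, ok_q, err_q):
    """Stage 2: turn raw lines into CSV rows while the reader is blocked on I/O."""
    while True:
        line = in_q.get()
        if line is None:
            ok_q.put(None)
            err_q.put(None)
            break
        request = requestp.search(line)
        if request:
            ip = IP.search(line)
            mac = MACp.search(line)
            ok_q.put("{},{},{},{}\n".format(
                line.strip()[:15], request.group(),
                ip.group() if ip else "", mac.group() if mac else ""))
        else:
            err_q.put(line)

def writer(path, in_q, header=None):
    """Stage 3: drain a queue to disk until the sentinel arrives."""
    with open(path, "w") as f:
        if header:
            f.write(header)
        while True:
            item = in_q.get()
            if item is None:
                break
            f.write(item)

# Bounded queues provide back-pressure so the reader cannot run far ahead of the writers.
lines_q, rows_q, errors_q = queue.Queue(10000), queue.Queue(10000), queue.Queue(10000)
threads = [
    threading.Thread(target=reader, args=(infile, lines_q)),
    threading.Thread(target=processor, args=(lines_q, rows_q, errors_q)),
    threading.Thread(target=writer, args=(writefile, rows_q, "Time,Request,IP,MAC\n")),
    threading.Thread(target=writer, args=(errorfile, errors_q)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()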
