I have to read two different types of files at the same time in order to synchronise their data. The files are generated in parallel at different frequencies.
File 1, which will be very big (>10 GB), has the following structure: DATA is a field containing 100 characters, and the number that follows it is a synchronisation signal common to both files (i.e. it changes at the same time in both files).
DATA 1
DATA 1
... another 4000 lines
DATA 1
DATA 0
... another 4000 lines and so on
File 2 is small (at most 10 MB), but there are more of these files. It has the same structure; the difference is the number of rows between synchronisation-signal changes:
DATA 1
... another 300-400 lines
DATA 1
DATA 0
... and so on
Here is the code that I use to read the files:
def getSynchedChunk(fileHandler, lastSynch):
    line_vector = []                          # lines of the current chunk
    for line in fileHandler:                  # iterate over the file
        synch = int(line.split(';')[9])       # get the synch signal (last field)
        line_vector.append(line)
        if synch != lastSynch:                # a transition is detected
            lastSynch = synch                 # remember the new synch value
            return lastSynch, line_vector, True   # True = synch changed
    return lastSynch, line_vector, False      # end of file reached
I have to synchronise the data chunks (the lines that share the same synch-signal value) and write the merged lines to another file. I am using Spyder.
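For context, here is a minimal sketch of how I drive this function over both files. The merge step and the names (`mergeFiles`, `out`) are placeholders, not my real processing code, and `getSynchedChunk` is repeated so the sketch runs standalone:

```python
def getSynchedChunk(fileHandler, lastSynch):
    # same reader as above, repeated so this sketch is self-contained
    line_vector = []
    for line in fileHandler:
        synch = int(line.split(';')[9])
        line_vector.append(line)
        if synch != lastSynch:
            return synch, line_vector, True
    return lastSynch, line_vector, False

def mergeFiles(handle1, handle2, out):
    # placeholder driver: read one synch chunk from each file per
    # iteration, then write both chunks; real merge logic would go here
    synch1 = synch2 = 1                    # both files start with synch == 1
    more = True
    while more:
        synch1, chunk1, changed1 = getSynchedChunk(handle1, synch1)
        synch2, chunk2, changed2 = getSynchedChunk(handle2, synch2)
        for line in chunk1 + chunk2:       # hypothetical merge: just append
            out.write(line)
        more = changed1 and changed2       # stop when either file runs out
```

Note that, as written, the line on which the signal flips ends up at the end of the returned chunk.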
For testing, I used smaller files: 350 MB for File 1 and 35 MB for File 2. I also used the built-in Profiler to see where the most time is spent, and 28 s out of 46 s go into actually reading the data from the files. The rest is spent synchronising the data and writing to the new file.
If I scale that up to gigabyte-sized files, the processing will take hours. I will try to change the way I do the processing to make it faster, but is there a faster way to read through big files?
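One direction I considered is reading the file in large binary blocks and splitting the lines manually, instead of iterating line by line. This is only a sketch (the function name, block size, and path are placeholders, and I have not benchmarked it):

```python
def iter_lines_fast(path, chunk_bytes=1 << 20):
    # read the file in large binary blocks and split lines manually;
    # chunk_bytes (here 1 MiB) is a tuning knob, not a measured optimum
    leftover = b''
    with open(path, 'rb') as f:
        while True:
            block = f.read(chunk_bytes)
            if not block:
                break
            block = leftover + block
            lines = block.split(b'\n')
            leftover = lines.pop()      # last piece may be an incomplete line
            for line in lines:
                yield line
    if leftover:                        # file did not end with a newline
        yield leftover
```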
One line of data looks like this :
01/31/19 08:20:55.886;0.049107050;-0.158385641;9.457415342;-0.025256720;-0.017626805;-0.000096349;0.107;-0.112;0
The values are sensor measurements. The last number is the synch value.
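For clarity, this is how the synch value comes out of the sample line above (it is the last of the ten ';'-separated fields, hence the index 9 in my reader):

```python
line = "01/31/19 08:20:55.886;0.049107050;-0.158385641;9.457415342;-0.025256720;-0.017626805;-0.000096349;0.107;-0.112;0"
fields = line.split(';')    # 10 fields: timestamp, 8 measurements, synch
synch = int(fields[9])      # → 0
```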