
I have to read through 2 different types of files at the same time in order to synchronise their data. The files are generated in parallel with different frequencies.

File 1, which will be very big (>10 GB), has the following structure: DATA is a field containing 100 characters, and the number that follows it is a synchronisation signal that is common to both files (i.e. it changes at the same time in both files).

DATA 1
DATA 1
... another 4000 lines
DATA 1
DATA 0
... another 4000 lines and so on

File 2, small in size (at most 10 MB each, but there are many more of them), has the same structure; the difference is the number of rows between synchronisation signal changes:

DATA 1
... another 300-400 lines
DATA 1
DATA 0
... and so on

Here is the code that I use to read the files:

def getSynchedChunk(fileHandler, lastSynch, end_of_file):

    line_vector = []                          # initialize output list
    for line in fileHandler:                  # iterate over the file
        synch = int(line.split(';')[9])       # get synch signal (last field)
        line_vector.append(line)
        if synch != lastSynch:                # if a transition is detected
            lastSynch = synch                 # update the lastSynch variable for later use
            return (lastSynch, line_vector, True)    # and exit - True = synch changed

    return (lastSynch, line_vector, False)    # exit if end of file is reached

I have to synchronise the data chunks (the lines that have the same synch signal value) and write the new lines to another file. I am using Spyder.
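
The surrounding loop is roughly like this (a simplified sketch with placeholder file names; the actual merging step is more involved):

synch1 = synch2 = 1                       # both files start with synch = 1
more1 = more2 = True                      # True while a synch change was found
with open('FILE1.txt') as f1, open('FILE2.txt') as f2, open('merged.txt', 'w') as out:
    while more1 or more2:
        synch1, chunk1, more1 = getSynchedChunk(f1, synch1, False)
        synch2, chunk2, more2 = getSynchedChunk(f2, synch2, False)
        # ... pair up the lines of chunk1 and chunk2 that share the same
        #     synch value and write the combined lines to the output file
        out.writelines(chunk1)
        out.writelines(chunk2)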

For testing, I used smaller files: 350 MB for FILE 1 and 35 MB for FILE 2. I also used the built-in Profiler to see where most of the time is spent, and it seems that 28 s out of 46 s are spent actually reading the data from the files. The rest is used for synchronising the data and writing to the new file.

If I scale that up to gigabyte-sized files, the processing will take hours to finish. I will try to change the way I do the processing to make it faster, but is there a faster way to read through big files?


One line of data looks like this:

01/31/19 08:20:55.886;0.049107050;-0.158385641;9.457415342;-0.025256720;-0.017626805;-0.000096349;0.107;-0.112;0

The values are sensor measurements. The last number is the synch value.

  • Spyder is an IDE and should in general not influence the outcome of your script. Far more interesting would be information like the file extension of DATA and an excerpt of the first lines of the file. Commented Feb 7, 2019 at 10:18
  • According to your code, you are joining the first line where synch changed with the previous lines. Is that the expected behavior? Commented Feb 7, 2019 at 10:20
  • Do the lines of the two files need to be interleaved, or can you just concatenate the chunks? Commented Feb 7, 2019 at 10:31
  • @EduardPalkoMate Ok, I extended my solution. Does it work? Commented Feb 8, 2019 at 9:44
  • You are welcome! If it is working without low_memory=True, you are better off not using it. It will split the reading into chunks, which will reduce the memory consumption, but will most probably make it slower. Thus only use low_memory=True if you run into a MemoryError. I'll add a short example to my answer in the next minutes. Commented Feb 8, 2019 at 10:03

2 Answers

1

I recommend reading in the whole files first and then doing the processing. This has the huge advantage that all the appending/concatenating etc. while reading is done internally by optimized modules. The synching can be done afterwards.

For this purpose I strongly recommend using pandas, which is IMHO by far the best tool for working with time series data like measurements.

Importing your files, assuming semicolon-separated values in a text file is the correct format, can be done with:

import pandas as pd

df = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True, infer_datetime_format=True,
    dayfirst=False)  # the timestamps are MM/DD/YY, so the day does not come first

To reduce memory consumption, you can either specify a chunksize to split the file reading (a sketch follows below), or pass low_memory=True to internally split the file reading process (assuming that the final DataFrame fits in your memory):

df = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True, infer_datetime_format=True, dayfirst=False,
    low_memory=True)
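
The chunksize variant would look something like this (the chunk size here is just an example value; you would process or collect each piece as it arrives):

# read_csv with chunksize returns an iterator of DataFrames instead of
# one big DataFrame
reader = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True, infer_datetime_format=True, dayfirst=False,
    chunksize=100000)

chunks = []
for chunk in reader:
    chunks.append(chunk)          # or process each chunk right away
df = pd.concat(chunks)            # only if the full result fits in memory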

Now your data will be stored in a DataFrame, which is perfect for time series. The index is already converted to a DateTimeIndex, which will allow for nice plotting, resampling etc. etc...
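
For example, once the DataFrame is loaded (the resampling interval here is arbitrary):

df_10ms = df.resample('10ms').mean()   # average the measurements into 10 ms bins
df.iloc[:, 0].plot()                   # quick plot of the first sensor column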

The sync state can now be accessed just like in a numpy array, using the iloc accessor:

df.iloc[:, 8]  # for all sync states
df.iloc[0, 8]  # for the first synch state
df.iloc[1, 8]  # for the second synch state

This is ideal for using fast vectorized synching of two or more files.
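
A sketch of how that vectorized synching could look, assuming both files have been read into df1 and df2 as shown above and that the synch chunks appear in the same order in both files:

import pandas as pd

def chunk_ids(df):
    """Label each run of rows sharing the same synch value with an increasing id."""
    sync = df.iloc[:, 8]                        # synch signal is the last column
    return (sync != sync.shift()).cumsum()      # id increases at every transition

# pair the chunks of the two files by their id and write them out
with open('merged.txt', 'w') as out:
    for (_, chunk1), (_, chunk2) in zip(df1.groupby(chunk_ids(df1)),
                                        df2.groupby(chunk_ids(df2))):
        merged = pd.concat([chunk1, chunk2])    # or however the lines are combined
        merged.to_csv(out, sep=';', header=False)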


To read the file depending on the available memory:

try:
    df = pd.read_csv(
        'DATA.txt', sep=';', header=None, index_col=0,
        parse_dates=True, infer_datetime_format=True, dayfirst=False)
except MemoryError:
    df = pd.read_csv(
        'DATA.txt', sep=';', header=None, index_col=0,
        parse_dates=True, infer_datetime_format=True, dayfirst=False,
        low_memory=True)

This try/except solution might not be elegant, since it will take some time before the MemoryError is raised, but it is failsafe. And since low_memory=True will most probably reduce file-reading performance, the plain try block should be faster in most cases.


2 Comments

What about memory consumption in this approach? This should work with files bigger than 10GB.
I was just about to add some information concerning this matter. :)
1

I'm not used to Spyder, but you can try multithreading to chunk the big files. Python supports this without any external library, so it will probably work with Spyder as well. (https://docs.python.org/3/library/threading.html)

The process of chunking (a rough sketch follows the list):

  1. Get the length of the file in lines
  2. Keep cutting the list in halves until the pieces are "not too big"
  3. Use a thread for each small chunk.
  4. Profit
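
Something along these lines (process_chunk is just a placeholder for the actual per-chunk work, and this version reads the whole file into memory first, so it only makes sense for files that fit in RAM):

import threading

def process_chunk(lines, chunk_no):
    # placeholder for the real per-chunk processing
    print('chunk %d: %d lines' % (chunk_no, len(lines)))

with open('FILE1.txt') as f:
    lines = f.readlines()                 # 1. get all lines of the file

chunk_size = max(1, len(lines) // 8)      # 2. cut into "not too big" pieces
threads = []
for i, start in enumerate(range(0, len(lines), chunk_size)):
    t = threading.Thread(target=process_chunk,
                         args=(lines[start:start + chunk_size], i))
    threads.append(t)
    t.start()                             # 3. one thread per chunk

for t in threads:
    t.join()                              # wait for all chunks to finish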

3 Comments

I will check it out. Thank you.
This is way too simplistic. A divide and conquer approach only benefits if processing the chunks can be non-linear. It also doesn't account for the necessity of thread workloads containing data of different logical chunks.
That's right, it needs to be tested. There is a point at which threading will actually slow things down. That's why I wrote "not too big" - do not try to divide it down into really small pieces.
