I'm dealing with a bunch of .asc (ASCII) files that are the output of continuous monitoring of various electronic equipment for certification purposes. We monitor parameters of the equipment such as voltage, current, and temperature at different states (modes) of operation [sleep mode, minimal load, maximum load, etc.]. The tests run for an average of 600-700 hours and the data is recorded every 2 seconds. At the end I have datasets of hundreds of MBs that I want to reduce in size. On average about a million data points are generated, which are valid but not necessarily important. It doesn't make sense for me to keep, for example, 5 hours' worth of the same voltage reading that is well inside our tolerance levels (9,000 data points of the same value).
What is crucial for me is that my program monitors the incoming stream of data and looks for errors (tolerance breaches due to a device fault). If no error occurs for a certain amount of time (e.g., 10 minutes after startup), the data should be bunched into a smaller set of data points (reduction by a factor of 2 or 5 or similar). This continues until an error occurs, at which point the program records the point of error as well as the 10 subsequent data points as-is before switching back to the compression method.
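To make the intended behaviour concrete, here is a rough sketch of that logic in Python. The tolerance band, reduction factor, and post-error count are placeholder values, and I'm assuming each sample arrives as a (timestamp, value) pair:

```python
# Sketch of adaptive stream reduction: average blocks of samples while the
# signal stays inside tolerance, but keep any breach (plus the next 10
# samples) at full resolution. All limits below are placeholders.

LOW, HIGH = 11.5, 12.5   # tolerance band (placeholder values)
BLOCK = 5                # reduction factor in the error-free state
POST_ERROR = 10          # raw samples to keep after each breach

def reduce_stream(samples):
    """samples: iterable of (timestamp, value); yields reduced (timestamp, value)."""
    block = []           # samples accumulated for averaging
    raw_left = 0         # raw samples still owed after a breach
    for t, v in samples:
        if not (LOW <= v <= HIGH):          # tolerance breach
            if block:                       # flush any partial block first
                yield (block[0][0], sum(x[1] for x in block) / len(block))
                block = []
            yield (t, v)                    # keep the breach itself as-is
            raw_left = POST_ERROR
        elif raw_left > 0:
            yield (t, v)                    # keep post-error samples as-is
            raw_left -= 1
        else:
            block.append((t, v))
            if len(block) == BLOCK:         # emit one averaged point per block
                yield (block[0][0], sum(x[1] for x in block) / len(block))
                block = []
    if block:                               # flush the trailing partial block
        yield (block[0][0], sum(x[1] for x in block) / len(block))
```

Because it is a generator over a stream, this could run online during the test rather than as a post-processing step; only the current block and the post-error counter are held in memory.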
What approaches could I take to reduce this data so that during analysis we end up with sensible data (representative of the success of the tests) that is also significantly smaller than what we get right now? Would averaging the data be a good option here? If so, would averaging over a fixed number of samples or over a fixed time window be more appropriate? I was also told to consider filtering (Kalman or moving average), but I'm not sure those serve my purpose, since I'm not looking to eliminate any wild data but rather to reduce 100 numbers in a similar range into 10 numbers.
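For the count-versus-time question, here is a minimal comparison of the two kinds of block averaging (the window sizes are illustrative, not from my actual setup):

```python
import numpy as np

def average_by_count(values, n):
    """Average every n consecutive samples (count-based reduction)."""
    trimmed = values[: len(values) // n * n]   # drop the incomplete tail block
    return trimmed.reshape(-1, n).mean(axis=1)

def average_by_time(times, values, window):
    """Average all samples falling in each `window`-second bin (time-based)."""
    bins = ((times - times[0]) // window).astype(int)
    return np.array([values[bins == b].mean() for b in np.unique(bins)])
```

With a fixed 2-second sampling interval the two are equivalent (a block of n samples is exactly a 2n-second window), so time-based binning would only matter if the sample rate varied or the log had gaps.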
Thank you in advance. As a first-time poster, I'm open to any suggestions regarding further research into this topic or regarding posting style. The problem is fairly critical, so I'd be grateful for any and all suggestions pertaining to it.