The algorithm builds a new list from an input data array. It appends an element from the input only when that element differs from the last stored element by more than visibleDelta:

def subsample(data, visibleDelta):
    subsampled = [data[0]]

    for point in data[1:]:
        if abs(point - subsampled[-1]) > visibleDelta:
            subsampled.append(point)

    return subsampled

Problem is I need this to run on very large datasets (~1B values), and I'd like to use numpy or some other numerical library to do this if possible.

I should probably mention that the 'real' function won't just deal with a 1D array of data. The input data will be a pandas dataframe, with the first column being x values, and the second being y values (I'll be comparing the y values).
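
For the DataFrame case, one option is to run the same loop over the y column as a NumPy array, collect the positions of the rows to keep, and select them with iloc. A minimal sketch, assuming the columns are named x and y and using a made-up helper name, subsample_frame (neither comes from the real data):

```python
import numpy as np
import pandas as pd

def subsample_frame(df, visibleDelta, ycol="y"):
    """Keep rows whose y differs from the last kept y by more than visibleDelta."""
    # "y" as the column name is an assumption for this sketch.
    y = df[ycol].to_numpy()
    if len(y) == 0:
        return df.iloc[:0]
    keep = [0]        # positions of kept rows; the first row is always kept
    last = y[0]       # y value of the most recently kept row
    for i in range(1, len(y)):
        if abs(y[i] - last) > visibleDelta:
            keep.append(i)
            last = y[i]
    return df.iloc[keep]

df = pd.DataFrame({"x": np.arange(6), "y": [0.0, 0.1, 0.3, 0.35, 0.9, 0.95]})
result = subsample_frame(df, 0.2)
```

This is still a pure-Python loop, so it only shows the DataFrame plumbing; it will likely be far too slow for ~1B rows.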

Any way to do this efficiently?

1 Answer

If you want to track the data this way, where each decision depends on the last point kept, NumPy is not the right tool. Look at Numba or Cython for efficiency.
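
As a rough sketch of the Numba route: the loop from the question can be compiled with njit and made to return the indices of the kept samples. The function name subsample_indices is made up for illustration, and the try/except fallback is only there so the snippet still runs (slowly) when Numba is not installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback: without Numba the function runs as ordinary Python.
    def njit(func):
        return func

@njit
def subsample_indices(y, visible_delta):
    # Worst case every sample is kept, so allocate room for all indices.
    keep = np.empty(y.size, dtype=np.int64)
    if y.size == 0:
        return keep[:0]
    keep[0] = 0       # the first sample is always kept
    n = 1             # number of indices stored so far
    last = y[0]       # value of the most recently kept sample
    for i in range(1, y.size):
        if abs(y[i] - last) > visible_delta:
            keep[n] = i
            n += 1
            last = y[i]
    return keep[:n]

y = np.array([0.0, 0.1, 0.3, 0.35, 0.9, 0.95])
idx = subsample_indices(y, 0.2)
```

With a DataFrame, something like df.iloc[subsample_indices(df["y"].to_numpy(), visibleDelta)] should select the matching rows, assuming the y column is numeric. Note that the index buffer is as long as the input, which matters for memory at ~1B values.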

A slightly different approach is to define fixed thresholds (multiples of visibledelta) and look at where the data crosses them:

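import numpy as np
import matplotlib.pyplot as plt

data = np.sin(np.arange(1e6) / 3e4)
visibledelta = 0.2
# Assign each sample to a band of height visibledelta
cat = np.floor(data / visibledelta)
# Indices where consecutive samples fall in different bands
subsample = np.flatnonzero(np.diff(cat))
plt.plot(data)
plt.plot(subsample, data[subsample], 'o')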

which gives a plot of the data with the selected samples marked as circles.

Some adjustments may be needed, but the data is split into chunks.
