I have a list of time series (as a pandas DataFrame) and want to calculate the matrix profile for each time series (one per device). One option is to iterate over all the devices, which seems slow. A second option would be to group by device and apply a UDF. The problem is that the UDF returns rows 1:1: instead of a single scalar value per group, it outputs the same number of rows as its input.

Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?

import pandas as pd
df = pd.DataFrame({
    'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)

print('***************************')
# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    
    this_group = df[df.bar == g].copy()  # copy so adding a column does not trigger SettingWithCopyWarning
    # perform a UDF which needs to have all the values per group
    # i.e. for real I want to calculate the matrixprofile for each time-series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)

print('***************************')

def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x

# neatly vectorized application of a non_scalar function
# but this fails as:  Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
  • For this, we may need to see particulars of UDF. Commented Nov 9, 2020 at 12:54
  • Sure: see gist.github.com/geoHeil/7344932b27f05bfaab551b3b948ac2c5 for code that generates an example dataset and uses the stumpy.stump UDF. Commented Nov 9, 2020 at 13:31
  • I guess that the second (non-accepted) answer, stackoverflow.com/questions/42171132/…, should work here as well; I will give it a try. Commented Nov 9, 2020 at 14:24
  • Does stumpy.stump return a single scalar value? The docs indicate it returns an ndarray of 4 columns. Please post example output of one call and what single scalar value you need to extract. Commented Nov 9, 2020 at 15:07

2 Answers

For non-aggregating functions applied to each distinct group, i.e. functions that return a non-scalar value, you need to call the method on each group and then combine the results.

Therefore, consider a list or dict comprehension over groupby(), followed by concat. Be sure the method takes and returns a full data frame or series; if it produces an ndarray, wrap it in one of those first, since concat only accepts pandas objects.

# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)

# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)
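
As a minimal, runnable sketch of the list-comprehension pattern above, using the sample frame from the question: `add_group_demeaned` is a hypothetical stand-in for the real UDF (it is not the matrix profile), chosen only because it needs all values of a group and returns one row per input row.

```python
import pandas as pd

def add_group_demeaned(sub: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in UDF: needs the whole group, returns one row per input row."""
    out = sub.copy()  # work on a copy so the original frame is left untouched
    out["result"] = out["baz"] - out["baz"].mean()
    return out

df = pd.DataFrame({"foo": [1, 2, 3], "baz": [1.1, 0.5, 4.0], "bar": [1, 2, 1]})

# Apply the UDF to each group, then stitch the pieces back together.
pieces = [add_group_demeaned(sub) for _, sub in df.groupby("bar")]
final_df = pd.concat(pieces).sort_index()  # sort_index restores the original row order
```

Because each piece keeps its original index, `sort_index()` is enough to line the result up with the input again; passing `ignore_index=True` to `concat` would discard that link.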

Indeed, this (see also the link in the comment above) is a way to get it to work in a faster, more desirable way. Perhaps there is an even better alternative.

import pandas as pd
df = pd.DataFrame({
    'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)

grouped_df = df.groupby(['bar'])

altered = []
for index, subframe in grouped_df:
    display(subframe)
    subframe = subframe  # placeholder: apply the real UDF here instead of this no-op
    altered.append(subframe)
    print (index)
    #print (subframe)
   
pd.concat(altered, ignore_index=True)
#pd.DataFrame(altered)
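
One possible alternative, offered as a sketch rather than a tested solution for the matrix profile: when the UDF returns exactly one value per input row, pandas' `groupby(...).transform` can write the result straight back into the original frame without a manual loop or `concat`. The function `per_group_udf` below is a hypothetical stand-in. A windowed computation whose output is shorter than its input would, I assume, need to be padded to the group's length before it fits this pattern.

```python
import pandas as pd

def per_group_udf(values: pd.Series) -> pd.Series:
    """Hypothetical stand-in UDF: sees all values of one group, returns the same number of values."""
    arr = values.to_numpy()
    shifted = arr - arr.min()  # any computation that needs the full group
    # Re-attach the group's index so pandas can align the output with the input rows.
    return pd.Series(shifted, index=values.index)

df = pd.DataFrame({"foo": [1, 2, 3], "baz": [1.1, 0.5, 4.0], "bar": [1, 2, 1]})

# transform calls the UDF once per group and returns a result aligned to df's index.
df["result"] = df.groupby("bar")["baz"].transform(per_group_udf)
```

Note that this still calls the Python function once per group, so it tidies up the code more than it speeds up the computation.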
