4

I have a outer/inner loop-based function I'm trying to vectorise using Python Polars DataFrames. The function is a type of moving average and will be used to filter time-series financial data. Here's the function:

def ma_j(df_src: pl.DataFrame, depth: float):

    jrc04 = 0.0
    jrc05 = 0.0
    jrc06 = 0.0
    jrc08 = 0.0

    series = df_src['close']

    for x in range(0, len(series)):
        if x >= x - depth*2:
            for k in np.arange(start=math.ceil(depth), stop=0, step=-1):
                jrc04 = jrc04 + abs(series[x-k] - series[x-(k+1)])
                jrc05 = jrc05 + (depth + k) * abs(series[x-k] - series[x-(k+1)])
                jrc06 = jrc06 + series[x-(k+1)]
        else:
            jrc03 = abs(series - (series[1]))
            jrc13 = abs(series[x-depth] - series[x - (depth+1)])
            jrc04 = jrc04 - jrc13 + jrc03
            jrc05 = jrc05 - jrc04 + jrc03 * depth
            jrc06 = jrc06 - series[x - (depth+1)] + series[x-1]
        jrc08 = abs(depth * series[x] - jrc06)

    if jrc05 == 0.0:
        ma = 0.0
    else:
        ma = jrc08/jrc05

    return ma

The tricky bit for me are multiple the inner loop look-backs (for k in...). I've looked through multiple examples that use group_by_dynamic on the timeseries data. For example, here. I've also seen an example for rolling, but this still seems to use a period.

However, I'd like to strip away the timeseries and just use source Series. Does this mean I need to group on an integer range?

Using this data example:

import polars as pl
import numpy as np

i, t, v = np.arange(0, 50, 1), np.arange(0, 100, 2), np.random.randint(1,101,50)
df = pl.DataFrame({"i": i, "t": t, "rand": v})
df = df.with_columns((pl.datetime(2022,10,30) + pl.duration(seconds=df["t"])).alias("datetime")).drop("t")
cols = ["i", "datetime", "rand"]
df = df.select(cols)

DataFrame looks like this:

shape: (50, 3)
┌─────┬─────────────────────┬──────┐
│ i   ┆ datetime            ┆ rand │
│ --- ┆ ---                 ┆ ---  │
│ i64 ┆ datetime[μs]        ┆ i64  │
╞═════╪═════════════════════╪══════╡
│ 0   ┆ 2022-10-30 00:00:00 ┆ 87   │
│ 1   ┆ 2022-10-30 00:00:02 ┆ 66   │
│ 2   ┆ 2022-10-30 00:00:04 ┆ 30   │
│ 3   ┆ 2022-10-30 00:00:06 ┆ 87   │
│ 4   ┆ 2022-10-30 00:00:08 ┆ 74   │
│ …   ┆ …                   ┆ …    │
│ 45  ┆ 2022-10-30 00:01:30 ┆ 91   │
│ 46  ┆ 2022-10-30 00:01:32 ┆ 52   │
│ 47  ┆ 2022-10-30 00:01:34 ┆ 68   │
│ 48  ┆ 2022-10-30 00:01:36 ┆ 26   │
│ 49  ┆ 2022-10-30 00:01:38 ┆ 99   │
└─────┴─────────────────────┴──────┘

...I can do a grouping by datetime like this":

df.group_by_dynamic("datetime", every="10s").agg(
    pl.col("rand").mean().alias('rolling mean')
)

which gives this: enter image description here

But there's 3 issues with this:

  • I don't want to group of datetime...I want to group on every row (maybe i?) in bins of [x] size.
  • I need values against every row
  • I would like to define the aggregation function, as per the various cases in the function above

Any tips on how I could attack this using Polars? Thanks.

---------- Edit 1

Following @ritchie46 's awesome advice (thanks mate!), here's the groupby:

result_grp = (
    df
    .rolling(index_column="i", period="10i")
    .agg(
        pl.len().alias("rolling_slots"),
        pl.col("rand").mean().alias("roll_mean")
    )
)

df2 = df.select(
    pl.all(),
    result_grp.get_column("rolling_slots"),
    result_grp.get_column("roll_mean"),
)

This now gives:

shape: (50, 5)
┌─────┬─────────────────────┬──────┬───────────────┬───────────┐
│ i   ┆ datetime            ┆ rand ┆ rolling_slots ┆ roll_mean │
│ --- ┆ ---                 ┆ ---  ┆ ---           ┆ ---       │
│ i64 ┆ datetime[μs]        ┆ i64  ┆ u32           ┆ f64       │
╞═════╪═════════════════════╪══════╪═══════════════╪═══════════╡
│ 0   ┆ 2022-10-30 00:00:00 ┆ 23   ┆ 1             ┆ 23.0      │
│ 1   ┆ 2022-10-30 00:00:02 ┆ 72   ┆ 2             ┆ 47.5      │
│ 2   ┆ 2022-10-30 00:00:04 ┆ 46   ┆ 3             ┆ 47.0      │
│ 3   ┆ 2022-10-30 00:00:06 ┆ 37   ┆ 4             ┆ 44.5      │
│ 4   ┆ 2022-10-30 00:00:08 ┆ 12   ┆ 5             ┆ 38.0      │
│ …   ┆ …                   ┆ …    ┆ …             ┆ …         │
│ 45  ┆ 2022-10-30 00:01:30 ┆ 95   ┆ 10            ┆ 53.7      │
│ 46  ┆ 2022-10-30 00:01:32 ┆ 100  ┆ 10            ┆ 62.7      │
│ 47  ┆ 2022-10-30 00:01:34 ┆ 6    ┆ 10            ┆ 62.2      │
│ 48  ┆ 2022-10-30 00:01:36 ┆ 27   ┆ 10            ┆ 56.5      │
│ 49  ┆ 2022-10-30 00:01:38 ┆ 33   ┆ 10            ┆ 54.5      │
└─────┴─────────────────────┴──────┴───────────────┴───────────┘

This is great; now instead of mean(), how do I apply a custom function on the grouped values, such as:

f_jparams(depth_array, jrc04, jrc05, jrc06, jrc08):
    
    _depth = len(depth_array)
    
    if len(depth_array) > 3:
        for x in np.arange(start=1, stop=len(depth_array), step=1):
            jrc04 = jrc04 + abs(depth_array[x] - depth_array[x-1])
            jrc05 = jrc05 + (_depth+x) * abs(depth_array[x] - depth_array[x-1])
            jrc06 = jrc06 + depth_array[x-1]
    else:
        jrc03 = abs(depth_array[_depth-1] - depth_array[_depth-2])
        jrc13 = abs(depth_array[0] - depth_array[1])
        jrc04 = jrc04 - jrc13 + jrc03
        jrc05 = jrc05 - jrc04 + jrc03*_depth
        jrc06 = jrc06 - depth_array[1] + depth_array[_depth-2]
        
    jrc08 = abs(_depth * depth_array[0] - jrc06)
    
    if jrc05 == 0.0:
        ma = 0.0
    else:
        ma = jrc08/jrc05
        
    return ma, jrc04, jrc05, jrc06, jrc08

Thanks!

---- Edit 2:

Thanks to this post, I can collect up the items in the rand rolling group into a list for each row:

depth = 10

result_grp = (
    df
    .rolling(
        index_column="i", 
        period=str(depth) + "i",
        # offset="0i",
        # closed="left"
    )
    .agg(
        pl.len().alias("rolling_slots"),
        pl.col("rand").mean().alias("roll_mean"),
        pl.col("rand").name.suffix('_val_list'),
    )
)

df2 = df.select(
    pl.all(),
    result_grp.get_column("rolling_slots"),
    result_grp.get_column("roll_mean"),
    result_grp.get_column("rand_val_list"),
)

Also from this post, I saw a way to make the rolling window period a variable; nice!

Is there a way to use get_columns and exclude together so I don't have to list every col I want?

The dataframe now looks like:

shape: (50, 6)
┌─────┬─────────────────────┬──────┬───────────────┬───────────┬─────────────────┐
│ i   ┆ datetime            ┆ rand ┆ rolling_slots ┆ roll_mean ┆ rand_val_list   │
│ --- ┆ ---                 ┆ ---  ┆ ---           ┆ ---       ┆ ---             │
│ i64 ┆ datetime[μs]        ┆ i64  ┆ u32           ┆ f64       ┆ list[i64]       │
╞═════╪═════════════════════╪══════╪═══════════════╪═══════════╪═════════════════╡
│ 0   ┆ 2022-10-30 00:00:00 ┆ 23   ┆ 1             ┆ 23.0      ┆ [23]            │
│ 1   ┆ 2022-10-30 00:00:02 ┆ 72   ┆ 2             ┆ 47.5      ┆ [23, 72]        │
│ 2   ┆ 2022-10-30 00:00:04 ┆ 46   ┆ 3             ┆ 47.0      ┆ [23, 72, 46]    │
│ 3   ┆ 2022-10-30 00:00:06 ┆ 37   ┆ 4             ┆ 44.5      ┆ [23, 72, … 37]  │
│ 4   ┆ 2022-10-30 00:00:08 ┆ 12   ┆ 5             ┆ 38.0      ┆ [23, 72, … 12]  │
│ …   ┆ …                   ┆ …    ┆ …             ┆ …         ┆ …               │
│ 45  ┆ 2022-10-30 00:01:30 ┆ 95   ┆ 10            ┆ 53.7      ┆ [10, 11, … 95]  │
│ 46  ┆ 2022-10-30 00:01:32 ┆ 100  ┆ 10            ┆ 62.7      ┆ [11, 84, … 100] │
│ 47  ┆ 2022-10-30 00:01:34 ┆ 6    ┆ 10            ┆ 62.2      ┆ [84, 53, … 6]   │
│ 48  ┆ 2022-10-30 00:01:36 ┆ 27   ┆ 10            ┆ 56.5      ┆ [53, 46, … 27]  │
│ 49  ┆ 2022-10-30 00:01:38 ┆ 33   ┆ 10            ┆ 54.5      ┆ [46, 56, … 33]  │
└─────┴─────────────────────┴──────┴───────────────┴───────────┴─────────────────┘

Should I just now resort back to looping through the rand_val_list column and send each grouped values list to my function? Or is there a better polars way?

Thanks again!

1
  • Sorry for the images...I can't get my df to print out nicely inline with text. Commented Jan 2, 2023 at 11:45

1 Answer 1

3

Are you searching for periods="10i"?

Polars rolling accepts a period argument with the following query language:

        - 1ns   (1 nanosecond)
        - 1us   (1 microsecond)
        - 1ms   (1 millisecond)
        - 1s    (1 second)
        - 1m    (1 minute)
        - 1h    (1 hour)
        - 1d    (1 day)
        - 1w    (1 week)
        - 1mo   (1 calendar month)
        - 1y    (1 calendar year)
        - 1i    (1 index count)

Where i is simply the number of indices/rows.

So on your data a rolling group_by where we count the number of slots would give:

(df.rolling(index_column="i", period="10i")
   .agg(
       pl.len().alias("rolling_slots")
   )
)
shape: (50, 2)
┌─────┬───────────────┐
│ i   ┆ rolling_slots │
│ --- ┆ ---           │
│ i64 ┆ u32           │
╞═════╪═══════════════╡
│ 0   ┆ 1             │
│ 1   ┆ 2             │
│ 2   ┆ 3             │
│ 3   ┆ 4             │
│ 4   ┆ 5             │
│ …   ┆ …             │
│ 45  ┆ 10            │
│ 46  ┆ 10            │
│ 47  ┆ 10            │
│ 48  ┆ 10            │
│ 49  ┆ 10            │
└─────┴───────────────┘
Sign up to request clarification or add additional context in comments.

1 Comment

Yes, thanks Ritchie. This gives me the grouping I'm after, where I can use depth as the window size. Do you have any thoughts on how I can execute a custom function on the grouped values?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.