
I am working on an investment app in Django that requires calculating portfolio balances and values over time. The database is currently set up this way:

class Ledger(models.Model):
    asset = models.ForeignKey('Asset', ....)
    amount = models.FloatField(...)
    date = models.DateTimeField(...)
    ...

class HistoricalPrices(models.Model):
    asset = models.ForeignKey('Asset', ....)
    price = models.FloatField(...)
    date = models.DateTimeField(...)

Users enter transactions in the Ledger, and I update prices through APIs.

To calculate the balance for a day (note that multiple Ledger entries for the same asset can occur on the same day):

from django.db.models import Sum

def balance_date(date):
    return Ledger.objects.filter(date__date__lte=date).values('asset').annotate(total_amount=Sum('amount'))
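
For clarity, this returns one aggregated row per asset. On the sample data further below, called for 2019-10-10, it would give something like this (illustrative only, assuming the primary keys of asset_1 and asset_2 are 1 and 2):

balance_date(datetime.date(2019, 10, 10))
# <QuerySet [{'asset': 1, 'total_amount': 25.0}, {'asset': 2, 'total_amount': 18.0}]>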

Trying to then get values for every day between the date of the first Ledger entry and today becomes more challenging. Currently I am doing it this way, assuming a start_date and end_date that are datetime.date objects, and tr_dates, a list of unique dates on which transactions occurred (to avoid calculating balances on days when nothing happened):

import pandas as pd

# Daily index covering the whole period
idx = pd.date_range(start_date, end_date)

# One balance_date() query per transaction date
main_df = pd.DataFrame(index=tr_dates)
main_df['date_send'] = main_df.index
main_df['balances'] = main_df['date_send'].apply(lambda x: balance_date(x))

# Forward-fill the balances onto every day in the range
main_df = main_df.sort_index()
main_df.index = pd.DatetimeIndex(main_df.index)
main_df = main_df.reindex(idx, method='ffill')
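
For context, it is the reindex(idx, method='ffill') call that carries each balance forward across days with no transactions. A tiny standalone illustration with made-up numbers:

import pandas as pd

s = pd.Series([10, 25], index=pd.DatetimeIndex(['2019-10-08', '2019-10-10']))
print(s.reindex(pd.date_range('2019-10-08', '2019-10-11'), method='ffill'))
# 2019-10-08    10
# 2019-10-09    10
# 2019-10-10    25
# 2019-10-11    25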

This works, but my issue is performance. It takes at least 150-200 ms to run, and then I need to get the prices for each date (all of them, not just transaction dates) and somehow match and multiply them by the correct balances, which brings the total runtime to about 800 ms or more.

Given this is a web app, a view taking at minimum 800 ms to compute is hardly scalable, so I was wondering if anyone has a better way to do this?

EDIT - Simple example of expected input / output

Ledger entries (JSON format):

[
  {
    "asset":"asset_1", 
    "amount": 10, 
    "date": "2015-01-01"
  }, 
  {
    "asset":"asset_2", 
    "amount": 15, 
    "date": "2017-10-15"
  },
  {
    "asset":"asset_1", 
    "amount": -5, 
    "date": "2018-02-09"
  },  
  {
    "asset":"asset_1", 
    "amount": 20, 
    "date": "2019-10-10"
  }, 
  {
    "asset":"asset_2", 
    "amount": 3, 
    "date": "2019-10-10"
  }
]

Sample prices from HistoricalPrices:

[
  {
    "date": "2015-01-01",
    "asset": "asset_1",
    "price": 5
  },
  {
    "date": "2015-01-01",
    "asset": "asset_2",
    "price": 15
  },
  {
    "date": "2015-01-02",
    "asset": "asset_1",
    "price": 6
  },
  {
    "date": "2015-01-02",
    "asset": "asset_2",
    "price": 11
  },
  ...
  {
    "date": "2017-10-15",
    "asset": "asset_1",
    "price": 20
  },
  {
    "date": "2017-10-15",
    "asset": "asset_2",
    "price": 30
  }
]


In this case:

tr_dates is ['2015-01-01', '2017-10-15', '2018-02-09', '2019-10-10']
date_range is ['2015-01-01', '2015-01-02', '2015-01-03', ..., '2019-12-14', '2019-12-15']

Final output I am after: Balances by date with price by date and total value by date

date           asset       balance      price          value

2015-01-01     asset_1     10           5              50
2015-01-01     asset_2     0            15             0

.... balances do not change as there are no new Ledger entries but prices change

2015-01-02     asset_1     10           6              60
2015-01-02     asset_2     0            11             0

.... all dates between 2015-01-02 and 2017-10-15 (no change in balance but change in price)

2017-10-15     asset_1     10           20             200
2017-10-15     asset_2     15           30             450

... dates in between

2018-02-09     asset_1     5            .. etc based on price
2018-02-09     asset_2     15           .. etc based on price

... dates in between

2019-10-10     asset_1     25           .. etc based on price
2019-10-10     asset_2     18           .. etc based on price

... goes until the end of date_range

I have managed to get this working, but it takes about a second to compute, and I ideally need it to be at least 10x faster if possible.

EDIT 2 - Following ac2001's method:

from django.db.models import F, Sum, Window

ledger = (Ledger
          .transaction
          .filter(portfolio=p)
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))

df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date).dt.date
df.set_index('transaction_date', inplace=True)
df.sort_index(inplace=True)
df = df.groupby(by=['asset', 'transaction_date']).sum()

This yields the following dataframe (with a MultiIndex):

                             transaction_amount
asset       transaction_date
asset_1     2015-01-01                     10.0
            2018-02-09                      5.0
            2019-10-10                     25.0
asset_2     2017-10-15                     15.0
            2019-10-10                     18.0

These balances are correct (and also come out right on more complex data), but now I need a way to forward-fill these results to all the dates in between, as well as from the last date (2019-10-10) to today (2019-12-15), and I am not sure how that works given the MultiIndex.

Final solution

Thanks to @ac2001's code and pointers I have come up with the following:


from django.db.models import F, Sum, Window
import pandas as pd

ledger = (Ledger
          .objects
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))

df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date)
df.set_index('transaction_date', inplace=True)
df.sort_index(inplace=True)
df['date_cast'] = pd.to_datetime(df.index).dt.date

# Keep the last running total per asset per day
df_grouped = df.groupby(by=['asset', 'date_cast']).last()

# Pivot assets into columns, then forward-fill across the full date range (idx)
df_unstacked = df_grouped.unstack(['asset'])
df_unstacked.index = pd.DatetimeIndex(df_unstacked.index)
df_unstacked = df_unstacked.reindex(idx)
df_unstacked = df_unstacked.ffill()

This gives me a matrix of balances (dates as rows, assets as columns). I then build the same matrix of prices by date (from the database) and multiply the two matrices element-wise.
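
For completeness, a minimal sketch of that last step; the pivot shape and column alignment are my assumptions based on the models above, not code from the original implementation (idx and df_unstacked are as defined earlier):

prices = (HistoricalPrices
          .objects
          .values('asset', 'date__date', 'price'))

price_df = pd.DataFrame(list(prices))
price_df['date__date'] = pd.to_datetime(price_df['date__date'])

# Same shape as the balances: dates as rows, assets as columns
price_matrix = (price_df
                .pivot(index='date__date', columns='asset', values='price')
                .reindex(idx)
                .ffill())

# Element-wise multiply; select the single-level asset columns first
balances = df_unstacked['transaction_amount']
values = balances * price_matrix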

Thanks

9 Comments
  • How many dates are in your pandas date_range? It looks like you are making a Ledger query for each date, which is inefficient. I am struggling to follow your models. Where is the ledger date coming from? Are you using the HistoricalPrices model? I think the best approach is to run one query (including the date), set the date as the index in a df, resample the index to daily, and ffill. If you post more of your model info I can try to help with the actual code. Commented Dec 15, 2019 at 5:17
  • Sorry, I forgot to mention the date in the Ledger model. The user enters that. I then use that date to find the right price in the HistoricalPrices model. But first, I need to determine the balance for each date. With this code, I am only querying for dates when a "transaction" was recorded in the Ledger table (tr_dates), as these are the only instances where balances change. Then I use the ffill method to copy balances forward, and hence avoid querying for each and every date in date_range (which can get quite long, depending on the date of the first ledger entry). Commented Dec 15, 2019 at 13:01
  • A couple of clarifications: what is tr_dates? In the code above, are you doing anything with HistoricalPrices? Is getting the historical prices just another problem after the main one (getting daily ledger balances)? Commented Dec 15, 2019 at 13:15
  • It would be helpful to see sample input data and the expected output too. Always helpful on these types of data problems. Commented Dec 15, 2019 at 13:20
  • Ok, will add it as an edit in a few minutes! Didn't want to make the description too long / complicated originally Commented Dec 15, 2019 at 13:36

2 Answers


I think this might take some back and forth. I think the best approach is to do this in a couple of steps.

Let's start by getting daily asset balances, and then we will merge the prices in. The transaction amount is a cumulative (running) total per asset. Does this look correct? I don't have your data, so it is a little difficult for me to tell.

    from django.db.models import F, Sum, Window
    import pandas as pd

    ledger = (Ledger
              .objects
              .annotate(transaction_date=F('date__date'))
              .annotate(transaction_amount=Window(expression=Sum('amount'),
                                                  order_by=[F('asset').asc(), F('date').asc()],
                                                  partition_by=[F('asset')]))
              .values('asset', 'transaction_date', 'transaction_amount'))

    df = pd.DataFrame(list(ledger))
    df.transaction_date = pd.to_datetime(df.transaction_date)
    df.set_index('transaction_date', inplace=True)
    df = df.groupby('asset').resample('D').ffill()  # upsample each asset to daily rows
    df = df.reset_index()  # <-- added this line here
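
To illustrate what the groupby/resample step produces, a standalone toy example (made-up values, not the real queryset):

    import pandas as pd

    toy = pd.DataFrame(
        {'asset': ['a1', 'a1'], 'transaction_amount': [10.0, 25.0]},
        index=pd.DatetimeIndex(['2019-10-08', '2019-10-10'], name='transaction_date'))
    print(toy.groupby('asset')['transaction_amount'].resample('D').ffill())
    # asset  transaction_date
    # a1     2019-10-08          10.0
    #        2019-10-09          10.0
    #        2019-10-10          25.0
    # Name: transaction_amount, dtype: float64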

--- edit below ---

Then create a dataframe from HistoricalPrices and merge it with the ledger. You might have to adjust the merge criteria to ensure you are getting what you want, but I think this is the correct path.

# edit

    ledger = df
    prices = (HistoricalPrices
              .objects
              .annotate(transaction_date=F('date__date'))
              .values('asset', 'price', 'transaction_date'))

    prices = pd.DataFrame(list(prices))
    result = ledger.merge(prices, how='left', on=['asset', 'transaction_date'])

Depending on how you are using the data later: if you need a list of dicts, which is a convenient format for Django templates, you can do that conversion with df.to_dict(orient='records').
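
For example (illustrative output only; the field values depend on your data):

    rows = result.to_dict(orient='records')
    # [{'asset': 1, 'transaction_date': ..., 'transaction_amount': 10.0, 'price': 5.0}, ...]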


7 Comments

Thanks! Let me check it out with the data and I'll let you know what comes out of it
Thanks for the method. Unfortunately I ran into 2 issues: 1) with a complex data set (lots of transactions across assets and dates), I get the error "cannot reindex a non-unique index with a method or limit", and 2) there seems to be an off-by-one/lag issue, as the first value for asset_1 (on 2015-01-01) is np.NaN and the value for asset_2 on 2019-10-10 is 15 (whereas 18 is expected, since 3 more units were added on that same date). I managed to get the right totals and dates by tweaking the code a bit - see the new edit on my post for the changes and output
Is the "cannot reindex" error a pandas issue? You have multiple dates with the same assets? Is there another classifier? Try swapping the fills, bfill or ffill. You can also change the df.index sort and then adjust the bfill or ffill. Let me know if that works.
Thanks! I have been tinkering with your code and found a solution! I will need to test it more, but for now it seems to work and scale very well (e.g. more dates and more transactions don't add to the execution time of about 500 ms, which stays fairly constant, give or take a few ms, whereas the old method could go up to 1.5 s when scaling). Once I have fully tested it I will update the solution! If you want, you can include it in your response, which I am tagging as correct.
Great. I would love to see the solution.

If you want to group your Ledgers by date and then calculate the daily asset amount:

Ledger.objects.values('date__date').annotate(total_amount=Sum('amount'))

this should help (edit: fix typo)

second edit: assuming you want to group them by asset as well:

Ledger.objects.values('date__date', 'asset').annotate(total_amount=Sum('amount'))

3 Comments

This wouldn't separate by asset though, would it? Each asset has a different price, so I need to calculate the balance of each asset per date
Updated the answer! hope that helps
Thanks! But this only gives me the total amount of Ledger entries for that specific date (e.g. 300 units of asset_1 bought on 2019-01-01; if 600 more were bought the week prior, those won't be captured). I suppose I could dump this in a pandas DataFrame and do a cumulative sum; that might make it faster.
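
A minimal sketch of that cumulative-sum idea from the last comment (my illustration, not code from the thread): feed the per-date sums into pandas and let cumsum produce running balances per asset:

import pandas as pd
from django.db.models import Sum

daily = (Ledger.objects
         .values('date__date', 'asset')
         .annotate(total_amount=Sum('amount'))
         .order_by('date__date'))

df = pd.DataFrame(list(daily))
# Running balance per asset across dates
df['balance'] = df.groupby('asset')['total_amount'].cumsum()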
