1

I have a dataframe with multiple columns. There's a column named 'remaining_lease' which is 75% NaN. I don't want to drop that column, so I want to calculate 'remaining_lease' from two other columns, 'lease_commence_date' and 'current_year'. The formula is:

remaining_lease = 99 - (current_year - lease_commence_date)

For example: current_year = 2022 and lease_commence_date = 1979

then remaining_lease = 99 - (2022 - 1979) = 56

I have written a function in order to do so.

import math

def remaining_lease_year(x, current_year, commense_year):
    if math.isnan(x): # if the value is nan
        lease_year = 99 - (current_year - commense_year)
        return lease_year
    else: #if the value is not nan
        return x

df['remaining_lease'] = df['remaining_lease'].apply(lambda x: remaining_lease_year(x, df['current_year'], df['lease_commence_date']))

But I am getting an error:

MemoryError: Unable to allocate 7.08 MiB for an array with shape (927465,) and data type int64

Is there any other way to achieve it?

  • 7 MiB is not much. Are you running this on an old computer or an embedded system? Commented Jun 14, 2024 at 6:41
  • I have a high-end PC, but I don't know why I am getting this error. Commented Jun 14, 2024 at 6:43
  • Maybe you should show the full error message so we can try to resolve the problem. Commented Jun 14, 2024 at 12:10

4 Answers

1

You can use the pandas Series.mask method, which is more efficient than using `apply` with a function. A simple example:

import pandas as pd

df = pd.DataFrame({
    'current_year': [2022, 2023, 2022, 2023, 2022, 2023],
    'lease_commence_date': [1969, 1974, 2000, 1998, 1983, 1992],
    'remaining_lease': [50, None, 60, 55, None, None],
})

# Where remaining_lease is NaN, replace it with the computed value
computed = 99 - (df['current_year'] - df['lease_commence_date'])
df['remaining_lease'] = (df['remaining_lease']
                         .mask(df['remaining_lease'].isna(), computed)
                         .astype(int))

print(df)

which gives:

   current_year  lease_commence_date  remaining_lease
0          2022                 1969               50
1          2023                 1974               50
2          2022                 2000               60
3          2023                 1998               55
4          2022                 1983               60
5          2023                 1992               68

The cast to int is needed because the None/NaN values force the column to a float dtype, and it stays float even after the gaps are filled.
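A minimal sketch of that dtype behaviour, using a made-up three-value Series rather than the real data. The exact integer width after the cast can depend on the platform and library versions, so it is not shown here:

```python
import pandas as pd

# A column containing None is stored as float64, with None converted to NaN
s = pd.Series([50, None, 60])
print(s.dtype)

# Filling the gap does not change the dtype; it is still float
filled = s.fillna(0)
print(filled.dtype)

# An explicit cast is what turns the values back into integers
as_int = filled.astype(int)
print(as_int.dtype)
```

If the column could still contain NaN after filling, astype(int) would raise an error, so the cast only makes sense once every gap has a value.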


1

You can use a vectorized pandas column operation with fillna, which only fills the missing entries:

df['remaining_lease'] = df['remaining_lease'].fillna(99 - (df["current_year"] - df["lease_commence_date"]))
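Here is a small runnable sketch of the fillna approach. The three-row frame is made up for illustration, and it assumes the column is named lease_commence_date as in the question's code:

```python
import pandas as pd

# Made-up sample data; the real frame has about 927k rows
df = pd.DataFrame({
    "current_year": [2022, 2023, 2022],
    "lease_commence_date": [1969, 1974, 1983],
    "remaining_lease": [50, None, None],
})

# Compute the formula for every row at once (one Series, no Python loop)
computed = 99 - (df["current_year"] - df["lease_commence_date"])

# fillna aligns on the index and uses computed only where the value is NaN
df["remaining_lease"] = df["remaining_lease"].fillna(computed)

print(df["remaining_lease"].tolist())  # [50.0, 50.0, 60.0]
```

The existing value in the first row is kept, and only the two missing rows take the computed result.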


0

The memory spike comes from an N**2 overhead: the code tries to store a whole column in every cell.

The problem is that df['remaining_lease'].apply is a Series.apply, which goes over each entry in that column. No problem so far, but df['current_year'] and df['lease_commence_date'] in the lambda are whole columns, the same ones on every call.
So in remaining_lease_year, lease_year = 99 - (current_year - commense_year) is a Series, not a single value. For each NaN entry, apply then tries to store a whole Series of len(df) as that cell's result, and these accumulate until no memory is left.
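To see this on a small scale, here is a sketch that calls the question's function once, passing whole columns the way the lambda does. The three-row frame is made up for illustration:

```python
import math
import pandas as pd

def remaining_lease_year(x, current_year, commense_year):
    if math.isnan(x):
        return 99 - (current_year - commense_year)
    return x

# Made-up sample data standing in for the real frame
df = pd.DataFrame({
    "current_year": [2022, 2023, 2022],
    "lease_commence_date": [1969, 1974, 1983],
})

# One call for one NaN cell, with whole columns as the other arguments
result = remaining_lease_year(float("nan"), df["current_year"], df["lease_commence_date"])

# The "single value" for that cell is a full-length Series
print(type(result).__name__, len(result))
```

With the real frame, every such result has 927465 entries, and one is produced for each missing cell.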


One Series is about 927465 * 8 / 1_000_000 = 7.4 MB = 7.08 MiB. Multiply that by 927465 rows and you get about 6.9 TB (an upper bound, since only the NaN rows produce a Series).
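The estimate can be reproduced with plain arithmetic. This sketch assumes 8 bytes per int64 value and ignores the index and per-object overhead, so it is a rough figure only:

```python
n_rows = 927_465        # length of the array in the error message
bytes_per_value = 8     # one int64

# Size of one full-length column of int64 values
series_bytes = n_rows * bytes_per_value
series_mib = series_bytes / 2**20
print(round(series_mib, 2))   # 7.08, matching the MemoryError message

# Worst case: one such Series stored for every row
total_tb = series_bytes * n_rows / 1e12
print(round(total_tb, 2))     # 6.88
```

The 7.08 MiB in the error message is therefore just the single allocation that failed, not the total amount of memory the code was trying to use.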


What you can do instead is:

import math

def remaining_lease_year(row):
    x = row["remaining_lease"]
    if math.isnan(x):  # value is missing: compute it from this row's years
        return 99 - (row["current_year"] - row["lease_commence_date"])
    return x  # value is present: keep it

# axis=1 passes one row at a time, so each lookup is a scalar
df['remaining_lease'] = df.apply(remaining_lease_year, axis=1)
# Output with the values provided by user19077881:
# df['remaining_lease']:
# 0    50.0
# 1    50.0
# 2    60.0
# 3    55.0
# 4    60.0
# 5    68.0


0

For a one-line solution, you can use this (it needs import math):

df['remaining_lease'] = df.apply(lambda x: 99 - (x.current_year - x.lease_commence_date) if math.isnan(x.remaining_lease) else x.remaining_lease, axis=1)

1 Comment

Will try this one
