I'm practicing using apply with pandas DataFrames.

I've put together a simple DataFrame of dates and values:

import numpy as np
import pandas as pd

dates = pd.date_range('2013', periods=10)
values = list(np.arange(1, 11, 1))
DF = pd.DataFrame({'date': dates, 'value': values})

I have a second dataframe, which is made up of 3 rows of the original dataframe:

DFa = DF.iloc[[1,2,4]]

So, I'd like to take the date from each row of the second dataframe, DFa (using apply), and then sum the values in the original dataframe whose dates came earlier:

def foo(DFa, DF=DF):
    cutoff_date = DFa['date']
    ans=DF[DF['date'] < cutoff_date]

DFa.apply(foo, axis=1)

This runs without errors. My question is: since foo computes ans three times (once per row of DFa), how do I access those values?

Obviously I'm new to apply and I'm eager to get away from loops. I just don't understand how to return values from apply.

  • I don't think apply is best option for this. If I understand correctly why not DFa[DF.index].sum()? Commented Jun 11, 2015 at 0:56
  • I agree, it's a pretty lousy example. My main problem is trying to return from the apply. I would really like to see how I could return 3 different dataframes, and sum them up elsewhere (but I didn't mention that in the question appropriately). Commented Jun 11, 2015 at 1:56
  • That's okay, it's possible that groupby might be a better alternative to look into. You can specify groups for the 3 subsets then simply use the sum method on the resulting groupby object. Commented Jun 11, 2015 at 2:30
  • @MattO'Brien: The performance of DF.apply(func, axis=1) is comparable to calling func in a loop. apply is useful when you want to align the output into a single DataFrame. If you need to return 3 disparate DataFrames, go ahead and loop over DF.iterrows(). For better performance you'll have to think of a better way to calculate the result (such as doing a sorted cumsum for the toy example above) or perhaps use Cython. Commented Jun 11, 2015 at 11:42
  • wow @unutbu you just laid it down right there, thanks Commented Jun 11, 2015 at 16:32
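
The "sorted cumsum" idea from the comment above could look roughly like the sketch below. It is not the commenter's code, only one possible reading of the suggestion: the names DF_sorted, running_total, n_earlier and sums_before are made up for illustration, and np.searchsorted is an assumption about how to look up each cutoff in the sorted dates.

```python
import numpy as np
import pandas as pd

# Rebuild the toy data from the question.
dates = pd.date_range('2013', periods=10)
DF = pd.DataFrame({'date': dates, 'value': np.arange(1, 11)})
DFa = DF.iloc[[1, 2, 4]]

# Sort by date so a running total means "sum of values up to this row".
DF_sorted = DF.sort_values('date')
running_total = DF_sorted['value'].cumsum().to_numpy()

# For each cutoff date, count how many dates are strictly earlier.
# side='left' excludes rows whose date equals the cutoff.
n_earlier = np.searchsorted(
    DF_sorted['date'].to_numpy(), DFa['date'].to_numpy(), side='left'
)

# Prepend 0 so a cutoff with no earlier dates sums to 0.
padded = np.concatenate(([0], running_total))
sums_before = pd.Series(padded[n_earlier], index=DFa.index)
print(sums_before)
```

This avoids calling a Python function once per row, which is where the apply version spends most of its time on larger frames.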

2 Answers

Your function needs to return a value. E.g.,

def foo(df1, df2):
    cutoff_date = df1.date
    ans = df2[df2.date < cutoff_date].value.sum()
    return ans


DFa.apply(lambda x: foo(x, DF), axis=1)

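Also, note that apply collects the return value from each row into a single result: a Series when the function returns a scalar, as it does here. If the function instead returned the filtered DataFrame for each row in DFa, you would be nesting DataFrames inside that result, which is rarely what you want.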

1 Comment

So hypothetically, what if I actually wanted to return a dataframe of dataframes? DF_of_DFs = DFa.apply(lambda x: foo(x, DF), axis=1) doesn't seem to be appropriate...
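
One way to keep a separate DataFrame per row, in line with the iterrows suggestion in the comments on the question, is to collect them in a plain dict rather than through apply. This is only a sketch: the names earlier_rows and sums are hypothetical, and a dict keyed by DFa's row labels is just one possible container.

```python
import numpy as np
import pandas as pd

# Rebuild the toy data from the question.
dates = pd.date_range('2013', periods=10)
DF = pd.DataFrame({'date': dates, 'value': np.arange(1, 11)})
DFa = DF.iloc[[1, 2, 4]]

# Map each row label of DFa to the slice of DF with earlier dates.
earlier_rows = {
    label: DF[DF['date'] < row['date']]
    for label, row in DFa.iterrows()
}

# Each entry is an ordinary DataFrame, so it can be summed later.
sums = {label: sub['value'].sum() for label, sub in earlier_rows.items()}
print(sums)
```

Here earlier_rows[4] is the sub-frame for the row labelled 4 in DFa, and the sums are computed in a separate step from the filtering.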
There's a bit of a mix-up in the way you're using apply. With axis=1, foo is applied to each row (see the docs), and yet the name of its first parameter implies that it receives a DataFrame.

Additionally, you state that you want to sum the original DataFrame's values for the rows whose date is earlier than the cutoff date. So foo needs to do this and return the result.

So the code needs to look something like this:

def foo(row, DF=DF):
    cutoff_date = row['date']
    return DF[DF['date'] < cutoff_date].value.sum()

Once you make these changes, foo returns a scalar, so apply returns a Series:

>>> DFa.apply(foo, axis=1)
1     1
2     3
4    10
dtype: int64
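
To get at the values afterwards, assign the result of apply to a variable. The following is a self-contained sketch of the same approach; the names sum_before, full and result are only illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Rebuild the toy data from the question.
dates = pd.date_range('2013', periods=10)
DF = pd.DataFrame({'date': dates, 'value': np.arange(1, 11)})
DFa = DF.iloc[[1, 2, 4]]

def sum_before(row, full=DF):
    """Sum `value` over the rows of `full` dated before this row's date."""
    return full.loc[full['date'] < row['date'], 'value'].sum()

# apply returns a Series indexed like DFa; keep it in a variable.
result = DFa.apply(sum_before, axis=1)

print(result.loc[2])    # one value, looked up by DFa's row label
print(result.tolist())  # all values as a plain list
```

Because result shares DFa's index, it can also be lined up against DFa's rows by label.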
