
I have the following DataFrame:

                              P     N  ID  Year  Month
TS                                                    
2016-06-26 19:30:00  263.600006   5.4   5  2016      6
2016-06-26 20:00:00  404.700012   5.6   5  2016      6
2016-06-26 21:10:00  438.600006   6.0   5  2016      6
2016-06-26 21:20:00  218.600006   5.6   5  2016      6
2016-07-02 16:10:00  285.300049  15.1   5  2016      7

I'm trying to add a new column based on the values of the Year and Month columns, something like the following:

import calendar

def exp_records(row):
    # monthrange returns (weekday of first day, number of days in month)
    return calendar.monthrange(row['Year'], row['Month'])[1]

df['exp_counts'] = df.apply(exp_records, axis=1)

But I'm getting the following error:

TypeError: ('integer argument expected, got float', 'occurred at index 2016-06-26 19:30:00')

However, if I reset_index() to an integer index, the above .apply() works fine. Is this the expected behavior?

I'm using pandas 0.19.1 with Python 3.4.


Code to recreate the DataFrame:

s = '''
TS,P,N,ID,Year,Month
2016-06-26 19:30:00,263.600006,5.4,5,2016,6
2016-06-26 20:00:00,404.700012,5.6,5,2016,6
2016-06-26 21:10:00,438.600006,6.0,5,2016,6
2016-06-26 21:20:00,218.600006,5.6,5,2016,6
2016-07-02 16:10:00,285.300049,15.1,5,2016,7
'''

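import io
import pandas as pd

# io.StringIO is used here because pd.compat.StringIO is not available in newer pandas versions
df = pd.read_csv(io.StringIO(s), index_col=0, parse_dates=True)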
  • Interestingly, when the index is a datetime index, loc, iloc, etc. return floats even when the column type is integer. That's probably a bug. You can change row['Year'] to int(row['Year']) as a workaround (and the same for month, of course). Or you can simply use df.index.days_in_month. Commented Jan 24, 2017 at 16:33
  • @ayhan: Also, when I tested the given data set with all int dtypes, it returned the values correctly. But when one of the columns was changed to float, all columns were coerced to float (supplying reduce=False didn't help either). That's why it complained that it expected int inputs. Moreover, this isn't specific to datetime; even integer indices show similar behavior. Commented Jan 24, 2017 at 16:43
  • @NickilMaveli Yes I was also experimenting with float indices and I see the same issue with DataFrames having columns with different dtypes. Commented Jan 24, 2017 at 16:47
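
The two workarounds mentioned in the comments can be sketched as follows. This is a minimal illustration on the question's sample data, not a definitive implementation; the names csv_text, via_apply and via_index are made up for the example. Note that days_in_month is derived from the datetime index itself, so it only matches the apply result on the assumption that the Year and Month columns agree with the index.

```python
import calendar
import io

import pandas as pd

# Sample data from the question (csv_text is a name chosen for this sketch).
csv_text = """TS,P,N,ID,Year,Month
2016-06-26 19:30:00,263.600006,5.4,5,2016,6
2016-06-26 20:00:00,404.700012,5.6,5,2016,6
2016-06-26 21:10:00,438.600006,6.0,5,2016,6
2016-06-26 21:20:00,218.600006,5.6,5,2016,6
2016-07-02 16:10:00,285.300049,15.1,5,2016,7
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)


def exp_records(row):
    # Workaround 1: cast back to int, because the row arrives as float64.
    return calendar.monthrange(int(row['Year']), int(row['Month']))[1]


via_apply = df.apply(exp_records, axis=1)

# Workaround 2: read the number of days straight from the DatetimeIndex.
# Assumes the Year and Month columns agree with the index.
via_index = df.index.days_in_month

print(via_apply.tolist())  # [30, 30, 30, 30, 31]
print(via_index.tolist())  # [30, 30, 30, 30, 31]
```

The second variant avoids calling a Python function once per row, which is probably why it is reported as faster.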

1 Answer


Solution

Use df[['Year', 'Month']] for apply:

df['exp_counts'] = df[['Year', 'Month']].apply(exp_records, axis=1)

Result:

                              P     N  ID  Year  Month  exp_counts
TS                                                                
2016-06-26 19:30:00  263.600006   5.4   5  2016      6          30
2016-06-26 20:00:00  404.700012   5.6   5  2016      6          30
2016-06-26 21:10:00  438.600006   6.0   5  2016      6          30
2016-06-26 21:20:00  218.600006   5.6   5  2016      6          30
2016-07-02 16:10:00  285.300049  15.1   5  2016      7          31

Reason

While your Year and Month columns are integer:

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5 entries, 2016-06-26 19:30:00 to 2016-07-02 16:10:00
Data columns (total 5 columns):
P        5 non-null float64
N        5 non-null float64
ID       5 non-null int64
Year     5 non-null int64
Month    5 non-null int64
dtypes: float64(2), int64(3)
memory usage: 240.0 bytes

You access them by row, and a row that spans both float and integer columns is converted to floats:

df.T.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, P to Month
Data columns (total 5 columns):
2016-06-26 19:30:00    5 non-null float64
2016-06-26 20:00:00    5 non-null float64
2016-06-26 21:10:00    5 non-null float64
2016-06-26 21:20:00    5 non-null float64
2016-07-02 16:10:00    5 non-null float64
dtypes: float64(5)
memory usage: 240.0+ bytes

Since df.apply(exp_records, axis=1) goes row by row, each row is passed in as a Series with a single dtype, so the integer values are converted to float along with the rest.

This is what you get in exp_records for row:

P         263.600006
N           5.400000
ID          5.000000
Year     2016.000000
Month       6.000000
Name: 2016-06-26T19:30:00.000000000, dtype: float64

Creating a DataFrame with only the columns Year and Month does not cause a conversion to float, because both columns are integers:

df[['Year', 'Month']].T.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Year to Month
Data columns (total 5 columns):
2016-06-26 19:30:00    2 non-null int64
2016-06-26 20:00:00    2 non-null int64
2016-06-26 21:10:00    2 non-null int64
2016-06-26 21:20:00    2 non-null int64
2016-07-02 16:10:00    2 non-null int64
dtypes: int64(5)
memory usage: 96.0+ bytes
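
The same effect can be checked directly on a single row. The following is a small sketch using a two-row subset of the question's data; the names mixed_row and int_row are chosen only for this illustration.

```python
import pandas as pd

# Two rows of the question's data, with float and integer columns mixed.
df = pd.DataFrame({
    'P': [263.600006, 285.300049],
    'N': [5.4, 15.1],
    'ID': [5, 5],
    'Year': [2016, 2016],
    'Month': [6, 7],
})

# A row across float and integer columns has to fit into one dtype,
# so the integers are upcast to float.
mixed_row = df.iloc[0]
print(mixed_row.dtype)  # float64

# A row taken from integer-only columns keeps an integer dtype.
int_row = df[['Year', 'Month']].iloc[0]
print(pd.api.types.is_integer_dtype(int_row.dtype))  # True
```

This is why selecting only the integer columns before calling apply lets calendar.monthrange receive integers.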

1 Comment

Thank you for explaining; this has solved my problem. The solution proposed by @ayhan is significantly faster, but it is always good to investigate further.
