
How do I get from

Idx            A B C
2004-04-01     1 1 0
2004-04-02     1 1 0
2004-05-01     0 0 0
2004-05-02     0 0 0

to

Idx            A B C
2004-04        2 2 0
2004-05        0 0 0

Note: How do I collapse both the index (so that it shows only the year and month) and every two rows into one?

Is using a rolling mean the best way?

UPDATE - I made the above version simple, but unutbu's answer does not seem to work

                       Time      A   B
1    2004-01-04 - 2004-01-10     0   0
2    2004-01-11 - 2004-01-17     0   0
3    2004-01-18 - 2004-01-24     0   0
4    2004-01-25 - 2004-01-31     0   0
5    2004-02-01 - 2004-02-07     56  0
6    2004-02-08 - 2004-02-14     67  0
  • Do you want to group by the year and month, or just somehow merge every two rows regardless of the Idx values? What should the merged Idx value be if the year and months differ? Commented Apr 24, 2014 at 19:05
  • I want to just merge every two rows. *however, could you also show an implementation that does consider row value, for learning's sake? the year and month values shouldn't differ - they are same for every two rows. if they do differ, you can default it to the later year and month value Commented Apr 24, 2014 at 19:10

1 Answer


You can aggregate rows using a groupby/sum operation:

import pandas as pd
import numpy as np

df = pd.DataFrame([('2004-04-01', 1, 1, 0), ('2004-04-02', 1, 1, 0),
                   ('2004-05-01', 0, 0, 0), ('2004-05-02', 0, 0, 0)],
                  columns=['Idx', 'A', 'B', 'C'])
df['Idx'] = pd.DatetimeIndex(df['Idx'])

You could group by the year and month:

# numeric_only=True keeps the datetime Idx column out of the sum
print(df.groupby([d.strftime('%Y-%m') for d in df['Idx']]).sum(numeric_only=True))
#          A  B  C
# 2004-04  2  2  0
# 2004-05  0  0  0

# [2 rows x 3 columns]

Or, group by every two rows:

# numeric_only=True keeps the datetime Idx column out of the sum
result = df.groupby(np.arange(len(df))//2).sum(numeric_only=True)
result.index = df.loc[1::2, 'Idx']
print(result)
#             A  B  C
# Idx                
# 2004-04-02  2  2  0
# 2004-05-02  0  0  0

# [2 rows x 3 columns]

Note: df.loc[1::2, 'Idx'] was used instead of df.loc[::2, 'Idx'] so that the Idx for each aggregated row corresponds to the second date in its group, not the first.
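
To make the pairing concrete, here is a minimal sketch of what the group keys and the labels look like for the four-row example. The names keys, labels and pair_sums are only illustrative, and the frame is rebuilt here so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Idx': pd.to_datetime(['2004-04-01', '2004-04-02',
                           '2004-05-01', '2004-05-02']),
    'A': [1, 1, 0, 0],
    'B': [1, 1, 0, 0],
    'C': [0, 0, 0, 0],
})

# One integer key per row: rows 0 and 1 share key 0, rows 2 and 3 share key 1.
keys = np.arange(len(df)) // 2
print(keys.tolist())  # [0, 0, 1, 1]

# Rows 1, 3, ... hold the second date of each pair.
labels = df.loc[1::2, 'Idx']
print(labels.dt.strftime('%Y-%m-%d').tolist())  # ['2004-04-02', '2004-05-02']

# Sum only the value columns, then relabel each group with its second date.
pair_sums = df[['A', 'B', 'C']].groupby(keys).sum()
pair_sums.index = pd.DatetimeIndex(labels)
```

One caveat: with an odd number of rows the last group contains a single row, and df.loc[1::2, 'Idx'] yields one label fewer than there are groups, so the index assignment would fail with a length mismatch.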

If you want just the year and month, then you could use this list comprehension to set the index:

result.index = [d.strftime('%Y-%m') for d in df.loc[1::2, 'Idx']]
print(result)
#          A  B  C
# 2004-04  2  2  0
# 2004-05  0  0  0

# [2 rows x 3 columns]

However, it's more powerful to have a DatetimeIndex for the index rather than a string-valued index when dealing with dates. So you might want to retain the DatetimeIndex, do most of your work with the DatetimeIndex, and just convert to a year-month string at the end for presentation purposes...
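
As a rough sketch of that workflow (not the approach used above, and assuming a pandas version that provides the .dt accessor), you could group on a monthly Period key and only turn the index into strings when you print. The names monthly and display are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Idx': pd.to_datetime(['2004-04-01', '2004-04-02',
                           '2004-05-01', '2004-05-02']),
    'A': [1, 1, 0, 0],
    'B': [1, 1, 0, 0],
    'C': [0, 0, 0, 0],
})

# Group on a monthly Period key, so the result index stays date-aware.
monthly = df.groupby(df['Idx'].dt.to_period('M'))[['A', 'B', 'C']].sum()

# Convert to year-month strings only for presentation.
display = monthly.copy()
display.index = monthly.index.astype(str)
print(display)
```

Here monthly keeps a date-aware index you can continue to work with, while display is a throwaway copy for output.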


Regarding the UPDATED question:

import pandas as pd
import numpy as np

data = np.rec.array([('2004-01-04 - 2004-01-10', 0, 0),
       ('2004-01-11 - 2004-01-17', 0, 0),
       ('2004-01-18 - 2004-01-24', 0, 0),
       ('2004-01-25 - 2004-01-31', 0, 0),
       ('2004-02-01 - 2004-02-07', 56, 0),
       ('2004-02-08 - 2004-02-14', 67, 0)],
      dtype=[('Time', 'O'), ('A', '<i8'), ('B', '<i8')])
df = pd.DataFrame(data)

Having one Time column holding two dates makes data manipulation more difficult. It would be better to have two datetime columns, Start and End:

df[['Start', 'End']] = df['Time'].str.extract('(?P<Start>.+) - (?P<End>.+)')
del df['Time']
df['Start'] = pd.DatetimeIndex(df['Start'])
df['End'] = pd.DatetimeIndex(df['End'])

Then you could group by the Start column:

# numeric_only=True keeps the datetime Start and End columns out of the sum
print(df.groupby([d.strftime('%Y-%m') for d in df['Start']]).sum(numeric_only=True))
#            A  B
# 2004-01    0  0
# 2004-02  123  0

# [2 rows x 2 columns]

Or group by every two rows, essentially the same as before:

# numeric_only=True keeps the datetime Start and End columns out of the sum
result = df.groupby(np.arange(len(df))//2).sum(numeric_only=True)
result.index = df.loc[1::2, 'Start']
print(result)
#               A  B
# Start             
# 2004-01-11    0  0
# 2004-01-25    0  0
# 2004-02-08  123  0

# [3 rows x 2 columns]

5 Comments

hm... I keep getting: AssertionError: Grouper and axis must be same length. updated my post to show you my actual data...
You would get that error if np.arange(len(df))//2 did not have the same length as df itself. Here's a wild guess: Make sure you have np.arange(len(df))//2 and not np.arange(len(df)//2) for example. (Note the parentheses.)
i wish i could upvote you five times. you are a master. if I wanted to get better at groupby and its associated functions, should I go through python for data analysis? i find the documentation a little hard to get through
one clarification: I know the d.strftime('%Y-%m') list comprehension gives us a list of the year and month. It's unclear to me what it means when you put that into the groupby?
The groupby method can accept an amazing assortment of arguments as input. Usually you just pass a list of the columns you wish to group by. In this case, however, a sequence of values is being passed. These values are being used as proxy values for each row. Rows which have the same proxy value get grouped together. I haven't read "Python for Data Analysis", but it is written by the creator of Pandas, and looks to be highly recommended. There's no royal road to learning, and everything you learn will make the next step a little easier. So just keep plugging away and you'll get there!
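
A small sketch of the proxy-value idea from the comments above, using a made-up one-column frame. The names proxy, summed, good and bad are illustrative only, and the key is wrapped in a NumPy array here purely to keep it unambiguous:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 0, 0]})

# One proxy value per row; rows with equal proxy values land in the same group.
proxy = np.array(['2004-04', '2004-04', '2004-05', '2004-05'])
summed = df.groupby(proxy)['A'].sum()
print(summed.index.tolist())  # ['2004-04', '2004-05']
print(summed.tolist())        # [2, 0]

# The key must have exactly one entry per row, so the parentheses matter:
good = np.arange(len(df)) // 2   # length 4, one key per row
bad = np.arange(len(df) // 2)    # length 2, cannot line up with 4 rows
print(len(good), len(bad))       # 4 2
```

Only a key like good, with the same length as the frame, can be matched against its rows. A shorter key like bad is the kind of mismatch behind the "Grouper and axis must be same length" error mentioned earlier in the comments.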
