Pandas Python: Merging every two rows in one dataframe

Question

How do I get from

Idx            A B C
2004-04-01     1 1 0
2004-04-02     1 1 0
2004-05-01     0 0 0
2004-05-02     0 0 0

to

Idx            A B C
2004-04        2 2 0
2004-05        0 0 0

Notes: How do I collapse every two rows into one and, at the same time, reduce the index to just the year and month?

Is using a rolling mean the best way?

UPDATE - I simplified the example above, but unutbu's answer does not seem to work on my actual data:

                       Time      A   B
1    2004-01-04 - 2004-01-10     0   0
2    2004-01-11 - 2004-01-17     0   0
3    2004-01-18 - 2004-01-24     0   0
4    2004-01-25 - 2004-01-31     0   0
5    2004-02-01 - 2004-02-07     56  0
6    2004-02-08 - 2004-02-14     67  0

Do you want to group by the year and month, or just somehow merge every two rows regardless of the Idx values? What should the merged Idx value be if the year and months differ?
– unutbu, Commented Apr 24, 2014 at 19:05
I just want to merge every two rows. However, could you also show an implementation that does consider the row values, for learning's sake? The year and month values shouldn't differ; they are the same for every two rows. If they do differ, you can default to the later year and month value.
– user3314418, Commented Apr 24, 2014 at 19:10

unutbu · Accepted Answer · 2014-04-24 19:58:30Z


You can aggregate rows using a groupby/sum operation:

import pandas as pd
import numpy as np

# Sample data: two rows per month, with Idx parsed into datetimes
df = pd.DataFrame([('2004-04-01', 1, 1, 0), ('2004-04-02', 1, 1, 0),
                   ('2004-05-01', 0, 0, 0), ('2004-05-02', 0, 0, 0)],
                  columns=['Idx', 'A', 'B', 'C'])
df['Idx'] = pd.to_datetime(df['Idx'])

You could group by the year and month:

# numeric_only=True keeps the datetime Idx column out of the sum
print(df.groupby([d.strftime('%Y-%m') for d in df['Idx']]).sum(numeric_only=True))
#          A  B  C
# 2004-04  2  2  0
# 2004-05  0  0  0

# [2 rows x 3 columns]

Or, group by every two rows:

# Labels 0, 0, 1, 1, ... pair up consecutive rows; numeric_only=True skips Idx
result = df.groupby(np.arange(len(df))//2).sum(numeric_only=True)
result.index = df.loc[1::2, 'Idx']
print(result)
#             A  B  C
# Idx                
# 2004-04-02  2  2  0
# 2004-05-02  0  0  0

# [2 rows x 3 columns]

Note: df.loc[1::2, 'Idx'] was used, instead of df.loc[::2, 'Idx'] so the Idx for the aggregated rows would correspond to the second date, not the first, in each group.
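To make the difference between the two slices concrete, here is a small sketch on a standalone Series (the names `idx`, `first_of_pair` and `second_of_pair` are illustrative, not from the answer). It assumes the default integer index 0, 1, 2, ..., which is what the example frame has; with a different index, `.iloc[1::2]` would be the positional equivalent.

```python
import pandas as pd

# Stand-in for df['Idx'] with the default 0..3 integer index
idx = pd.Series(pd.to_datetime(
    ['2004-04-01', '2004-04-02', '2004-05-01', '2004-05-02']))

first_of_pair = idx.loc[::2]    # rows labelled 0 and 2
second_of_pair = idx.loc[1::2]  # rows labelled 1 and 3

print(first_of_pair.dt.strftime('%Y-%m-%d').tolist())
# ['2004-04-01', '2004-05-01']
print(second_of_pair.dt.strftime('%Y-%m-%d').tolist())
# ['2004-04-02', '2004-05-02']
```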

If you want just the year and month, then you could use this list comprehension to set the index:

result.index = [d.strftime('%Y-%m') for d in df.loc[1::2, 'Idx']]
print(result)
#          A  B  C
# 2004-04  2  2  0
# 2004-05  0  0  0

# [2 rows x 3 columns]

However, when dealing with dates, a DatetimeIndex is more useful than a string-valued index, because it supports date-based slicing, resampling and arithmetic. So you might want to retain the DatetimeIndex, do most of your work with it, and just convert to a year-month string at the end for presentation purposes...
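One way to follow that advice, sketched under the assumption that the sample frame looks like the one in the question: group on a monthly period built with `Series.dt.to_period('M')` (a pandas accessor that the answer itself does not use), and format the index as strings only when printing. The names `monthly` and `display` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [('2004-04-01', 1, 1, 0), ('2004-04-02', 1, 1, 0),
     ('2004-05-01', 0, 0, 0), ('2004-05-02', 0, 0, 0)],
    columns=['Idx', 'A', 'B', 'C'])
df['Idx'] = pd.to_datetime(df['Idx'])

# Group on a monthly Period; the result keeps a date-aware PeriodIndex.
monthly = df.groupby(df['Idx'].dt.to_period('M'))[['A', 'B', 'C']].sum()

# Convert to 'YYYY-MM' strings only for presentation.
display = monthly.copy()
display.index = monthly.index.strftime('%Y-%m')
print(display)
```

Keeping `monthly` around means later steps can still slice or compare by date, while `display` is only used for output.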

Regarding the UPDATED question:

import pandas as pd
import numpy as np

# Sample of the updated data: one row per week, Time holds "start - end"
data = np.rec.array([('2004-01-04 - 2004-01-10', 0, 0),
       ('2004-01-11 - 2004-01-17', 0, 0),
       ('2004-01-18 - 2004-01-24', 0, 0),
       ('2004-01-25 - 2004-01-31', 0, 0),
       ('2004-02-01 - 2004-02-07', 56, 0),
       ('2004-02-08 - 2004-02-14', 67, 0)],
      dtype=[('Time', 'O'), ('A', '<i8'), ('B', '<i8')])
df = pd.DataFrame(data)

Having one Time column holding two dates makes data manipulation more difficult. It would be better to have two datetime columns, Start and End:

df[['Start', 'End']] = df['Time'].str.extract('(?P<Start>.+) - (?P<End>.+)')
del df['Time']
df['Start'] = pd.DatetimeIndex(df['Start'])
df['End'] = pd.DatetimeIndex(df['End'])

Then you could group by the Start column:

# numeric_only=True keeps the datetime Start and End columns out of the sum
print(df.groupby([d.strftime('%Y-%m') for d in df['Start']]).sum(numeric_only=True))
#            A  B
# 2004-01    0  0
# 2004-02  123  0

# [2 rows x 2 columns]

Or group by every two rows, essentially the same as before:

# Pair up consecutive rows; numeric_only=True skips the Start and End columns
result = df.groupby(np.arange(len(df))//2).sum(numeric_only=True)
result.index = df.loc[1::2, 'Start']
print(result)
#               A  B
# Start             
# 2004-01-11    0  0
# 2004-01-25    0  0
# 2004-02-08  123  0

# [3 rows x 2 columns]

edited Apr 24, 2014 at 19:58

answered Apr 24, 2014 at 19:22

unutbu


5 Comments

user3314418 Over a year ago

hm... I keep getting: AssertionError: Grouper and axis must be same length. updated my post to show you my actual data...

unutbu Over a year ago

You would get that error if np.arange(len(df))//2 did not have the same length as df itself. Here's a wild guess: Make sure you have np.arange(len(df))//2 and not np.arange(len(df)//2) for example. (Note the parentheses.)
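The effect of the parenthesis placement can be checked in isolation. This is a small sketch where `n` stands in for `len(df)`; the variable names are illustrative.

```python
import numpy as np

n = 6  # stand-in for len(df)

# Integer-divide each row position by 2: one group label per row.
per_row = np.arange(n) // 2
print(per_row)         # [0 0 1 1 2 2]
print(len(per_row))    # 6, the same length as the frame

# Misplaced parenthesis: the length is halved first, then counted.
too_short = np.arange(n // 2)
print(too_short)       # [0 1 2]
print(len(too_short))  # 3, which no longer matches the number of rows
```

Only the first array has one label per row, which is what groupby needs when it is handed an array of keys.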

user3314418 Over a year ago

i wish i could upvote you five times. you are a master. if I wanted to get better at groupby and its associated functions, should I go through python for data analysis? i find the documentation a little hard to get through

user3314418 Over a year ago

one clarification: I know the d.strftime('%Y-%m') list comprehension gives us a list of the year and month. It's unclear to me what it means when you put that into the groupby?

unutbu Over a year ago

The groupby method can accept an amazing assortment of arguments as input. Usually you just pass a list of the columns you wish to group by. In this case, however, a sequence of values is being passed. These values are used as proxy values for each row: rows which have the same proxy value get grouped together. I haven't read "Python for Data Analysis", but it is written by the creator of Pandas, and looks to be highly recommended. There's no royal road to learning, and everything you learn will make the next step a little easier. So just keep plugging away and you'll get there!
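As a rough illustration of the proxy-value idea, the sketch below passes a plain Python list of keys to groupby. The frame and the `keys` list are made up for the example; the only requirement is that there is one key per row, in row order.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [1, 1, 0, 0]})

# One key per row; the keys do not have to be a column of df.
keys = ['2004-04', '2004-04', '2004-05', '2004-05']

# Rows that share a key are summed together.
result = df.groupby(keys).sum()
print(result)
```

The list comprehension over `d.strftime('%Y-%m')` in the answer produces exactly this kind of list, one year-month string per row.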
