Iterate through rows of grouped pandas dataframe to create new columns

Question

I'm new to Python and am trying to get to grips with Pandas for data analysis.

I wondered if anyone can help me loop through rows of grouped data in a dataframe to create new variables.

Suppose I have a dataframe called data, that looks like this:

+----+-----------+--------+
| ID | YearMonth | Status |
+----+-----------+--------+
|  1 |    201506 |      0 |
|  1 |    201507 |      0 |
|  1 |    201508 |      0 |
|  1 |    201509 |      0 |
|  1 |    201510 |      0 |
|  2 |    201506 |      0 |
|  2 |    201507 |      1 |
|  2 |    201508 |      2 |
|  2 |    201509 |      3 |
|  2 |    201510 |      0 |
|  3 |    201506 |      0 |
|  3 |    201507 |      1 |
|  3 |    201508 |      2 |
|  3 |    201509 |      3 |
|  3 |    201510 |      4 |
+----+-----------+--------+

There are multiple rows for each ID. YearMonth is of the form yyyymm, and Status is the status at each YearMonth (it takes values 0 to 6).

I have managed to create columns showing the cumulative maximum status and an Ever3 indicator (which shows whether an ID has ever had a status of 3 or more, regardless of current status) like this:

data['Max_Stat'] = data.groupby(['ID'])['Status'].cummax()

data['Ever3'] = np.where(data['Max_Stat'] >= 3, 1, 0)
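
A minimal, self-contained sketch of these two lines, assuming the dataframe is called data and the grouping column is ID as in the sample table (only IDs 2 and 3 are rebuilt here to keep it short):

```python
import numpy as np
import pandas as pd

# Rows for IDs 2 and 3, copied from the sample table
data = pd.DataFrame({
    'ID': [2] * 5 + [3] * 5,
    'YearMonth': [201506, 201507, 201508, 201509, 201510] * 2,
    'Status': [0, 1, 2, 3, 0, 0, 1, 2, 3, 4],
})

# Running maximum of Status within each ID
data['Max_Stat'] = data.groupby('ID')['Status'].cummax()

# 1 once the running maximum has reached 3 or more, else 0
data['Ever3'] = np.where(data['Max_Stat'] >= 3, 1, 0)

max_stat = data['Max_Stat'].tolist()
ever3 = data['Ever3'].tolist()
```

Both columns depend on row order, so this assumes the rows are already sorted by YearMonth within each ID.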

What I would also like to do is create further columns for metrics such as the number of times something has happened, or how long it has been since an event. For example:

Times3Plus : To show how many times the ID has had a status 3 or more at that point in time

Into3 : Set to Y the first time the ID has a status of 3 or more (not for subsequent times)

+----+-----------+--------+----------+-------+------------+-------+
| ID | YearMonth | Status | Max_Stat | Ever3 | Times3Plus | Into3 |
+----+-----------+--------+----------+-------+------------+-------+
|  1 |    201506 |      0 |        0 |     0 |          0 |       |
|  1 |    201507 |      0 |        0 |     0 |          0 |       |
|  1 |    201508 |      0 |        0 |     0 |          0 |       |
|  1 |    201509 |      0 |        0 |     0 |          0 |       |
|  1 |    201510 |      0 |        0 |     0 |          0 |       |
|  2 |    201506 |      0 |        0 |     0 |          0 |       |
|  2 |    201507 |      1 |        1 |     0 |          0 |       |
|  2 |    201508 |      2 |        2 |     0 |          0 |       |
|  2 |    201509 |      3 |        3 |     1 |          1 | Y     |
|  2 |    201510 |      0 |        3 |     1 |          1 |       |
|  3 |    201506 |      0 |        0 |     0 |          0 |       |
|  3 |    201507 |      1 |        1 |     0 |          0 |       |
|  3 |    201508 |      2 |        2 |     0 |          0 |       |
|  3 |    201509 |      3 |        3 |     1 |          1 | Y     |
|  3 |    201510 |      4 |        4 |     1 |          2 |       |
+----+-----------+--------+----------+-------+------------+-------+

I can do this quite easily in SAS, using BY and RETAIN statements, but can't work out how to replicate this in Python.

See the transform method of a grouped Pandas dataframe: pandas.pydata.org/pandas-docs/stable/…
– attitude_stool, Commented Feb 16, 2016 at 2:31
Can you post a sample of your data and the expected results you'd like to see for that sample? In general, @attitude_stool is right. You probably want to use groupby(...).transform(...)
– Paul H, Commented Feb 16, 2016 at 2:54
Thanks, I have edited my question to include a sample of the data and the expected results.
– user5932720, Commented Feb 16, 2016 at 4:02

user5932720 · Accepted Answer · 2016-02-16 11:36:56Z

I have managed to do this without iterating over each row, as I'm not sure that what I was trying to do was possible. I had wanted to set up counters or indicators at group level, as is possible in SAS, and modify these row by row, e.g. something like:

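def count_times3plus(statuses):
    times3plus = 0  # counter retained across the rows of one ID
    counts = []
    for status in statuses:
        if status >= 3:
            times3plus += 1
        counts.append(times3plus)  # running count at this row
    return counts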

In the end, I created a binary 3Plus indicator

data['3Plus'] = np.where(data['Status'] >= 3, 1, 0)

Then I used groupby with cumsum to turn this into a running Times3Plus count within each ID

data['Times3Plus'] = data.groupby(['ID'])['3Plus'].cumsum()
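
As a self-contained sketch of these two steps, the snippet below rebuilds the sample data from the question and derives Times3Plus from it. The literal values are copied from the question's table. It assumes the rows are already sorted by YearMonth within each ID, since cumsum follows row order.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question
data = pd.DataFrame({
    'ID': [1] * 5 + [2] * 5 + [3] * 5,
    'YearMonth': [201506, 201507, 201508, 201509, 201510] * 3,
    'Status': [0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 1, 2, 3, 4],
})

# Binary indicator: 1 on rows where Status is 3 or more
data['3Plus'] = np.where(data['Status'] >= 3, 1, 0)

# Running count of 3Plus rows within each ID
data['Times3Plus'] = data.groupby('ID')['3Plus'].cumsum()

times3plus = data['Times3Plus'].tolist()
```

If the rows might not be in date order, sorting first with data.sort_values(['ID', 'YearMonth']) would be a sensible precaution.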

Into3 could then be populated using a function

def into3(row):
    # Flag only the first row on which the ID reaches a status of 3 or more
    if row['3Plus'] == 1 and row['Times3Plus'] == 1:
        return 1
    return None  # every other row is left empty (missing)

data['Into3'] = data.apply(into3, axis=1)
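
The same flag can probably be built without a row-wise apply. The sketch below is an alternative to the function above, not the original approach: it combines two boolean conditions with np.where and writes 'Y' to match the expected output in the question. The variable names is_3plus and times_3plus are made up for illustration, and only IDs 2 and 3 from the sample are rebuilt.

```python
import numpy as np
import pandas as pd

# Rows for IDs 2 and 3, copied from the sample table in the question
data = pd.DataFrame({
    'ID': [2] * 5 + [3] * 5,
    'Status': [0, 1, 2, 3, 0, 0, 1, 2, 3, 4],
})

# True on rows where Status is 3 or more
is_3plus = data['Status'] >= 3

# Running count of such rows within each ID
times_3plus = is_3plus.astype(int).groupby(data['ID']).cumsum()

# 'Y' only on the first 3-or-more row of each ID, empty string elsewhere
data['Into3'] = np.where(is_3plus & (times_3plus == 1), 'Y', '')

into3_flags = data['Into3'].tolist()
```

On larger frames this usually runs much faster than apply with axis=1, because the comparison happens on whole columns rather than one row at a time.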
