I'm new to Python and am trying to get to grips with Pandas for data analysis.
I wondered if anyone can help me loop through rows of grouped data in a dataframe to create new variables.
Suppose I have a dataframe called data that looks like this:
+----+-----------+--------+
| ID | YearMonth | Status |
+----+-----------+--------+
|  1 |    201506 |      0 |
|  1 |    201507 |      0 |
|  1 |    201508 |      0 |
|  1 |    201509 |      0 |
|  1 |    201510 |      0 |
|  2 |    201506 |      0 |
|  2 |    201507 |      1 |
|  2 |    201508 |      2 |
|  2 |    201509 |      3 |
|  2 |    201510 |      0 |
|  3 |    201506 |      0 |
|  3 |    201507 |      1 |
|  3 |    201508 |      2 |
|  3 |    201509 |      3 |
|  3 |    201510 |      4 |
+----+-----------+--------+
There are multiple rows for each ID, YearMonth is of the form yyyymm, and Status is the status at each YearMonth (it takes values 0 to 6).
I have managed to create columns showing the cumulative maximum status and an Ever3 indicator (which shows whether an ID has ever had a status of 3 or more, regardless of its current status) like this:
import numpy as np
data['Max_Stat'] = data.groupby('ID')['Status'].cummax()
data['Ever3'] = np.where(data['Max_Stat'] >= 3, 1, 0)
What I would also like to do is create other columns holding metrics such as the number of times something has happened, or how long it has been since an event. For example:
Times3Plus : To show how many times the ID has had a status 3 or more at that point in time
Into3 : Set to Y the first time the ID has a status of 3 or more (not for subsequent times)
+----+-----------+--------+----------+-------+------------+-------+
| ID | YearMonth | Status | Max_Stat | Ever3 | Times3Plus | Into3 |
+----+-----------+--------+----------+-------+------------+-------+
|  1 |    201506 |      0 |        0 |     0 |          0 |       |
|  1 |    201507 |      0 |        0 |     0 |          0 |       |
|  1 |    201508 |      0 |        0 |     0 |          0 |       |
|  1 |    201509 |      0 |        0 |     0 |          0 |       |
|  1 |    201510 |      0 |        0 |     0 |          0 |       |
|  2 |    201506 |      0 |        0 |     0 |          0 |       |
|  2 |    201507 |      1 |        1 |     0 |          0 |       |
|  2 |    201508 |      2 |        2 |     0 |          0 |       |
|  2 |    201509 |      3 |        3 |     1 |          1 | Y     |
|  2 |    201510 |      0 |        3 |     1 |          1 |       |
|  3 |    201506 |      0 |        0 |     0 |          0 |       |
|  3 |    201507 |      1 |        1 |     0 |          0 |       |
|  3 |    201508 |      2 |        2 |     0 |          0 |       |
|  3 |    201509 |      3 |        3 |     1 |          1 | Y     |
|  3 |    201510 |      4 |        4 |     1 |          2 |       |
+----+-----------+--------+----------+-------+------------+-------+
I can do this quite easily in SAS, using BY and RETAIN statements, but can't work out how to replicate this in Python.
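For reference, one possible way to get the two columns described above without an explicit loop is a grouped cumulative sum over a boolean flag. This is only a sketch: it rebuilds the sample table by hand, assumes the rows should be sorted by ID and YearMonth first, and the helper name is_3plus is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question.
data = pd.DataFrame({
    "ID": [1] * 5 + [2] * 5 + [3] * 5,
    "YearMonth": [201506, 201507, 201508, 201509, 201510] * 3,
    "Status": [0, 0, 0, 0, 0,
               0, 1, 2, 3, 0,
               0, 1, 2, 3, 4],
})

# Running totals only make sense if rows are in time order within each ID.
data = data.sort_values(["ID", "YearMonth"]).reset_index(drop=True)

# Boolean flag per row: is the status 3 or more? (illustrative helper name)
is_3plus = data["Status"].ge(3)

# Times3Plus: running count of flagged rows within each ID,
# similar in spirit to a RETAINed counter reset at each BY group.
data["Times3Plus"] = is_3plus.astype(int).groupby(data["ID"]).cumsum()

# Into3: "Y" only on the row where the flag is set and the running count is 1,
# i.e. the first time the ID reaches a status of 3 or more.
data["Into3"] = np.where(is_3plus & data["Times3Plus"].eq(1), "Y", "")

print(data)
```

The flag is combined with the running count so that later rows where the count is still 1 but the status has dropped back below 3 (ID 2 in 201510) are not marked.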
transform method of a grouped Pandas dataframe: pandas.pydata.org/pandas-docs/stable/… groupby(...).transform(...)
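As a rough illustration of what that suggestion might look like on the sample data, transform applies a function to each group and returns a result aligned with the original rows. The column name Peak_Stat below is hypothetical and only there to show the broadcasting behaviour; the data is rebuilt by hand from the table in the question.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
data = pd.DataFrame({
    "ID": [1] * 5 + [2] * 5 + [3] * 5,
    "YearMonth": [201506, 201507, 201508, 201509, 201510] * 3,
    "Status": [0, 0, 0, 0, 0,
               0, 1, 2, 3, 0,
               0, 1, 2, 3, 4],
})
data = data.sort_values(["ID", "YearMonth"]).reset_index(drop=True)

grouped = data.groupby("ID")["Status"]

# A function that returns one value per row gives a per-group running result:
# here, the number of rows so far with a status of 3 or more.
data["Times3Plus"] = grouped.transform(lambda s: s.ge(3).cumsum())

# A reducing function (one value per group) is broadcast back to every row
# of that group. Peak_Stat is a hypothetical column name for illustration.
data["Peak_Stat"] = grouped.transform("max")

print(data)
```

For running quantities the built-in grouped methods such as cummax and cumsum are usually enough; transform is mainly useful when the per-group logic does not have a ready-made cumulative method.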