Modifying data in DataFrame based on date and string length

Question

I need to cleanup some data in a Pandas DataFrame and struggling with this.

Sample data:

Date       | ID     | Name             | Address
-----------------------------------------------------------------------------------------------
1-4-1987   | 124578 | T.Hilpert        | 518 Hessel Plaza Lake Lonzo, AZ 11863
23-6-1990  | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury
12-5-1960  | 746732 | Earline Schulist | 57367 Alfredo Vista East Bertaburgh
9-9-2010   | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
27-12-2017 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo

What I want to do is this. Group by ID, get the name from the most recent date AND get the longest address string. Use these for all occurrences of the ID (in two new columns: Name_new and Address_New). Pleas find desired sample below:

Date       | ID     | Name             | Address                                                | Name_New         | Address_New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
27-12-2017 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo                            | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863
1-4-1987   | 124578 | T. Hilpert       | 518 Hessel Plaza Lake Lonzo, AZ 11863                  | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863
23-6-1990  | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury                | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
9-9-2010   | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
12-5-1960  | 746732 | Earline Schulist | 57367 Alfredo Vista East Bertaburgh                    | Earline Schulist | 57367 Alfredo Vista East Bertaburgh

I've tried this but cant get it combined to get the desired result.

def f1(s):
    return max(s, key=len)

df_new = df['New_Address'] = df.groupby('ID').agg({'Address': f1})


df_new = df[df.groupby('ID').Date.transform('max') == df['Date']]

Your help is particularly appreciated.

jezrael · Accepted Answer · 2018-12-18 09:20:43Z

1

Use transform for return Series with same size like original DataFrame, then create index by Name column and get value by maximal Date by idxmax:

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df['Address_New'] = df.groupby('ID')['Address'].transform(lambda s: max(s, key=len))
df['Name_New'] = df.set_index('Name').groupby('ID')['Date'].transform('idxmax').values
print (df)
        Date      ID              Name  \
0 1987-04-01  124578         T.Hilpert   
1 1990-06-23  947383   Birdie Reynolds   
2 1960-05-12  746732  Earline Schulist   
3 2010-09-09  947383   Birdie Reynolds   
4 2017-12-27  124578  Theresia Hilpert   

                                             Address  \
0              518 Hessel Plaza Lake Lonzo, AZ 11863   
1            964 Weissnat Green Suite 568 Rennerbury   
2                57367 Alfredo Vista East Bertaburgh   
3  964 Weissnat Green Suite 568 Rennerbury, WV 16...   
4                        518 Hessel Plaza Lake Lonzo   

                                         Address_New          Name_New  
0              518 Hessel Plaza Lake Lonzo, AZ 11863  Theresia Hilpert  
1  964 Weissnat Green Suite 568 Rennerbury, WV 16...   Birdie Reynolds  
2                57367 Alfredo Vista East Bertaburgh  Earline Schulist  
3  964 Weissnat Green Suite 568 Rennerbury, WV 16...   Birdie Reynolds  
4              518 Hessel Plaza Lake Lonzo, AZ 11863  Theresia Hilpert

edited Dec 18, 2018 at 9:20

answered Dec 18, 2018 at 8:26

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

John Doe Over a year ago

I managed to change the addresses to the new New_Address column. I'd like to do the same for the Name (not creating a new DataFrame. I used it just for testing my code. I tried df['New_Name'] = df.groupby('ID')['Name'].Date.transform('max') == df['Date']. But then I've an error : AttributeError: 'SeriesGroupBy' object has no attribute 'Date'

jezrael Over a year ago

@JohnDoe - I think you are close, need df['New_Name'] = df.groupby('ID')['Name'].transform(lambda s: max(s, key=len))

John Doe Over a year ago

I want to get the name from the most recent date. For the address it's just the longest string.

jezrael Over a year ago

@JohnDoe - Not sure if understand - df.groupby('ID').Date.transform('max') return latest dates, df.groupby('ID').Name.transform('max') working too, it return returns the one that is at the bottom of the alphabetic list (see this) So need maximal length? Can you explain more?

John Doe Over a year ago

Many thanks. It works excellently!! And also thanks for the links for the explanation,

|

Collectives™ on Stack Overflow

Modifying data in DataFrame based on date and string length

1 Answer 1

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related