1

I have a csv file with multiple columns of which I want to return these 4 columns. There exist duplicates in the Case_ID column with a different start and or completion time. The goal is to return the case_id with the min start_time and max completion_time and the corresponding casetype.

Case_ID         Start_time   Completion_time   Casetype 

BZ-06-0501-001  2006-10-24  2007-06-15         'string'
BZ-06-0507-001  2006-10-30  2007-06-18         'string'
BZ-06-0508-001  2006-10-30  2008-09-08         'string'
BZ-06-0509-001  2006-10-30  2008-09-08         'string'

This is the code I use

def f():
    appeals = pd.read_csv('appeals')
    min_max = appeals[['Case_ID','Start_time','Completion_time']].groupby('Case_ID').agg({'Start_time' : ['min'], 'Completion_time' : ['max']}).assign(Casetype = appeals['Casetype'])
    return min_max 
    

The problem is that the casetype column returns as nan. So the column the values in the column are not seen as a string anymore. Did I use assign() wrongly?

2
  • are the casetypes all the same for a given case_ID? Commented Dec 11, 2020 at 18:23
  • Yes, they are indeed Commented Dec 11, 2020 at 18:27

1 Answer 1

2

One possible solution:

min_max = appeals.groupby('Case_ID').agg({'Start_time' : 'min', 'Completion_time' : 'max', 'Casetype': 'first'}).reset_index()

I removed your selection of columns ([['Case_ID','Start_time','Completion_time']]) in order to keep all the columns, including Casetype. Then, I added Casetype in the aggregation, with the selection of the first occurence, as you said all Casetype are the same for a given Case_ID. Then, I removed the .assign().

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.