How to use groupby to avoid a loop in Python

Question

The data has several columns, three of which are named "candidate_id", "enddate", and "TitleLevel".

Within the same candidate_id, if two records have the same enddate, I want to delete the one with the lower TitleLevel.

For example, given:

candidate_id   startdate     enddate   TitleLevel
    1          2012.1.1      2013.5.1     2
    1          2011.1.1      2013.5.1     4
    1          2008.12.1     2010.1.1     3
    2          2010.10.1     2012.12.1    2

What I want is:

candidate_id   startdate     enddate   TitleLevel
    1          2011.1.1      2013.5.1     4
    1          2008.12.1     2010.1.1     3
    2          2010.10.1     2012.12.1    2

That is, the row with candidate_id=1, enddate=2013.5.1, and TitleLevel=2 is deleted.

I have come up with a loop.

for i in range(nrow-2, -1, -1):
    # Compare row i with row i+1; skip rows whose enddate is missing.
    if (JobData['enddate'][i] == JobData['enddate'][i+1]
            and JobData['candidate_id'][i] == JobData['candidate_id'][i+1]
            and pd.notnull(JobData['enddate'][i])):
        # Drop whichever of the two rows has the lower TitleLevel.
        if JobData['TitleLevel'][i] > JobData['TitleLevel'][i+1]:
            JobData = JobData.drop(i+1)
        else:
            JobData = JobData.drop(i)

The loop really takes some time to delete redundant rows. Is there a faster method?

If you can give some test data in the code, it will be easier for us to answer your question. Having said that, groupby is very easy to use. Just remember to sort the list of data before passing it to the function.
– Anthony Kong, Commented Nov 20, 2013 at 22:16
It's not just pandas. I'm just trying to find a way to speed up the code without using a for loop and if/else. The test data is shown in the question. For candidate_id=1 and enddate=2013.5.1, I want to delete the row whose TitleLevel is lower.
– user3013706, Commented Nov 20, 2013 at 22:19
@user3013706, true, but labeling with pandas is very helpful, because folks familiar with it will see your question.
– askewchan, Commented Nov 20, 2013 at 22:21
@user3013706 as you use pandas, one can give you advice based on the pandas API, not only on general Python builtins.
– alko, Commented Nov 20, 2013 at 22:21
The purpose of the code is to build a statistical model, so I read in the CSV file using pandas. OK, I will put "pandas" in the label :)
– user3013706, Commented Nov 20, 2013 at 22:21
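
For readers who want the plain-Python route hinted at in the first comment, here is a minimal sketch using itertools.groupby. The tuple layout and the names rows, key, and kept are assumptions made for illustration, and the sketch assumes no enddate is missing. itertools.groupby only merges adjacent items with equal keys, which is why the data is sorted first.

```python
from itertools import groupby
from operator import itemgetter

# Rows as (candidate_id, startdate, enddate, TitleLevel), copied from the question.
rows = [
    (1, "2012.1.1", "2013.5.1", 2),
    (1, "2011.1.1", "2013.5.1", 4),
    (1, "2008.12.1", "2010.1.1", 3),
    (2, "2010.10.1", "2012.12.1", 2),
]

key = itemgetter(0, 2)  # group on (candidate_id, enddate)

# Sort so that equal keys are adjacent, then keep the row with the
# highest TitleLevel (position 3) from each group.
kept = [
    max(group, key=itemgetter(3))
    for _, group in groupby(sorted(rows, key=key), key=key)
]
print(kept)
```

Note that the result comes back in sorted key order, not in the original row order.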

alko · Accepted Answer · 2013-11-20 22:54:09Z

2

If your data structure is exactly as you describe, you can use groupby/max:

>>> df
   candidate_id    enddate  TitleLevel
0             1   2013.5.1           2
1             1   2013.5.1           4
2             1   2010.1.1           3
3             2  2012.12.1           2
>>> df.groupby(['candidate_id','enddate']).max().reset_index()
   candidate_id    enddate  TitleLevel
0             1   2010.1.1           3
1             1   2013.5.1           4
2             2  2012.12.1           2

Here groupby groups rows with equal candidate_id and enddate, and max() computes the maximum TitleLevel within each group. The result is the same as if the rows with lower TitleLevel values had been dropped.

In case you have more columns,

>>> df
   candidate_id    enddate  TitleLevel other_column
0             1   2013.5.1           2          foo
1             1   2013.5.1           4          bar
2             1   2010.1.1           3       foobar
3             2  2012.12.1           2       barfoo

you can get the indexes of the rows with the max values, passing sort=False if the row order has to be preserved:

>>> idx = df.groupby(['candidate_id','enddate'], sort=False)['TitleLevel'].agg(lambda x: x.idxmax())

and select the needed rows with loc (the ix indexer used in the original answer was removed in pandas 1.0):

>>> df.loc[idx]
   candidate_id    enddate  TitleLevel other_column
1             1   2013.5.1           4          bar
2             1   2010.1.1           3       foobar
3             2  2012.12.1           2       barfoo
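
As a self-contained version of the steps above for current pandas, the following sketch rebuilds the example frame and calls idxmax directly on the grouped column. The variable name result and the final sort_index() call are additions for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "candidate_id": [1, 1, 1, 2],
    "enddate": ["2013.5.1", "2013.5.1", "2010.1.1", "2012.12.1"],
    "TitleLevel": [2, 4, 3, 2],
    "other_column": ["foo", "bar", "foobar", "barfoo"],
})

# Index label of the highest-TitleLevel row in each (candidate_id, enddate) group.
idx = df.groupby(["candidate_id", "enddate"], sort=False)["TitleLevel"].idxmax()

# Select those rows by label; sort_index() puts them back in original row order.
result = df.loc[idx].sort_index()
print(result)
```

One caveat: by default groupby leaves out rows whose key is missing, so rows with a null enddate would not appear in result, whereas the loop in the question keeps them.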

edited Nov 20, 2013 at 22:54

answered Nov 20, 2013 at 22:24


3 Comments

user3013706 Over a year ago

But what if I still want to keep the original order of "enddate"? Your code seems to sort the enddate within candidate_id. Also, there are some other columns; I just extracted these as an example.

alko Over a year ago

@user3013706 you can use the sort=False param, and ix/idxmax instead of max. See the updated code.

user3013706 Over a year ago

In my case, I think groupby(['candidate_id','enddate'],sort=False)['TitleLevel'].agg(lambda x:x.max()) is right, because I need the max itself, not its index. However, after using the code, the other columns (everything except candidate_id, enddate, TitleLevel) are missing.
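
On the missing columns: aggregating a single selected column returns only the group keys and that column, so the rest of the frame is gone. One possible way to keep every column is to broadcast the group maximum back onto the rows with transform and filter on it. This is a sketch, not something from the answer; group_max and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "candidate_id": [1, 1, 1, 2],
    "enddate": ["2013.5.1", "2013.5.1", "2010.1.1", "2012.12.1"],
    "TitleLevel": [2, 4, 3, 2],
    "other_column": ["foo", "bar", "foobar", "barfoo"],
})

# transform("max") returns a Series aligned with df: each row gets the
# maximum TitleLevel of its own (candidate_id, enddate) group.
group_max = df.groupby(["candidate_id", "enddate"])["TitleLevel"].transform("max")

# Keep the rows that hold their group's maximum; all columns and the
# original row order are preserved.
result = df[df["TitleLevel"] == group_max]
print(result)
```

If two rows in a group tie for the highest TitleLevel, this keeps both of them.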

Andy Hayden · Accepted Answer · 2013-11-20 23:10:48Z

1

Assuming that data is sorted by startdate (at least within each group), you can use groupby last:

In [11]: df.groupby(['candidate_id', 'enddate'], as_index=False).last()
Out[11]: 
   candidate_id    enddate  startdate  TitleLevel
0             1   2010.1.1  2008.12.1           3
1             1   2013.5.1   2011.1.1           4
2             2  2012.12.1  2010.10.1           2

answered Nov 20, 2013 at 23:10


1 Comment

user3013706 Over a year ago

Sorting takes time, so I didn't sort the data by startdate. What I want is to keep the row with the highest TitleLevel within the same candidate_id and enddate. Do you have any ideas for doing that without sorting by startdate? Thank you!
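
One possible approach that does not depend on startdate order is sketched below. It still sorts, but by TitleLevel, so that the highest level in each group comes last; drop_duplicates then keeps that last row per (candidate_id, enddate). The name result and the final sort_index() call are illustrative additions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "candidate_id": [1, 1, 1, 2],
    "startdate": ["2012.1.1", "2011.1.1", "2008.12.1", "2010.10.1"],
    "enddate": ["2013.5.1", "2013.5.1", "2010.1.1", "2012.12.1"],
    "TitleLevel": [2, 4, 3, 2],
})

result = (
    # Stable sort so rows with equal TitleLevel keep their relative order.
    df.sort_values("TitleLevel", kind="mergesort")
      # Within each (candidate_id, enddate) pair keep the last row,
      # which is the one with the highest TitleLevel.
      .drop_duplicates(subset=["candidate_id", "enddate"], keep="last")
      # Restore the original row order.
      .sort_index()
)
print(result)
```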
