multiple entries with same index columns in pandas manipulation

Question

I have the following data. Note for each combination of FPEDATS and CUSIP, there are multiple STATPERS.

CUSIP,STATPERS,NUMEST,FPEDATS
87482X10,20140417,4,20141231
87482X10,20140515,4,20141231
87482X10,20140619,5,20141231
12345678,20140717,3,20141231
12345678,20140814,5,20141231
87482X10,20140417,1,20151231
87482X10,20140515,2,20151231
87482X10,20140619,3,20151231
12345678,20140717,4,20151231
12345678,20140814,5,20151231

If I use panda.read_csv with columns = ["FPEDATS", "CUSIP"]. I get the following.

                      STATPERS    NUMEST
CUSIP    FPEDATS                      
87482X10 2014-12-31 2014-04-17       4
         2014-12-31 2014-05-15       4
         2014-12-31 2014-06-19       5
12345678 2014-12-31 2014-07-17       3
         2014-12-31 2014-08-14       5
87482X10 2015-12-31 2014-04-17       1
         2015-12-31 2014-05-15       2
         2015-12-31 2014-06-19       3
12345678 2015-12-31 2014-07-17       4
         2015-12-31 2014-08-14       5

For each combination of CUSIP and FPEDATS, I am interested in:

can we only keep the one with the the largest date stamp and delete the other rows? (Basically filter by largest STATPERS)
can we calculate an average of the the values of NUMEST for each combination of indices, either save this or replace the current data with it?
can we, for example, assign a new attribute, for example "price" which is common to each combination of CUSIP and FPEDATS.

For 2, I mean something like:

                      STATPERS    NUMEST
CUSIP    FPEDATS
12345678 2014-12-31 2014-08-14      5
         2015-12-31 2014-08-14      5
87482X10 2014-12-31 2014-06-19      5
         2015-12-31 2014-06-19      3

For 3, I mean I want something like

                      STATPERS  NUMEST  Price
CUSIP    FPEDATS                      
87482X10 2014-12-31 2014-04-17       4   2
         2014-12-31 2014-05-15       4   2
         2014-12-31 2014-06-19       5   2
12345678 2014-12-31 2014-07-17       3   4
         2014-12-31 2014-08-14       5   4
87482X10 2015-12-31 2014-04-17       1   5
         2015-12-31 2014-05-15       2   5
         2015-12-31 2014-06-19       3   5
12345678 2015-12-31 2014-07-17       4   7
         2015-12-31 2014-08-14       5   7

when I only want to parse in [("CUSIP", "FPEDATS", "PRICE")] (as a list of tuple or a Panda Series)? Is there a slick way of doing that?

So far, the only way I came up with solving all of these problems is to use csv reader in the csv package to read the csvs, then throw everything back to panda format. I hope there is a simpler way of doing this.

UPDATE: I managed to do 2 by a MultiIndexing approach. basically calculate The MultiIndex by using

c = analyst_file.groupby(["CUSIP", "FPEDATS"])["STATPERS"].max()

Then we write these indices as a list of tuples and use slicing. Still would like do things a bit more slickly.

MaxU - stand with Ukraine · Accepted Answer · 2017-07-16 16:00:52Z

1

assuming we have the following DF:

In [193]: df
Out[193]:
                      STATPERS  NUMEST
CUSIP    FPEDATS
87482X10 2014-12-31 2014-04-17       4
         2014-12-31 2014-05-15       4
         2014-12-31 2014-06-19       5
12345678 2014-12-31 2014-07-17       3
         2014-12-31 2014-08-14       5
87482X10 2015-12-31 2014-04-17       1
         2015-12-31 2014-05-15       2
         2015-12-31 2014-06-19       3
12345678 2015-12-31 2014-07-17       4
         2015-12-31 2014-08-14       5

we can do the following:

In [192]: df.groupby(level=[0,1]).agg({'STATPERS':'max','NUMEST':'mean'})
Out[192]:
                      STATPERS    NUMEST
CUSIP    FPEDATS
12345678 2014-12-31 2014-08-14  4.000000
         2015-12-31 2014-08-14  4.500000
87482X10 2014-12-31 2014-06-19  4.333333
         2015-12-31 2014-06-19  2.000000

PS please elaborate on common to each combination of CUSIP and FPEDATS - it's not clear...

answered Jul 16, 2017 at 16:00

MaxU - stand with Ukraine

212k37 gold badges402 silver badges436 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Lost1 Over a year ago

thank you for your answer, I have updated my question. I found the groupby().max() but that didn't quite do the trick.

Lost1 Over a year ago

maybe I should have made it clearer, what you have written is not quite what I am looking for though I was also interested in doing what you have written. My question 1 and question 2 were separate questions.

Lost1 Over a year ago

STATPERS is unique for each combination of CUSIP and FPEDATS, so STATPERS NUMEST CUSIP alone can be used identify the NUMEST uniquely. I wanted to find the max STATPERS for each combination of CUSIP and FPEDATS. Then find the NUMEST corresponding to each CUSIP, FPEDATS and NUMEST.

MaxU - stand with Ukraine Over a year ago

@Lost1, SO encorauges you to ask one question per post. Beside that it's highly recommended to post a desired data set when asking Numpy/Pandas/TensorFlow/etc. questions... Please read how to make good reproducible pandas examples and edit your post correspondingly.

Lost1 Over a year ago

noted, I will do that in the future. I have updated the question. thanks a lot for your help

|

Collectives™ on Stack Overflow

multiple entries with same index columns in pandas manipulation

1 Answer 1

7 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

7 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related