0

I have the following data. Note for each combination of FPEDATS and CUSIP, there are multiple STATPERS.

CUSIP,STATPERS,NUMEST,FPEDATS
87482X10,20140417,4,20141231
87482X10,20140515,4,20141231
87482X10,20140619,5,20141231
12345678,20140717,3,20141231
12345678,20140814,5,20141231
87482X10,20140417,1,20151231
87482X10,20140515,2,20151231
87482X10,20140619,3,20151231
12345678,20140717,4,20151231
12345678,20140814,5,20151231

If I use panda.read_csv with columns = ["FPEDATS", "CUSIP"]. I get the following.

                      STATPERS    NUMEST
CUSIP    FPEDATS                      
87482X10 2014-12-31 2014-04-17       4
         2014-12-31 2014-05-15       4
         2014-12-31 2014-06-19       5
12345678 2014-12-31 2014-07-17       3
         2014-12-31 2014-08-14       5
87482X10 2015-12-31 2014-04-17       1
         2015-12-31 2014-05-15       2
         2015-12-31 2014-06-19       3
12345678 2015-12-31 2014-07-17       4
         2015-12-31 2014-08-14       5

For each combination of CUSIP and FPEDATS, I am interested in:

  1. can we only keep the one with the the largest date stamp and delete the other rows? (Basically filter by largest STATPERS)
  2. can we calculate an average of the the values of NUMEST for each combination of indices, either save this or replace the current data with it?
  3. can we, for example, assign a new attribute, for example "price" which is common to each combination of CUSIP and FPEDATS.

For 2, I mean something like:

                      STATPERS    NUMEST
CUSIP    FPEDATS
12345678 2014-12-31 2014-08-14      5
         2015-12-31 2014-08-14      5
87482X10 2014-12-31 2014-06-19      5
         2015-12-31 2014-06-19      3

For 3, I mean I want something like

                      STATPERS  NUMEST  Price
CUSIP    FPEDATS                      
87482X10 2014-12-31 2014-04-17       4   2
         2014-12-31 2014-05-15       4   2
         2014-12-31 2014-06-19       5   2
12345678 2014-12-31 2014-07-17       3   4
         2014-12-31 2014-08-14       5   4
87482X10 2015-12-31 2014-04-17       1   5
         2015-12-31 2014-05-15       2   5
         2015-12-31 2014-06-19       3   5
12345678 2015-12-31 2014-07-17       4   7
         2015-12-31 2014-08-14       5   7

when I only want to parse in [("CUSIP", "FPEDATS", "PRICE")] (as a list of tuple or a Panda Series)? Is there a slick way of doing that?

So far, the only way I came up with solving all of these problems is to use csv reader in the csv package to read the csvs, then throw everything back to panda format. I hope there is a simpler way of doing this.

UPDATE: I managed to do 2 by a MultiIndexing approach. basically calculate The MultiIndex by using

c = analyst_file.groupby(["CUSIP", "FPEDATS"])["STATPERS"].max()

Then we write these indices as a list of tuples and use slicing. Still would like do things a bit more slickly.

1 Answer 1

1

assuming we have the following DF:

In [193]: df
Out[193]:
                      STATPERS  NUMEST
CUSIP    FPEDATS
87482X10 2014-12-31 2014-04-17       4
         2014-12-31 2014-05-15       4
         2014-12-31 2014-06-19       5
12345678 2014-12-31 2014-07-17       3
         2014-12-31 2014-08-14       5
87482X10 2015-12-31 2014-04-17       1
         2015-12-31 2014-05-15       2
         2015-12-31 2014-06-19       3
12345678 2015-12-31 2014-07-17       4
         2015-12-31 2014-08-14       5

we can do the following:

In [192]: df.groupby(level=[0,1]).agg({'STATPERS':'max','NUMEST':'mean'})
Out[192]:
                      STATPERS    NUMEST
CUSIP    FPEDATS
12345678 2014-12-31 2014-08-14  4.000000
         2015-12-31 2014-08-14  4.500000
87482X10 2014-12-31 2014-06-19  4.333333
         2015-12-31 2014-06-19  2.000000

PS please elaborate on common to each combination of CUSIP and FPEDATS - it's not clear...

Sign up to request clarification or add additional context in comments.

7 Comments

thank you for your answer, I have updated my question. I found the groupby().max() but that didn't quite do the trick.
maybe I should have made it clearer, what you have written is not quite what I am looking for though I was also interested in doing what you have written. My question 1 and question 2 were separate questions.
STATPERS is unique for each combination of CUSIP and FPEDATS, so STATPERS NUMEST CUSIP alone can be used identify the NUMEST uniquely. I wanted to find the max STATPERS for each combination of CUSIP and FPEDATS. Then find the NUMEST corresponding to each CUSIP, FPEDATS and NUMEST.
@Lost1, SO encorauges you to ask one question per post. Beside that it's highly recommended to post a desired data set when asking Numpy/Pandas/TensorFlow/etc. questions... Please read how to make good reproducible pandas examples and edit your post correspondingly.
noted, I will do that in the future. I have updated the question. thanks a lot for your help
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.