I'm having trouble filtering out duplicate rows that share the same ticker, using conditions on other columns (a date column and an integer column). The initial dataset looks like this:

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16  12/20/16    12/20/17    -81
1   AA        ART      9/30/16   12/1/16     12/1/17    -62
2   AA        ART      9/30/16   12/1/16      2/8/18   -131
3   AA        ART      9/30/16    2/8/17     12/1/17    -62
4   AA        ART      9/30/16    2/8/17      2/8/18   -131
5   AABA      ART      9/30/16   11/9/16     11/9/17    -40
6   AAC       ART      9/30/16   11/8/16     11/8/17    -39
7   AAL       ART      9/30/16  10/20/16    10/20/17    -20
8   AAMC      ART      9/30/16   11/7/16     11/7/17    -38
9   AAME      ART      9/30/16  11/14/16    11/14/17    -45
36  ABMT      ART      9/30/16   2/14/17     2/14/18    -137
37  ABMT      ART      9/30/16   2/14/17     2/16/18    -139
38  ABMT      ART      9/30/16   2/16/17     2/14/18    -137

Notice that AA is repeated 4 times and ABMT is repeated 3 times. I would like to filter out some of these rows based on two conditions. The first keeps, for each ticker, only the rows with the earliest date0, so the dataset now looks like this:

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16   12/20/16   12/20/17    -81
1   AA        ART      9/30/16    12/1/16    12/1/17    -62
2   AA        ART      9/30/16    12/1/16     2/8/18   -131
5   AABA      ART      9/30/16    11/9/16    11/9/17    -40
6   AAC       ART      9/30/16    11/8/16    11/8/17    -39
7   AAL       ART      9/30/16   10/20/16   10/20/17    -20
8   AAMC      ART      9/30/16    11/7/16    11/7/17    -38
9   AAME      ART      9/30/16   11/14/16   11/14/17    -45
36  ABMT      ART      9/30/16    2/14/17    2/14/18    -137
37  ABMT      ART      9/30/16    2/14/17    2/16/18    -139

The second condition removes, for each ticker, the rows with the lowest diff value, which gives the final result. The fully filtered dataset looks like this:

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16   12/20/16   12/20/17    -81
1   AA        ART      9/30/16    12/1/16    12/1/17    -62
5   AABA      ART      9/30/16    11/9/16    11/9/17    -40
6   AAC       ART      9/30/16    11/8/16    11/8/17    -39
7   AAL       ART      9/30/16   10/20/16   10/20/17    -20
8   AAMC      ART      9/30/16    11/7/16    11/7/17    -38
9   AAME      ART      9/30/16   11/14/16   11/14/17    -45
36  ABMT      ART      9/30/16    2/14/17    2/14/18    -137

Thank you for your help.


EDIT:

After Wen's answer, I've updated my code to the following:

import pandas as pd
data = pd.read_csv('input_transform.csv')
print(data)

returns:

    Unnamed: 0 ticker  dim cal_date     date0     date1  diff
 0           0      A  ART  9/30/16  12/20/16  12/20/17   -81
 1           1     AA  ART  9/30/16   12/1/16   12/1/17   -62
 2           2     AA  ART  9/30/16   12/1/16    2/8/18  -131
 3           3     AA  ART  9/30/16    2/8/17   12/1/17   -62
 4           4     AA  ART  9/30/16    2/8/17    2/8/18  -131
 5           5   AABA  ART  9/30/16   11/9/16   11/9/17   -40
 6           6    AAC  ART  9/30/16   11/8/16   11/8/17   -39
 7           7    AAL  ART  9/30/16  10/20/16  10/20/17   -20
 8           8   AAMC  ART  9/30/16   11/7/16   11/7/17   -38
 9           9   AAME  ART  9/30/16  11/14/16  11/14/17   -45
10          36   ABMT  ART  9/30/16   2/14/17   2/14/18  -137
11          37   ABMT  ART  9/30/16   2/14/17   2/16/18  -139
12          38   ABMT  ART  9/30/16   2/16/17   2/14/18  -137

Then I add:

# making sure the date is in date format.
data['date0'] = pd.to_datetime(data['date0'].str.replace("'", ""))
# making sure the diff is in float or int format
data['diff'] = data['diff'].astype(float)

data.sort_values(['date0', 'diff'], ascending=[False, True]).drop_duplicates('ticker', keep='last').sort_index()
print(data)

Which returns:

    Unnamed: 0 ticker  dim cal_date      date0     date1   diff
 0           0      A  ART  9/30/16 2016-12-20  12/20/17  -81.0
 1           1     AA  ART  9/30/16 2016-12-01   12/1/17  -62.0
 2           2     AA  ART  9/30/16 2016-12-01    2/8/18 -131.0
 3           3     AA  ART  9/30/16 2017-02-08   12/1/17  -62.0
 4           4     AA  ART  9/30/16 2017-02-08    2/8/18 -131.0
 5           5   AABA  ART  9/30/16 2016-11-09   11/9/17  -40.0
 6           6    AAC  ART  9/30/16 2016-11-08   11/8/17  -39.0
 7           7    AAL  ART  9/30/16 2016-10-20  10/20/17  -20.0
 8           8   AAMC  ART  9/30/16 2016-11-07   11/7/17  -38.0
 9           9   AAME  ART  9/30/16 2016-11-14  11/14/17  -45.0
10          36   ABMT  ART  9/30/16 2017-02-14   2/14/18 -137.0
11          37   ABMT  ART  9/30/16 2017-02-14   2/16/18 -139.0
12          38   ABMT  ART  9/30/16 2017-02-16   2/14/18 -137.0

So unfortunately, so far, no luck.

  • Should AA -131 be removed? Commented Mar 25, 2018 at 20:07
  • Yes, AA -131 (row 2), I'll edit it. Commented Mar 25, 2018 at 20:09

1 Answer

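Use sort_values followed by drop_duplicates. Sorting by date0 descending and diff ascending puts the row with the earliest date0 and the highest diff last within each ticker, so keep='last' retains exactly that row: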

df.sort_values(['date0','diff'],ascending=[False,True]).drop_duplicates('ticker',keep='last').sort_index()
Out[1071]: 
   ticker  dim cal_date     date0     date1  diff
0       A  ART  9/30/16  12/20/16  12/20/17   -81
1      AA  ART  9/30/16   12/1/16   12/1/17   -62
5    AABA  ART  9/30/16   11/9/16   11/9/17   -40
6     AAC  ART  9/30/16   11/8/16   11/8/17   -39
7     AAL  ART  9/30/16  10/20/16  10/20/17   -20
8    AAMC  ART  9/30/16   11/7/16   11/7/17   -38
9    AAME  ART  9/30/16  11/14/16  11/14/17   -45
36   ABMT  ART  9/30/16   2/14/17   2/14/18  -137
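
One caveat worth noting: in the output above date0 is still a string column, so the sort compares text rather than dates. The sketch below is a minimal, self-contained variant that converts date0 to datetimes first. The trimmed-down frame and the `%m/%d/%y` format are assumptions based on the sample rows in the question, not part of the original answer.

```python
import pandas as pd

# A reduced copy of the question's sample (assumed columns only).
df = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "AA", "AA", "ABMT", "ABMT", "ABMT"],
        "date0": ["12/20/16", "12/1/16", "12/1/16", "2/8/17", "2/8/17",
                  "2/14/17", "2/14/17", "2/16/17"],
        "diff": [-81, -62, -131, -62, -131, -137, -139, -137],
    },
    index=[0, 1, 2, 3, 4, 36, 37, 38],
)

# Parse the dates so that sorting is chronological, not lexicographic.
df["date0"] = pd.to_datetime(df["date0"], format="%m/%d/%y")

# Latest date0 first, lowest diff first: the row we want ends up last
# within each ticker, and keep='last' retains it.
result = (
    df.sort_values(["date0", "diff"], ascending=[False, True])
      .drop_duplicates("ticker", keep="last")
      .sort_index()
)
print(result)
```

Sorting with ascending=[True, False] and keep='first' should select the same rows; which form to use is a matter of taste.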

3 Comments

Thank you for your reply, unfortunately, it doesn't seem to work. I'll edit my question with a sample of my code so that you can see what it's returning.
@michael0196 you forgot to assign it back: data = data.sort_values(['date0', 'diff'], ascending=[False, True]).drop_duplicates('ticker', keep='last').sort_index()
Ohh.. Sorry, I'm an idiot haha. Still new to python. Thank you so much.
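
For readers who hit the same issue: sort_values, drop_duplicates and sort_index each return a new DataFrame by default rather than modifying the original. A small illustration with a made-up three-row frame (the data here is hypothetical, chosen only to show the effect):

```python
import pandas as pd

data = pd.DataFrame({"ticker": ["AA", "AA", "A"], "diff": [-131, -62, -81]})

# Without assignment the result is discarded and `data` is untouched.
data.sort_values("diff").drop_duplicates("ticker", keep="last")
unchanged_len = len(data)  # still all three rows

# Binding the result back to the name keeps the filtered frame.
data = data.sort_values("diff").drop_duplicates("ticker", keep="last").sort_index()
print(data)
```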
