I have a huge df that looks like this:

date stock1 stock2 stock3 stock4 stock5 stock6 stock7 stock8 stock9 stock10
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.4 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1

For each row, I want to find the stocks whose values are in the top 20% and the stocks whose values are in the bottom 20%. The output should be:

date   higher           lower
10/20  stock9, stock10  stock1, stock2
11/20  stock1, stock2   stock8, stock10
12/20  stock3, stock9   stock1, stock10

I do not need the commas between the names; they could also be listed one below the other. I tried stacking with df = df.stack() and then ranking the values inside the columns, but I do not know how to proceed from there.

  • What do you mean by "20% higher values"? Do you want the 2 highest and 2 lowest? Commented Feb 16, 2022 at 19:36
  • I mean the 20% of the highest. In this case, it is the 2 highest and 2 lowest because there are only 10 values. But in my original I have around 2000. Commented Feb 16, 2022 at 19:39

2 Answers

Try with nlargest and nsmallest:

import pandas as pd

# df = df.set_index("date")  # uncomment if date is a column and not the index
n = round(len(df.columns) * 0.2)  # number of stocks in the top/bottom 20%

output = pd.DataFrame()
output["higher"] = df.apply(lambda x: x.nlargest(n).index.tolist(), axis=1)
output["lower"] = df.apply(lambda x: x.nsmallest(n).index.tolist(), axis=1)

>>> output
                  higher              lower
date                                       
10/20  [stock9, stock10]   [stock1, stock2]
11/20   [stock2, stock1]  [stock10, stock8]
12/20   [stock3, stock9]  [stock10, stock1]

Edit: If you want each stock name on a separate line, you can do:

output = pd.DataFrame()
output["higher"] = df.apply(lambda x: "\n".join(x.nlargest(n).index.tolist()), axis=1)
output["lower"] = df.apply(lambda x: "\n".join(x.nsmallest(n).index.tolist()), axis=1)
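
To check the snippets in this answer end to end, here is a minimal sketch that rebuilds the sample from the question and runs the list version. It assumes date is the index; the names rows and columns are only for illustration.

```python
import pandas as pd

# Sample values copied from the question, one inner list per date.
rows = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9],
    [0.8, 0.9, 0.3, 0.4, 0.3, 0.5, 0.3, 0.2, 0.4, 0.1],
    [0.3, 0.6, 0.9, 0.5, 0.6, 0.7, 0.8, 0.7, 0.9, 0.1],
]
columns = [f"stock{i}" for i in range(1, 11)]
df = pd.DataFrame(
    rows,
    columns=columns,
    index=pd.Index(["10/20", "11/20", "12/20"], name="date"),
)

# Number of stocks that make up 20% of the columns (2 of 10 here).
n = round(len(df.columns) * 0.2)

# Each cell holds the list of column names picked for that row.
output = pd.DataFrame(
    {
        "higher": df.apply(lambda x: x.nlargest(n).index.tolist(), axis=1),
        "lower": df.apply(lambda x: x.nsmallest(n).index.tolist(), axis=1),
    }
)
print(output)
```

This should reproduce the output shown above. When values are tied, nlargest and nsmallest keep the first occurrence by default, which is why stock9 is listed before stock10 for 10/20.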
5 Comments

I get KeyError: "None of ['date'] are in the columns". I executed "df.columns" and there is no column called "date". However, I can see it when I display the dataset. I am going to add a screenshot to my post.
It is probably the index. Ignore the first line and do the rest then.
Now it works, thanks. How could I format the output so that the names are not separated by commas, but each one is on its own line?
@user17717499 - I edited my answer, but I would suggest not doing that, as accessing the elements later on will be easier with the list format.
There is one more thing I had not taken into account before: the number of observations is different for each row, and so the "size" is different for each row. For example, if a row has 100 observations, I need 20 values in "higher" and 20 in "lower". But if the row has 200 observations, then I will have double the number of values for "higher" and "lower". Do you know how I could integrate that into the code?
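
The last comment was not answered in the thread. One possible approach, sketched here under the assumption that a missing observation is stored as NaN, is to drop the NaNs inside each row and compute the 20% cut from what is left. The helper name pick and the sample values are hypothetical, not taken from the question.

```python
import numpy as np
import pandas as pd

def pick(row, largest, frac=0.2):
    """Names of the top (or bottom) `frac` of the non-missing values in one row."""
    valid = row.dropna()                # keep only the observations this row has
    k = round(len(valid) * frac)        # per-row count, e.g. 2 of 10 or 1 of 5
    picked = valid.nlargest(k) if largest else valid.nsmallest(k)
    return picked.index.tolist()

# Hypothetical data: the second row has only 5 observations (the rest are NaN).
df = pd.DataFrame(
    [
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        [0.8, 0.9, 0.3, 0.4, 0.5, np.nan, np.nan, np.nan, np.nan, np.nan],
    ],
    columns=[f"stock{i}" for i in range(1, 11)],
    index=pd.Index(["10/20", "11/20"], name="date"),
)

output = pd.DataFrame(
    {
        "higher": df.apply(lambda r: pick(r, largest=True), axis=1),
        "lower": df.apply(lambda r: pick(r, largest=False), axis=1),
    }
)
print(output)
```

With this data the first row yields two names per column and the second row yields one. Note that Python's round rounds halves to the nearest even number (round(2.5) is 2), so swap in math.ceil or int if a different rounding rule is needed.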
You can do it with a helper function that sorts values for each row:

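import pandas as pd

stocks = df.set_index('date')      # keep only the stock columns
size = int(0.2 * stocks.shape[1])  # count the stocks only, not the date column

def get_top_bottom_20_pct(x):
    # stock names ordered from the highest to the lowest value in this row
    d = x.sort_values(ascending=False).index.tolist()
    return [*map(', '.join, (d[:size], d[-size:]))]

s = stocks.apply(get_top_bottom_20_pct, axis=1)
out = pd.DataFrame(s.tolist(), index=s.index, columns=['higher','lower']).reset_index()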

If you have Python >=3.8, you can do the same with the walrus operator:

s = df.set_index('date').apply(lambda x: (', '.join((d := x.sort_values(ascending=False).index.tolist())[:size]), 
                                          ', '.join(d[-size:])), axis=1)
out = pd.DataFrame(s.tolist(), index=s.index, columns=['higher','lower']).reset_index()

Output:

    date           higher            lower
0  10/20  stock9, stock10   stock2, stock1
1  11/20   stock2, stock1  stock8, stock10
2  12/20   stock3, stock9  stock1, stock10
