Ranking row values in a multi-row DataFrame

Question

I have a huge df that looks like this:

date	stock1	stock2	stock3	stock4	stock5	stock6	stock7	stock8	stock9	stock10
10/20	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	0.9
11/20	0.8	0.9	0.3	0.4	0.3	0.5	0.3	0.2	0.4	0.1
12/20	0.3	0.6	0.9	0.5	0.6	0.7	0.8	0.7	0.9	0.1

For each row, I want to find the stocks with the highest 20% of values and the stocks with the lowest 20%. The output should be:

date	higher	lower
10/20	stock9, stock10	stock1, stock2
11/20	stock1, stock2	stock8, stock10
12/20	stock3, stock9	stock1, stock10

The names do not need to be separated by commas; they could also be listed one below the other. I tried df = df.stack() to stack the data, intending to rank the values afterwards, but I do not know how to proceed.

What do you mean by "20% higher values"? Do you want the 2 highest and 2 lowest?
– not_speshal, Commented Feb 16, 2022 at 19:36
I mean the highest 20%. In this case that is the 2 highest and 2 lowest because there are only 10 values, but my original data has around 2000.
– user17717499, Commented Feb 16, 2022 at 19:39

not_speshal · Accepted Answer · 2022-02-16 20:23:41Z

3

Try with nlargest and nsmallest:

import pandas as pd

# df = df.set_index("date")  # uncomment if date is a column and not the index
n = round(len(df.columns) * 0.2)  # number of stocks in the top/bottom 20%

output = pd.DataFrame()
output["higher"] = df.apply(lambda x: x.nlargest(n).index.tolist(), axis=1)
output["lower"] = df.apply(lambda x: x.nsmallest(n).index.tolist(), axis=1)

>>> output
                  higher              lower
date                                       
10/20  [stock9, stock10]   [stock1, stock2]
11/20   [stock2, stock1]  [stock10, stock8]
12/20   [stock3, stock9]  [stock10, stock1]
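
For reference, here is a self-contained sketch of the same approach that rebuilds the question's sample data so it can be run as is. It assumes date is already the index; the frame is typed in by hand from the table in the question.

```python
import pandas as pd

# Sample data from the question, with "date" as the index
df = pd.DataFrame(
    {
        "stock1": [0.1, 0.8, 0.3],
        "stock2": [0.2, 0.9, 0.6],
        "stock3": [0.3, 0.3, 0.9],
        "stock4": [0.4, 0.4, 0.5],
        "stock5": [0.5, 0.3, 0.6],
        "stock6": [0.6, 0.5, 0.7],
        "stock7": [0.7, 0.3, 0.8],
        "stock8": [0.8, 0.2, 0.7],
        "stock9": [0.9, 0.4, 0.9],
        "stock10": [0.9, 0.1, 0.1],
    },
    index=pd.Index(["10/20", "11/20", "12/20"], name="date"),
)

# Number of stocks that make up 20% of the columns
n = round(len(df.columns) * 0.2)

# For each row, collect the column names of the n largest / n smallest values
output = pd.DataFrame(
    {
        "higher": df.apply(lambda row: row.nlargest(n).index.tolist(), axis=1),
        "lower": df.apply(lambda row: row.nsmallest(n).index.tolist(), axis=1),
    }
)

print(output)
```

Where values tie at the cut-off (for example stock3 and stock9 on 12/20), which names are kept depends on the keep argument of nlargest and nsmallest.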

Edit: If you want each stock name on a separate line, you can do:

output = pd.DataFrame()
output["higher"] = df.apply(lambda x: "\n".join(x.nlargest(n).index.tolist()), axis=1)
output["lower"] = df.apply(lambda x: "\n".join(x.nsmallest(n).index.tolist()), axis=1)
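
Another way to get one name per line, while keeping the values easy to access, might be to keep the lists and explode them into one row per stock. This is only a sketch: the list-valued frame is written out literally from the output shown earlier so the snippet runs on its own, and exploding two columns at once assumes pandas 1.3 or later and lists of equal length within each row.

```python
import pandas as pd

# List-valued result from the first snippet, typed out so this runs standalone
output = pd.DataFrame(
    {
        "higher": [["stock9", "stock10"], ["stock2", "stock1"], ["stock3", "stock9"]],
        "lower": [["stock1", "stock2"], ["stock10", "stock8"], ["stock10", "stock1"]],
    },
    index=pd.Index(["10/20", "11/20", "12/20"], name="date"),
)

# One stock name per row; the date index is repeated for each name
long_format = output.explode(["higher", "lower"])

print(long_format)
```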

edited Feb 16, 2022 at 20:23

answered Feb 16, 2022 at 19:42

not_speshal

5 Comments

user17717499 Over a year ago

I get KeyError: "None of ['date'] are in the columns". I ran df.columns and there is no column called "date", although I can see it when I display the dataset. I am going to add a screenshot to my post.

not_speshal Over a year ago

It is probably the index. Skip the first line and run the rest.

user17717499 Over a year ago

That worked, thanks. How could I format the output so that the names are not separated by commas but each appears on its own line?

not_speshal Over a year ago

@user17717499 - I edited my answer, but I would suggest not doing that, as accessing the elements later on will be easier with the list format.

user17717499 Over a year ago

There is one more thing I had not taken into account: the number of observations differs from row to row, so the "size" is different for each row. For example, if a row has 100 observations, I need 20 values in "higher" and 20 in "lower". But if a row has 200 observations, I need twice as many values in each. Do you know how I could integrate that into the code?

user7864386 · Answer · 2022-02-16 20:50:13Z

1

You can do it with a helper function that sorts values for each row:

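import pandas as pd

def get_top_bottom_20_pct(x):
    # Column names ordered from the largest to the smallest value in this row
    d = x.sort_values(ascending=False).index.tolist()
    # The first `size` names are the top 20%, the last `size` names the bottom 20%
    return [*map(', '.join, (d[:size], d[-size:]))]

stocks = df.set_index('date')      # move date out of the columns so it is not counted
size = int(0.2 * stocks.shape[1])  # number of stocks in the top/bottom 20%
s = stocks.apply(get_top_bottom_20_pct, axis=1)
out = pd.DataFrame(s.tolist(), index=s.index, columns=['higher','lower']).reset_index()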

If you have Python >=3.8, you can do the same with the walrus operator:

s = df.set_index('date').apply(lambda x: (', '.join((d := x.sort_values(ascending=False).index.tolist())[:size]), 
                                          ', '.join(d[-size:])), axis=1)
out = pd.DataFrame(s.tolist(), index=s.index, columns=['higher','lower']).reset_index()

Output:

    date           higher            lower
0  10/20  stock9, stock10   stock2, stock1
1  11/20   stock2, stock1  stock8, stock10
2  12/20   stock3, stock9  stock1, stock10

edited Feb 16, 2022 at 20:50

answered Feb 16, 2022 at 19:40

user7864386

1 Comment

user17717499 Over a year ago

There is one more thing I had not taken into account: the number of observations differs from row to row, so the "size" is different for each row. For example, if a row has 100 observations, I need 20 values in "higher" and 20 in "lower". But if a row has 200 observations, I need twice as many values in each. Do you know how I could integrate that into the code?
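
The thread does not answer this follow-up. One possible way to handle it, assuming that a missing observation is stored as NaN, is to compute the 20% cut-off per row from the non-missing values only. The helper pick_extremes and its frac parameter are hypothetical names used for illustration, and the small frame is made-up data, not the asker's.

```python
import numpy as np
import pandas as pd

def pick_extremes(row, largest, frac=0.2):
    """Names of the top (or bottom) `frac` of the row's non-missing values."""
    valid = row.dropna()                # assumption: missing observations are NaN
    n = round(len(valid) * frac)        # cut-off depends on this row's own size
    picker = valid.nlargest if largest else valid.nsmallest
    return picker(n).index.tolist()

# Made-up example: the first row has 10 observations, the second only 5
cols = [f"stock{i}" for i in range(1, 11)]
df = pd.DataFrame(
    [
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        [5.0, 4.0, 3.0, 2.0, 1.0] + [np.nan] * 5,
    ],
    index=pd.Index(["10/20", "11/20"], name="date"),
    columns=cols,
)

output = pd.DataFrame(
    {
        "higher": df.apply(pick_extremes, axis=1, largest=True),
        "lower": df.apply(pick_extremes, axis=1, largest=False),
    }
)

print(output)
```

Here the first row yields two names per column and the second row one, since 20% of 5 observations is 1. Note that Python's round rounds exact halves to the nearest even integer, so int or math.ceil may be preferable if a specific rounding rule is needed.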
