Efficient selection of rows in Pandas dataframe based on multiple conditions across columns

Question

I am trying to create a new pandas dataframe based on conditions. This is the original dataframe:

        topic1 topic2 
name1    1      4
name2    4      4
name3    4      3
name4    4      4
name5    2      4
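
For anyone who wants to reproduce this, the frame above can be rebuilt with a few lines. This is only a sketch: the variable name `data` is chosen to match the loop further down, and integer columns are an assumption based on the printed values.

```python
import pandas as pd

# Rebuild the sample frame shown above; row labels become the index
data = pd.DataFrame(
    {"topic1": [1, 4, 4, 4, 2], "topic2": [4, 4, 3, 4, 4]},
    index=["name1", "name2", "name3", "name4", "name5"],
)
print(data)
```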

I want to select arbitrary rows so that topic1 == 4 appears 2 times and topic2 == 4 appears 3 times in the new dataframe. Once this is fulfilled, I want to stop the code.

bucket1_topic1 = 2
bucket1_topic2 = 3

I wrote this pretty convoluted starter that is 'almost' working, but I am having issues dealing with rows that fulfil the conditions for both topic1 and topic2. What is a more efficient and correct way to do this?

rows_list = []

counter1 = 0
counter2 = 0

for index,row in data.iterrows():
    if counter1 < bucket1_topic1:
        if row.topic1 == 4:
            counter1 +=1
            rows_list.append([row[1], row.topic1, row.topic2])

    if counter2 < bucket1_topic2:
        if row.topic2 == 4 and row.topic1 !=4:
            counter2 +=1
            if [row[1], row.topic1, row.topic2] not in rows_list:
                rows_list.append([row[1], row.topic1, row.topic2])
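
One possible way to treat rows that match both conditions is to decide per row whether accepting it would push either counter past its target, and to update both counters together. The sketch below is illustrative only, not a confirmed fix: the names `hit1`, `hit2`, `selected` and `result` are hypothetical, skipping rows that contain no 4 at all is an assumption, and a single greedy pass like this may fail to reach both targets for some row orders.

```python
import pandas as pd

data = pd.DataFrame(
    {"topic1": [1, 4, 4, 4, 2], "topic2": [4, 4, 3, 4, 4]},
    index=["name1", "name2", "name3", "name4", "name5"],
)
bucket1_topic1 = 2  # target number of rows with topic1 == 4
bucket1_topic2 = 3  # target number of rows with topic2 == 4

selected = []  # index labels of the rows kept so far
counter1 = counter2 = 0

for name, row in data.iterrows():
    hit1 = int(row["topic1"] == 4)
    hit2 = int(row["topic2"] == 4)
    # Assumption: rows without any 4 are not needed in the result
    if hit1 + hit2 == 0:
        continue
    # Skip rows that would overshoot either target
    if counter1 + hit1 > bucket1_topic1 or counter2 + hit2 > bucket1_topic2:
        continue
    selected.append(name)
    counter1 += hit1
    counter2 += hit2
    # Stop as soon as both targets are met
    if counter1 == bucket1_topic1 and counter2 == bucket1_topic2:
        break

result = data.loc[selected]
print(result)
```

Because a row matching both conditions updates both counters in one step, no duplicate check against the list is needed.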

Desired result, where topic1 == 4 appears twice and topic2 == 4 appears 3 times:

        topic1 topic2 
name1    1      4
name2    4      4
name3    4      3
name5    2      4
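
The two targets can be checked mechanically by counting the 4s per column. A minimal sketch on the desired result above; the names `desired`, `count_topic1` and `count_topic2` are made up for this illustration.

```python
import pandas as pd

# The desired result from above, typed in by hand
desired = pd.DataFrame(
    {"topic1": [1, 4, 4, 2], "topic2": [4, 4, 3, 4]},
    index=["name1", "name2", "name3", "name5"],
)

# Comparing a column to 4 gives a boolean Series; sum() counts the True values
count_topic1 = int((desired["topic1"] == 4).sum())
count_topic2 = int((desired["topic2"] == 4).sum())
print(count_topic1, count_topic2)
```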

Can you please describe the condition a little more clearly? "I need a count of 3 values for topic1 and 4 values for topic2 in bucket 1 if they fulfil the condition that they are 4 in my bucket" is a little bit confusing.
– gold_cy, Commented Feb 2, 2020 at 13:56

Parfait · Accepted Answer · 2020-02-02 14:46:52Z

Avoid looping. Instead, shuffle the rows with DataFrame.sample (frac=1 returns 100% of the data frame, in random order), then compute a running count within each group using groupby().cumcount(). Finally, filter with logical subsetting:

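# Shuffle the rows, then add a running count of rows within each topic value
df = (df.sample(frac=1)
        .assign(t1_grp = lambda x: x.groupby(["topic1"]).cumcount(),
                t2_grp = lambda x: x.groupby(["topic2"]).cumcount())
     )

# Keep rows with a value of 1, 2 or 3 in either column, plus the first two
# topic1 == 4 rows and the first three topic2 == 4 rows
final_df = df[(df["topic1"].isin([1,2,3])) | 
              (df["topic2"].isin([1,2,3])) |
              ((df["topic1"] == 4) & (df["t1_grp"] < 2)) |
              ((df["topic2"] == 4) & (df["t2_grp"] < 3))]

# Drop the helper columns
final_df = final_df.drop(columns=["t1_grp", "t2_grp"])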
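
To make the running counts concrete, the sketch below rebuilds the question's sample frame and skips the shuffle so that the output is reproducible. The names `counted`, `kept`, `n_topic1` and `n_topic2` are made up for this illustration. It prints the counters that cumcount() produces, applies the same filter, and counts the 4s per column. Since the conditions are joined with `|`, a row kept through one condition also counts toward the other column, so the totals can exceed the targets of 2 and 3 and are worth checking after filtering.

```python
import pandas as pd

# Sample frame from the question, left unshuffled for a reproducible result
df = pd.DataFrame(
    {"topic1": [1, 4, 4, 4, 2], "topic2": [4, 4, 3, 4, 4]},
    index=["name1", "name2", "name3", "name4", "name5"],
)

# cumcount() numbers the rows of each group 0, 1, 2, ... in order of appearance
counted = df.assign(
    t1_grp=df.groupby("topic1").cumcount(),
    t2_grp=df.groupby("topic2").cumcount(),
)
print(counted)

# Same logical subsetting as in the answer, applied to the unshuffled frame
kept = counted[
    counted["topic1"].isin([1, 2, 3])
    | counted["topic2"].isin([1, 2, 3])
    | ((counted["topic1"] == 4) & (counted["t1_grp"] < 2))
    | ((counted["topic2"] == 4) & (counted["t2_grp"] < 3))
]

# Count how often the value 4 appears per column in the selection
n_topic1 = int((kept["topic1"] == 4).sum())
n_topic2 = int((kept["topic2"] == 4).sum())
print(n_topic1, n_topic2)
```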
