
I am trying to create a new pandas dataframe based on conditions. This is the original dataframe:

        topic1 topic2 
name1    1      4
name2    4      4
name3    4      3
name4    4      4
name5    2      4
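For anyone who wants to reproduce this, the frame above could be rebuilt roughly as follows. Only the printed table is given, so the constructor call is an assumption; the variable name `data` is taken from the loop in the question.

```python
import pandas as pd

# Assumed reconstruction of the printed table; the original construction is not shown.
data = pd.DataFrame(
    {"topic1": [1, 4, 4, 4, 2], "topic2": [4, 4, 3, 4, 4]},
    index=["name1", "name2", "name3", "name4", "name5"],
)
print(data)
```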

I want to select arbitrary rows so that topic1 == 4 appears 2 times and topic2 == 4 appears 3 times in the new dataframe. Once this is fulfilled, I want to stop the code.

bucket1_topic1 = 2
bucket1_topic2 = 3

I wrote this fairly convoluted starter that is 'almost' working, but I am having trouble with rows that fulfil the conditions for both topic1 and topic2. What is a more efficient and correct way to do this?

rows_list = []

counter1 = 0
counter2 = 0

for index,row in data.iterrows():
    if counter1 < bucket1_topic1:
        if row.topic1 == 4:
            counter1 +=1
            rows_list.append([row[1], row.topic1, row.topic2])

    if counter2 < bucket1_topic2:
        if row.topic2 == 4 and row.topic1 !=4:
            counter2 +=1
            if [row[1], row.topic1, row.topic2] not in rows_list:
                rows_list.append([row[1], row.topic1, row.topic2])

Desired result, where topic1 == 4 appears twice and topic2 == 4 appears 3 times:

        topic1 topic2 
name1    1      4
name2    4      4
name3    4      3
name5    2      4
  • Can you please describe the condition a little more clearly? "I need a count of 3 values for topic1 and 4 values for topic2 in bucket 1 if they fulfil the condition that they are 4 in my bucket" is a little bit confusing. Commented Feb 2, 2020 at 13:56
  • made some updates, hope this is clearer now! Commented Feb 2, 2020 at 14:12

1 Answer

Avoid looping. Shuffle the rows arbitrarily with DataFrame.sample (frac=1 returns 100% of the rows, in random order), then compute a running count within each group using groupby().cumcount(). Finally, filter with boolean indexing:

df = (df.sample(frac=1)
        .assign(t1_grp = lambda x: x.groupby(["topic1"]).cumcount(),
                t2_grp = lambda x: x.groupby(["topic2"]).cumcount())
     )

final_df = df[(df["topic1"].isin([1,2,3])) | 
              (df["topic2"].isin([1,2,3])) |
              ((df["topic1"] == 4) & (df["t1_grp"] < 2)) |
              ((df["topic2"] == 4) & (df["t2_grp"] < 3))]

final_df = final_df.drop(columns=["t1_grp", "t2_grp"])
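One thing worth checking with the filter above: the conditions are combined with OR, so a row that already exceeds one quota is still kept if it satisfies any other condition, and the final counts can overshoot. If both quotas have to hold exactly, a single greedy pass is one alternative. The following is a minimal sketch, not the approach from this answer. The helper name `select_with_quotas`, its parameters, and the frame construction are all assumptions for illustration, and it assumes that rows without a 4 in either column should be left out.

```python
import pandas as pd

# Assumed reconstruction of the question's frame.
data = pd.DataFrame(
    {"topic1": [1, 4, 4, 4, 2], "topic2": [4, 4, 3, 4, 4]},
    index=["name1", "name2", "name3", "name4", "name5"],
)


def select_with_quotas(frame, quotas, target=4):
    """Hypothetical helper: greedily keep rows until every column in
    `quotas` holds the requested number of `target` values."""
    counts = {col: 0 for col in quotas}
    keep = []
    for label, row in frame.iterrows():
        # Columns in which this row would count towards a quota.
        hits = [col for col in quotas if row[col] == target]
        # Skip rows that contribute nothing or would overshoot any quota.
        if not hits or any(counts[col] >= quotas[col] for col in hits):
            continue
        for col in hits:
            counts[col] += 1
        keep.append(label)
        if counts == quotas:  # every quota is filled, so stop early
            break
    return frame.loc[keep]


result = select_with_quotas(data, {"topic1": 2, "topic2": 3})
print(result)
```

To pick rows arbitrarily rather than in their original order, pass `data.sample(frac=1)` instead of `data`. Being greedy, the pass can end with a quota unfilled for some row orders, so it is worth comparing the counts in the result against the quotas afterwards.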