I have large data samples (1.6 million rows each) from which I want to delete all rows that do not meet certain conditions.
I have over 1400 different conditions that are tested to see whether they apply; once one applies, I use the following code to delete the matching rows (with a random example data sample):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 100, size=(1600000, 13)), columns=list('ABCDEFGHIJKLM'))
cols = list('ABCDEFGHIJKLM')

# Count, per row, how many values fall in [30, 50] (count() ignores the NaNs left by the mask)
df['Conditions'] = df[(df[cols] >= 30) & (df[cols] <= 50)].count(axis=1)
# Keep only rows where that count is between 2 and 6
df = df[(df["Conditions"] >= 2) & (df["Conditions"] <= 6)]
So for this example loop, values between 30 and 50 should occur at least 2 but at most 6 times per row (all conditions are similar, just with different values). My problem is that this takes a very long time, and since I have 1200 different data samples I would like to find a way to speed up the process. Do you have any suggestions for increasing the speed? I have also tried df.drop, but I found the approach above faster. I appreciate all suggestions.
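One possible speed-up (a sketch under the assumption that all columns are numeric, as in the random example; `vals`, `counts`, and `df_filtered` are names introduced here): do the counting on the raw NumPy array, which avoids building the intermediate masked DataFrame and the extra `Conditions` column entirely.

```python
import numpy as np
import pandas as pd

# Same random example data as in the question
df = pd.DataFrame(np.random.randint(1, 100, size=(1_600_000, 13)),
                  columns=list('ABCDEFGHIJKLM'))

# Work on the raw ndarray: one boolean mask, one row-wise sum
vals = df.to_numpy()
counts = ((vals >= 30) & (vals <= 50)).sum(axis=1)

# Boolean-index the original frame; no 'Conditions' column is created
df_filtered = df[(counts >= 2) & (counts <= 6)]
```

Since the bounds 2 and 6 change per condition, this can be wrapped in a small function taking the value range and count range as parameters and called once per condition.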
Is the Conditions column attached to the data? If not, make it a separate Series to avoid a possible data copy.
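The suggestion above can be sketched like this: `counts` stays a standalone Series, so nothing is written into `df` and no extra column has to be dropped later (the variable name is illustrative).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 100, size=(1_600_000, 13)),
                  columns=list('ABCDEFGHIJKLM'))

# Per-row count kept as a separate Series instead of a new column
counts = ((df >= 30) & (df <= 50)).sum(axis=1)

# Series.between is inclusive on both ends by default
df = df[counts.between(2, 6)]
```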