
I have large data samples (1.6 million rows each) from which I wish to delete all rows that do not fit certain conditions.

I have over 1400 different conditions that are tested to decide whether they apply; once a condition applies, I use the following code to filter the rows (with a random example data sample):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1,100,size=(1600000, 13)), columns=list('ABCDEFGHIJKLM'))

cols = ['A','B','C','D','E','F','G','H','I','J','K','L','M']


df['Conditions'] = df[(df[cols] >= 30) & (df[cols] <= 50)].count(axis=1)
df = df[(df["Conditions"] >= 2) & (df["Conditions"] <= 6)]

So for this example condition, values between 30 and 50 should occur at least 2 but at most 6 times per row (all conditions are similar, just with different values). My problem is that this takes a very long time, and since I have 1200 different data samples, I'd like to find a way to speed up the process. Do you have any suggestions for increasing the speed? I've also tried df.drop, but I find the approach above faster. I appreciate all suggestions.

  • That's a lot of data for Python; I don't think it will run fast without a supercomputer, but I may be mistaken. I have it running in 1.9 s (Python 3.7 64-bit, Jupyter Notebook). I'd do it the same way, but I don't think it will be super fast anyway, unless you filter it separately. Commented Jul 22, 2022 at 8:34
  • Do you need the Conditions column attached to the data? If not, make it a separate Series to avoid a possible data copy. Commented Jul 22, 2022 at 8:36

2 Answers


I just realized that you used count, which is slower because the data has to be copied through the mask. I'd suggest you use sum on the boolean logic instead:

mask = ((df[cols] >= 30) & (df[cols] <= 50)).sum(axis=1)
df = df[mask.between(2, 6)]

This takes about 400 ms on my system, whereas your approach takes about 1 s (including my suggestion from the comments; without it, it's about 2 s).
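To see why the two approaches are interchangeable, here is a small self-contained sketch (sample size and seed are chosen for illustration, not from the question) showing that counting non-NaN cells on the masked frame and summing the boolean mask give identical per-row counts, so the resulting filter is the same:

```python
import numpy as np
import pandas as pd

# Small reproducible sample (hypothetical size; the real data has 1.6M rows)
rng = np.random.default_rng(42)
cols = list('ABCDEFGHIJKLM')
df = pd.DataFrame(rng.integers(1, 100, (1000, len(cols))), columns=cols)

# Original approach: mask the data (creates a float copy with NaNs),
# then count non-NaN cells per row
slow = df[(df[cols] >= 30) & (df[cols] <= 50)].count(axis=1)

# Suggested approach: sum the boolean mask directly, no masked copy
fast = ((df[cols] >= 30) & (df[cols] <= 50)).sum(axis=1)

# Identical per-row counts, so both filters select the same rows
assert slow.equals(fast)
filtered = df[fast.between(2, 6)]
```

The saving comes from never materializing the NaN-masked copy of the data; the boolean frame sums directly to the same counts.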



My idea is to:

  1. Generate each condition as a boolean list.
  2. Concatenate them into an array named cond (each condition creates a row).
  3. Compute the "final" criterion for each source row, i.e. for each column in cond: keep source rows that meet all "partial" criteria. This way you have a single boolean vector (True or False for each row).
  4. Get rows meeting this (final) criterion, using boolean indexing.

Let's generate a smaller source array, with a seeded generator to get repeatable results:

nRows = 30    # Number of rows
cols = ['A','B','C','D','E','F','G','H','I','J','K','L','M']  # Column names
rng = np.random.default_rng(0)  # Seed
# Generate the source DataFrame
df = pd.DataFrame(rng.integers(1, 100, (nRows, len(cols))), columns=cols)

Then assume that you have just 2 "partial" criteria:

  • between 30 and 50: the number of elements in this range must be between 3 and 5 per row (your original criterion),
  • between 50 and 70: the number of elements in this range must also be between 3 and 5.

To express this in code, write:

cond = np.vstack((
        ((df[cols] >= 30) & (df[cols] <= 50)).sum(axis=1).between(3, 5),
        ((df[cols] >= 50) & (df[cols] <= 70)).sum(axis=1).between(3, 5)
    )).all(axis=0)

You can add your other criteria here.
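Since the question mentions over 1400 conditions, writing each line by hand would be impractical; a hedged sketch, assuming each condition can be described by a (low, high, min_count, max_count) tuple (a hypothetical structure, not from the question), builds the stacked criteria in a loop:

```python
import numpy as np
import pandas as pd

# Reproducible sample, same shape as in the answer
rng = np.random.default_rng(0)
cols = list('ABCDEFGHIJKLM')
df = pd.DataFrame(rng.integers(1, 100, (30, len(cols))), columns=cols)

# Hypothetical criteria table: (low, high, min_count, max_count) per condition;
# in the real use case this list would hold all 1400+ conditions
criteria = [
    (30, 50, 3, 5),
    (50, 70, 3, 5),
]

# Build one boolean row per criterion, then require all of them per source row
cond = np.vstack([
    ((df[cols] >= lo) & (df[cols] <= hi)).sum(axis=1).between(nmin, nmax)
    for lo, hi, nmin, nmax in criteria
]).all(axis=0)

result = df[cond]
```

Each iteration still vectorizes over all rows at once; only the stacking is a Python-level loop, so the cost grows with the number of criteria, not the number of rows times criteria.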

Steps:

  • Each line computes a partial criterion (a boolean vector, one element per source row).
  • vstack stacks them vertically, so that we have as many rows as the number of partial criteria and as many columns as source rows.
  • all(axis=0) converts this array into a single boolean vector (one element per source row), True only where every partial criterion holds.

Then, to get the final result, run:

result = df[cond]

The result is:

     A   B   C   D   E   F   G   H   I   J   K   L   M
3   42  40   3   1  13   1  67  53  65  26  61  76  38
4   46  99  80  98  38  68  95  65  84  69  70  39  87
12  60  94   2  46  83  76  41  50  42  53  23  78   8
20  80  59  85  55  76  81   7  56  46  29  45  41  49
25  49  49  97  69  77   1  31  98  27  51  86  64  88

The speed gain of this solution comes from performing the selection on the source array only once.

Your original solution (as I suppose) narrows down the initial DataFrame step by step, creating a new DataFrame each time (over 1000 times), which is why such code runs slowly.

My solution should work faster.

