How to write Pandas DataFrame filter condition from parameters in a function

Question

Let say I have pandas DataFrame like this:

  NO_DOC | NO_SEQ | DESC
  A1     | 1      | A
  B1     | 2      | B
  C1     | 3      | C

I want to search row index from data above on a certain filter: I am able to find it using statement:

idx = data.index[data["NO_DOC"] == "A1" && data["NO_SEQ"] == 1]

but what I want to ask is how to make the "condition" is variable because I want to make it in a function.

Example:

def find(self, data, columns, value):
  return data.index[data[...] == value[...] && ....].tolist()

columns parameters is ["NO_DOC","NO_SEQ"] value is ["A1", 1]

the filter condition above is: data["NO_DOC"] == "A1" && data["NO_SEQ"] == 1

How do I make a function to find row index on pandas DataFrame with a certain filter condition?

Henry Ecker · Accepted Answer · 2021-12-06 15:12:42Z

2

We can use zip with a list comprehension to build a DataFrame from each Series of boolean values:

mask_df = pd.DataFrame([
    data[col] == val for col, val in zip(columns, values)
])

           0      1      2
NO_DOC  True  False  False
NO_SEQ  True  False  False

Then we can reduce this DataFrame of booleans with all to check which columns (indexes) have only True values (logical AND):

mask = mask_df.all()

0     True
1    False
2    False
dtype: bool

Note: logical OR can be achieved with any instead of all

Now we can use the mask to filter the indexes:

data.index[mask].tolist()

[0]

Together find can look something like:

def find(data, columns, values):
    # Create a DataFrame of boolean conditions
    # Take logical AND using DataFrame.all
    mask = pd.DataFrame([
        data[col] == val for col, val in zip(columns, values)
    ]).all()
    # Filter and convert to list
    return data.index[mask].tolist()

Or with np.logical_and and .reduce which behaves in with the same logic as the DataFrame but can be much faster as it does not need to maintain pandas objects:

def find(data, columns, values):
    mask = np.logical_and.reduce(
        [data[col] == val for col, val in zip(columns, values)]
    )
    return data.index[mask].tolist()

Some timing via timeit.

Setup:

import numpy as np
import pandas as pd


def find_with_df_mask(data, columns, values):
    mask = pd.DataFrame([
        data[col] == val for col, val in zip(columns, values)
    ]).all()
    return data.index[mask].tolist()


def find_with_np_mask(data, columns, values):
    mask = np.logical_and.reduce(
        [data[col] == val for col, val in zip(columns, values)]
    )
    return data.index[mask].tolist()


df = pd.concat([pd.DataFrame({
    'NO_DOC': ['A1', 'B1', 'C1'],
    'NO_SEQ': [1, 2, 3],
    'DESC': ['A', 'B', 'C']
})] * 1000, ignore_index=True)
cols = ["NO_DOC", "NO_SEQ"]
vals = ["A1", 1]

Timings:

%timeit find_with_df_mask(df, cols, vals)
21.9 ms ± 65.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit find_with_np_mask(df, cols, vals)
319 µs ± 253 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

edited Dec 6, 2021 at 15:12

answered Dec 4, 2021 at 4:06

Henry Ecker♦

35.8k19 gold badges48 silver badges67 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Ongky Over a year ago

Thanks, it works. But unfortunately, with thousands of rows of data, it is kind of slow since it is like creating a new DataFrame for the mask. Is there any technics to fast retrieve row index that match the data pass to a function?

Henry Ecker Over a year ago

Numpy can be faster. See the updated solution (middle section) I've also added timings to show relative performance.

Srinivaas M.B. · Accepted Answer · 2021-12-04 23:54:52Z

1

To write Pandas dataframe filter condition from parameters in a function:

import pandas as pd

raw_data = {'NO_DOC':['A1','B1','C1'],'NO_SEQ':[1,2,3],'DESC':['A','B','C']}
data = pd.DataFrame(raw_data)

columns = ["NO_DOC", "NO_SEQ"]
value = ["A1", 1]

Create masks and send them to functions instead of columns and values:

mask1 = data[columns[0]] == value[0] 
mask2 = data[columns[1]] == value[1]

def find(data, mask1,mask2):
    return data.index[mask1 & mask2].tolist()

Call your function as follows:

find(data, mask1, mask2)

edited Dec 4, 2021 at 23:54

user17242583

answered Dec 4, 2021 at 3:10

Srinivaas M.B.

111 bronze badge

Collectives™ on Stack Overflow

How to write Pandas DataFrame filter condition from parameters in a function

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related