
I'm wondering how to ensure that all rows in a dataframe contain a particular set of values.

For example:

import pandas as pd

VALUES = [1, 2]
df_no = pd.DataFrame(
    {
        "a": [1],
        "b": [1],
    }
)
df_yes = pd.DataFrame(
    {
        "a": [1],
        "b": [2],
        "c": [3],
    }
)

Here df_no doesn't contain all of the values in VALUES in each of its rows, whereas df_yes does.

One approach is the following:

# check df_no
all(
    [
        all(value in row for value in VALUES)
        for row in df_no.apply(lambda x: x.unique(), axis=1)
    ]
)
# returns False

# check df_yes
all(
    [
        all(value in row for value in VALUES)
        for row in df_yes.apply(lambda x: x.unique(), axis=1)
    ]
)
# returns True

I feel as though this approach might not be so clear, and that there might be a more idiomatic way of going about things.

2 Answers


Use issubset in a generator expression:

s = set(VALUES)
print(all(s.issubset(x) for x in df_no.to_numpy()))
False

s = set(VALUES)
print(all(s.issubset(x) for x in df_yes.to_numpy()))
True

Which is faster? It depends on the data:

VALUES = [1, 2]

df = pd.DataFrame(
    {
        "a": [1,2,8],
        "b": [2,8,2],
        "c": [3,1,1],
    }
)

# 30k rows
df = pd.concat([df] * 10000, ignore_index=True)
print(df)


In [171]: %%timeit
     ...: s = set(VALUES)
     ...: all(s.issubset(x) for x in df.to_numpy())
     ...: 
     ...: 
55.9 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [172]: %%timeit
     ...: vals = set(VALUES)
     ...: df.apply(vals.issubset, axis=1).all()
     ...: 
     ...: 
211 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# 3k rows (rebuilding from the original 3-row frame)
df = pd.concat([df] * 1000, ignore_index=True)
print(df)



In [174]: %%timeit
     ...: s = set(VALUES)
     ...: all(s.issubset(x) for x in df.to_numpy())
     ...: 
     ...: 
5.46 ms ± 76.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [175]: %%timeit
     ...: vals = set(VALUES)
     ...: df.apply(vals.issubset, axis=1).all()
     ...: 
     ...: 
21.5 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
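For comparison, a fully vectorized alternative (not part of the original answer, and not benchmarked here) avoids iterating rows in Python at all: for each required value, check that every row contains it somewhere.

```python
import pandas as pd

VALUES = [1, 2]
df_no = pd.DataFrame({"a": [1], "b": [1]})
df_yes = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

def rows_contain_all(df, values):
    # (df == v).any(axis=1) marks the rows that contain v;
    # .all() then requires every row to contain that value.
    return all((df == v).any(axis=1).all() for v in values)

print(rows_contain_all(df_no, VALUES))   # False
print(rows_contain_all(df_yes, VALUES))  # True
```

This does one vectorized comparison per required value, so it should scale well when VALUES is short relative to the number of rows.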


11 Comments

@baxx - That was a wrong answer :( Working on a correct one.
@baxx - It didn't test each row separately; it tested all the data together.
ah of course, yeah it's important that it's each row
is a list comprehension preferable to apply? I'm not sure if there are any optimisations (or slowness) with apply that I'm unaware of
now it looks more or less like my answer :(
1

You can use Python sets and issubset:

vals = set(VALUES)
df_yes.apply(lambda x: vals.issubset(set(x)), axis=1).all()

A shorter version, since issubset accepts any iterable (including a row Series) directly:

vals = set(VALUES)
df_yes.apply(vals.issubset, axis=1).all()
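Put together as a self-contained example (using the VALUES and frames from the question):

```python
import pandas as pd

VALUES = [1, 2]
df_no = pd.DataFrame({"a": [1], "b": [1]})
df_yes = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

vals = set(VALUES)
# apply with axis=1 passes each row as a Series; issubset accepts
# any iterable, so the row can be handed to it directly.
print(df_no.apply(vals.issubset, axis=1).all())   # False
print(df_yes.apply(vals.issubset, axis=1).all())  # True
```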

