
I'm wondering how to ensure that all rows in a dataframe contain a particular set of values.

For example:

import pandas as pd

VALUES = [1, 2]
df_no = pd.DataFrame(
    {
        "a": [1],
        "b": [1],
    }
)
df_yes = pd.DataFrame(
    {
        "a": [1],
        "b": [2],
        "c": [3],
    }
)

Here df_no doesn't contain all of the values in VALUES in each of its rows, whereas df_yes does.

One approach is the following:

# check df_no
all(
    [
        all(value in row for value in VALUES)
        for row in df_no.apply(lambda x: x.unique(), axis=1)
    ]
)
# returns False

# check df_yes
all(
    [
        all(value in row for value in VALUES)
        for row in df_yes.apply(lambda x: x.unique(), axis=1)
    ]
)
# returns True

I feel as though this approach might not be so clear, and that there might be a more idiomatic way of going about things.

2 Answers


Use issubset in a generator expression:

s = set(VALUES)
print(all(s.issubset(x) for x in df_no.to_numpy()))
False

s = set(VALUES)
print(all(s.issubset(x) for x in df_yes.to_numpy()))
True

Which is faster? It depends on the data:

VALUES = [1, 2]

df = pd.DataFrame(
    {
        "a": [1,2,8],
        "b": [2,8,2],
        "c": [3,1,1],
    }
)

# 30k rows
df = pd.concat([df] * 10000, ignore_index=True)
print(df)


In [171]: %%timeit
     ...: s = set(VALUES)
     ...: all(s.issubset(x) for x in df.to_numpy())
     ...: 
     ...: 
55.9 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [172]: %%timeit
     ...: vals = set(VALUES)
     ...: df.apply(vals.issubset, axis=1).all()
     ...: 
     ...: 
211 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# 3k rows (rebuilding from the original 3-row frame)
df = pd.concat([df] * 1000, ignore_index=True)
print(df)



In [174]: %%timeit
     ...: s = set(VALUES)
     ...: all(s.issubset(x) for x in df.to_numpy())
     ...: 
     ...: 
5.46 ms ± 76.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [175]: %%timeit
     ...: vals = set(VALUES)
     ...: df.apply(vals.issubset, axis=1).all()
     ...: 
     ...: 
21.5 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
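For comparison, a fully vectorized alternative (not part of the original answer, and not benchmarked here) avoids iterating rows in Python at all: for each required value, check that every row contains it somewhere.

```python
import pandas as pd

VALUES = [1, 2]
df_no = pd.DataFrame({"a": [1], "b": [1]})
df_yes = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

def rows_contain_all(df, values):
    # (df == v).any(axis=1) marks the rows that contain v;
    # .all() then requires every row to contain that value.
    return all((df == v).any(axis=1).all() for v in values)

print(rows_contain_all(df_no, VALUES))   # False
print(rows_contain_all(df_yes, VALUES))  # True
```

This does one vectorized comparison per required value, so it should scale well when VALUES is short relative to the number of rows.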


11 Comments

@baxx - That was a wrong answer :( Working on a correct one.
@baxx - It didn't test each row separately; it tested all the data together.
ah of course, yeah it's important that it's each row
is a list comprehension preferable to apply? I'm not sure if there are any optimisations (or slowness) with apply that I'm unaware of
now it looks more or less like my answer :(
1

You can use Python sets and issubset:

vals = set(VALUES)
df_yes.apply(lambda x: vals.issubset(set(x)), axis=1).all()

A shorter version, since issubset accepts any iterable (including a row Series) directly:

vals = set(VALUES)
df_yes.apply(vals.issubset, axis=1).all()
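Put together as a self-contained example (using the VALUES and frames from the question):

```python
import pandas as pd

VALUES = [1, 2]
df_no = pd.DataFrame({"a": [1], "b": [1]})
df_yes = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

vals = set(VALUES)
# apply with axis=1 passes each row as a Series; issubset accepts
# any iterable, so the row can be handed to it directly.
print(df_no.apply(vals.issubset, axis=1).all())   # False
print(df_yes.apply(vals.issubset, axis=1).all())  # True
```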

