
I have a Pandas dataframe that contains two columns that contain either lists of items or NaN values. An illustrative example could be generated using:

import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['ab','abc','de','def','ghi','jkl','mno','pqr','stw','stu'],
                   'colB': ['abcd','bcde','defg','defh','ghijk','j','mnp','pq','stuw','sut']})


df['colA'] = df['colA'].apply(lambda x: list(x))
df['colB'] = df['colB'].apply(lambda x: list(x))

df.at[3,'colB'] = np.nan
df.at[8,'colB'] = np.nan

... which looks like:

        colA             colB
0     [a, b]     [a, b, c, d]
1  [a, b, c]     [b, c, d, e]
2     [d, e]     [d, e, f, g]
3  [d, e, f]              NaN
4  [g, h, i]  [g, h, i, j, k]
5  [j, k, l]              [j]
6  [m, n, o]        [m, n, p]
7  [p, q, r]           [p, q]
8  [s, t, w]              NaN
9  [s, t, u]        [s, u, t]

I want to perform a variety of tasks on the pairs of lists (e.g. using NLTK's jaccard_distance() function), but only if colB does not contain NaN.

The following command works well if there are no NaN values:

import nltk

df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])),axis = 1)

However, if colB contains a NaN, the following error is produced:

TypeError: ("'float' object is not iterable", 'occurred at index 3')

I've tried to use an if...else clause to run the function only on rows where colB does not contain a NaN:

df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])) if pd.notnull(x['colB']) else np.nan,axis = 1)

... but this produces an error:

ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index 0')

I've also tried to use the .any() and .all() constructs as suggested in the error but to no avail.

It seems that passing a list to pd.notnull() causes confusion because pd.notnull() tests each element of the list individually, whereas what I want is to treat the entire contents of the dataframe cell as either NaN or not.
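To make the elementwise behaviour concrete, here is a small sketch of what pd.notnull() returns when handed a list versus a bare NaN (this is my own illustration, not part of the original question):

```python
import numpy as np
import pandas as pd

# On a list, pd.notnull() returns a boolean array, one entry per element;
# truth-testing such an array is what raises the "ambiguous" ValueError.
print(pd.notnull(['a', 'b', 'c']))  # [ True  True  True]

# On a scalar NaN, it returns a single boolean.
print(pd.notnull(np.nan))  # False
```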

My question is how can I identify whether a cell in a Pandas dataframe contains a NaN value so that the lambda function can be applied only to those cells that do not contain NaN?

2 Answers


You can filter the rows to only those with non-missing values:

f = lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB']))
m = df['colB'].notna()
df.loc[m, 'jd'] = df[m].apply(f,axis = 1)
print (df)
        colA             colB        jd
0     [a, b]     [a, b, c, d]  0.500000
1  [a, b, c]     [b, c, d, e]  0.600000
2     [d, e]     [d, e, f, g]  0.500000
3  [d, e, f]              NaN       NaN
4  [g, h, i]  [g, h, i, j, k]  0.400000
5  [j, k, l]              [j]  0.666667
6  [m, n, o]        [m, n, p]  0.500000
7  [p, q, r]           [p, q]  0.333333
8  [s, t, w]              NaN       NaN
9  [s, t, u]        [s, u, t]  0.000000

The reason: when checking for missing values, lists are checked elementwise:

df['jd'] = df.apply(lambda x: pd.notna(x['colB']), axis = 1)
print (df)
        colA             colB                              jd
0     [a, b]     [a, b, c, d]        [True, True, True, True]
1  [a, b, c]     [b, c, d, e]        [True, True, True, True]
2     [d, e]     [d, e, f, g]        [True, True, True, True]
3  [d, e, f]              NaN                           False
4  [g, h, i]  [g, h, i, j, k]  [True, True, True, True, True]
5  [j, k, l]              [j]                          [True]
6  [m, n, o]        [m, n, p]              [True, True, True]
7  [p, q, r]           [p, q]                    [True, True]
8  [s, t, w]              NaN                           False
9  [s, t, u]        [s, u, t]              [True, True, True]
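By contrast, calling .notna() on the Series (rather than on an individual cell) evaluates each cell as a single value, which is why the mask approach works. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Object-dtype Series holding list cells and a NaN cell.
s = pd.Series([['a', 'b'], np.nan, ['c']])

# Series.notna() treats each cell as a whole: list cells are not missing.
print(s.notna().tolist())  # [True, False, True]
```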

1 Comment

Ah yes, of course! Sometimes tunnel-vision sets in and I can't see the easiest solution. Thanks for the (ultra-fast!) response.

It dawned on me as I was writing the question that instead of testing if the content of the cell was not NaN, I could test whether the content of the cell was a list. D'oh! I used the following:

df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])) if isinstance(x['colB'],list) else np.nan,axis = 1)

This works as required and produces the output:

        colA             colB        jd
0     [a, b]     [a, b, c, d]  0.500000
1  [a, b, c]     [b, c, d, e]  0.600000
2     [d, e]     [d, e, f, g]  0.500000
3  [d, e, f]              NaN       NaN
4  [g, h, i]  [g, h, i, j, k]  0.400000
5  [j, k, l]              [j]  0.666667
6  [m, n, o]        [m, n, p]  0.500000
7  [p, q, r]           [p, q]  0.333333
8  [s, t, w]              NaN       NaN
9  [s, t, u]        [s, u, t]  0.000000

But jezrael's answer (to filter on NaN beforehand) is probably the most logical approach.

Nevertheless, I would still like to know if there's a way to test for NaN explicitly.
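For the record, an explicit scalar test is possible because a missing cell here holds a float NaN, and math.isnan() only accepts floats. A minimal sketch (the helper name cell_is_nan is mine, not from any library):

```python
import math
import numpy as np

def cell_is_nan(x):
    # Lists fail the isinstance check, so math.isnan() is only ever
    # called on floats; it returns True for a scalar NaN.
    return isinstance(x, float) and math.isnan(x)

print(cell_is_nan(np.nan))      # True
print(cell_is_nan(['a', 'b']))  # False
```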
