
I have a Pandas dataframe that contains two columns that contain either lists of items or NaN values. An illustrative example could be generated using:

import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['ab','abc','de','def','ghi','jkl','mno','pqr','stw','stu'],
                   'colB': ['abcd','bcde','defg','defh','ghijk','j','mnp','pq','stuw','sut']})


df['colA'] = df['colA'].apply(lambda x: list(x))
df['colB'] = df['colB'].apply(lambda x: list(x))

df.at[3,'colB'] = np.nan
df.at[8,'colB'] = np.nan

... which looks like:

        colA             colB
0     [a, b]     [a, b, c, d]
1  [a, b, c]     [b, c, d, e]
2     [d, e]     [d, e, f, g]
3  [d, e, f]              NaN
4  [g, h, i]  [g, h, i, j, k]
5  [j, k, l]              [j]
6  [m, n, o]        [m, n, p]
7  [p, q, r]           [p, q]
8  [s, t, w]              NaN
9  [s, t, u]        [s, u, t]

I want to perform a variety of tasks on the pairs of lists (e.g. using NLTK's jaccard_distance() function), but only if colB does not contain NaN.

The following command works well if there are no NaN values:

import nltk

df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])),axis = 1)

However, if colB contains a NaN, the following error is produced:

TypeError: ("'float' object is not iterable", 'occurred at index 3')

I've tried to use an if...else clause to run the function only on rows where colB does not contain a NaN:

df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])) if pd.notnull(x['colB']) else np.nan,axis = 1)

... but this produces an error:

ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index 0')

I've also tried to use the .any() and .all() constructs as suggested in the error but to no avail.

It seems that passing a list to pd.notnull() causes confusion because pd.notnull() tests each element of the list individually, whereas what I want is to treat the entire contents of the dataframe cell as either NaN or not.
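To make the elementwise behaviour concrete, here is a small sketch of what pd.notnull() returns when handed a list versus a bare NaN (this is my own illustration, not part of the original question):

```python
import numpy as np
import pandas as pd

# On a list, pd.notnull() returns a boolean array, one entry per element;
# truth-testing such an array is what raises the "ambiguous" ValueError.
print(pd.notnull(['a', 'b', 'c']))  # [ True  True  True]

# On a scalar NaN, it returns a single boolean.
print(pd.notnull(np.nan))  # False
```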

My question is how can I identify whether a cell in a Pandas dataframe contains a NaN value so that the lambda function can be applied only to those cells that do not contain NaN?

2 Answers


You can filter the rows to only those with non-missing values:

f = lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB']))
m = df['colB'].notna()
df.loc[m, 'jd'] = df[m].apply(f,axis = 1)
print (df)
        colA             colB        jd
0     [a, b]     [a, b, c, d]  0.500000
1  [a, b, c]     [b, c, d, e]  0.600000
2     [d, e]     [d, e, f, g]  0.500000
3  [d, e, f]              NaN       NaN
4  [g, h, i]  [g, h, i, j, k]  0.400000
5  [j, k, l]              [j]  0.666667
6  [m, n, o]        [m, n, p]  0.500000
7  [p, q, r]           [p, q]  0.333333
8  [s, t, w]              NaN       NaN
9  [s, t, u]        [s, u, t]  0.000000

The reason: when checking for missing values, lists are checked elementwise:

df['jd'] = df.apply(lambda x: pd.notna(x['colB']), axis = 1)
print (df)
        colA             colB                              jd
0     [a, b]     [a, b, c, d]        [True, True, True, True]
1  [a, b, c]     [b, c, d, e]        [True, True, True, True]
2     [d, e]     [d, e, f, g]        [True, True, True, True]
3  [d, e, f]              NaN                           False
4  [g, h, i]  [g, h, i, j, k]  [True, True, True, True, True]
5  [j, k, l]              [j]                          [True]
6  [m, n, o]        [m, n, p]              [True, True, True]
7  [p, q, r]           [p, q]                    [True, True]
8  [s, t, w]              NaN                           False
9  [s, t, u]        [s, u, t]              [True, True, True]
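By contrast, calling .notna() on the Series (rather than on an individual cell) evaluates each cell as a single value, which is why the mask approach works. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Object-dtype Series holding list cells and a NaN cell.
s = pd.Series([['a', 'b'], np.nan, ['c']])

# Series.notna() treats each cell as a whole: list cells are not missing.
print(s.notna().tolist())  # [True, False, True]
```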

1 Comment

Ah yes, of course! Sometimes tunnel-vision sets in and I can't see the easiest solution. Thanks for the (ultra-fast!) response.

It dawned on me as I was writing the question that instead of testing if the content of the cell was not NaN, I could test whether the content of the cell was a list. D'oh! I used the following:

df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])) if isinstance(x['colB'],list) else np.nan,axis = 1)

This works as required and produces the output:

        colA             colB        jd
0     [a, b]     [a, b, c, d]  0.500000
1  [a, b, c]     [b, c, d, e]  0.600000
2     [d, e]     [d, e, f, g]  0.500000
3  [d, e, f]              NaN       NaN
4  [g, h, i]  [g, h, i, j, k]  0.400000
5  [j, k, l]              [j]  0.666667
6  [m, n, o]        [m, n, p]  0.500000
7  [p, q, r]           [p, q]  0.333333
8  [s, t, w]              NaN       NaN
9  [s, t, u]        [s, u, t]  0.000000

But jezrael's answer (to filter on NaN beforehand) is probably the most logical approach.

Nevertheless, I would still like to know if there's a way to test for NaN explicitly.
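For the record, an explicit scalar test is possible because a missing cell here holds a float NaN, and math.isnan() only accepts floats. A minimal sketch (the helper name cell_is_nan is mine, not from any library):

```python
import math
import numpy as np

def cell_is_nan(x):
    # Lists fail the isinstance check, so math.isnan() is only ever
    # called on floats; it returns True for a scalar NaN.
    return isinstance(x, float) and math.isnan(x)

print(cell_is_nan(np.nan))      # True
print(cell_is_nan(['a', 'b']))  # False
```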
