I have a Pandas dataframe with two columns whose cells contain either a list of items or NaN. An illustrative example can be generated using:
import numpy as np
import pandas as pd
df = pd.DataFrame({'colA':['ab','abc','de','def','ghi','jkl','mno','pqr','stw','stu'],
'colB':['abcd','bcde','defg','defh','ghijk','j','mnp','pq','stuw','sut'] })
df['colA'] = df['colA'].apply(lambda x: list(x))
df['colB'] = df['colB'].apply(lambda x: list(x))
df.at[3,'colB'] = np.nan
df.at[8,'colB'] = np.nan
... which looks like:
   colA       colB
0  [a, b]     [a, b, c, d]
1  [a, b, c]  [b, c, d, e]
2  [d, e]     [d, e, f, g]
3  [d, e, f]  NaN
4  [g, h, i]  [g, h, i, j, k]
5  [j, k, l]  [j]
6  [m, n, o]  [m, n, p]
7  [p, q, r]  [p, q]
8  [s, t, w]  NaN
9  [s, t, u]  [s, u, t]
I want to perform a variety of tasks on the pairs of lists (e.g. using NLTK's jaccard_distance() function), but only if colB does not contain NaN.
The following command works well if there are no NaN values:
import nltk
df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])),axis = 1)
However, if colB contains a NaN, the following error is produced:
TypeError: ("'float' object is not iterable", 'occurred at index 3')
I've tried to use an if...else clause to only run the function on rows where colB does not contain a NaN:
df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])) if pd.notnull(x['colB']) else np.nan,axis = 1)
... but this produces an error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index 0')
I've also tried to use the .any() and .all() constructs as suggested in the error but to no avail.
It seems that passing a list to pd.notnull() is the cause: pd.notnull() tests each element of the list and returns an array of booleans, whereas I want the whole content of the dataframe cell to be treated as either NaN or not NaN.
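To illustrate what I mean, here is a minimal sketch of the behaviour on a two-row version of the example. The helper name cell_is_list is my own invention, and treating "the cell holds a list" as equivalent to "the cell is not NaN" is an assumption that happens to fit this data; I am not sure it is the idiomatic check.

```python
import numpy as np
import pandas as pd

# On a list, pd.notnull() tests each element and returns an array,
# which is why using it directly in an if-condition is ambiguous.
elementwise = pd.notnull(['a', 'b', 'c'])
print(elementwise)

# On a scalar NaN, it returns a single boolean.
scalar = pd.notnull(np.nan)
print(scalar)

def cell_is_list(cell):
    """Hypothetical helper: True only when the cell holds a list."""
    return isinstance(cell, list)

# Two-row version of the example: one list cell and one NaN cell in colB.
df = pd.DataFrame({'colA': [list('ab'), list('def')],
                   'colB': [list('abcd'), np.nan]})

# One boolean per cell, regardless of how long each list is.
mask = df['colB'].apply(cell_is_list)
print(mask.tolist())
```

A per-cell check of this kind gives exactly one True/False per row, which is the behaviour I would like to get inside the lambda.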
My question is: how can I identify whether a cell in a Pandas dataframe contains a NaN value, so that the lambda function is applied only to cells that do not contain NaN?