I have multiple columns named titles1, titles2, and so on up to titles74. I would like to extract the 6-digit number from each of these columns, where such a number exists, and place it in a new column named global_id. Some titles columns are empty, or rather hold the string 'nan'.
This is what I have written thus far:
def titles_split(df,col):
    df[col] = df[col].astype('str')
    return df[col].str.extract('(\d{6})')

for i in range(1,75):
    if (df_split['titles'+str(i)] == 'nan') == False:
        df_split['global_id'] = titles_split(df_split,'titles'+str(i))
So I would like to take the 6-digit number and place it in a column named global_id, but only where the titles column does not hold the string 'nan'.
However, this returns the following error message:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
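For context, the error is raised by the `if` line. Comparing a whole column to 'nan' gives a boolean Series, and comparing that to False gives another Series, which `if` cannot reduce to a single True or False. A minimal sketch of this, using a small hypothetical column `s` rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for one titles column
s = pd.Series(['nan', 'lightweight 870590 FALSE', 'nan'])

mask = (s == 'nan')          # element-wise comparison -> boolean Series
print(mask.tolist())

try:
    if mask == False:        # still a Series, so `if` cannot decide
        pass
except ValueError as exc:
    error_message = str(exc)  # "The truth value of a Series is ambiguous..."

# Reducing the Series to one bool makes the test well-defined
has_real_value = (s != 'nan').any()
```

Whether `.any()` or `.all()` is the right reduction depends on what the check is meant to express; here `.any()` asks whether at least one row holds something other than 'nan'.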
Here is a sample of my data:
{'titles1': {0: 'nan',
1: 'nan',
2: 'nan',
3: 'nan',
4: ':[]}] 3/16/2019 lightweight 870590 FALSE nan Cristopher di Girolamo Italy 1 [["career" \\n \\n2019-2019\\n]] /build/images/main/avatar.jpeg [[1153 2] [21 29]] 98 nan Miami Flor'},
'titles2': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
'titles3': {0: 'nan',
1: ':[]}] 2/13/2016 cruiserweight 746272 FALSE nan Alvin Davie USA 3 [["career" \\n \\n2016-2019\\n]] /build/images/main/avatar.jpeg [[555 1140] [110 226]] 98 nan Miami Flor',
2: 'nan',
3: 'nan',
4: 'nan'},
'titles4': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
'titles5': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
'titles6': {0: ':[]}] 10/10/2015 heavyweight 734308 FALSE [6 2 188] Joseph White USA 6 [["career" \\n \\n2015-2019\\n]] https://boxrec.com/media/images/thumb/9/9c/734308.jpeg/200px-734308.jpeg [[679 1311] [180 350]] 98 nan Miami Flor',
1: 'nan',
2: ':[]}] 2/24/2018 heavyweight 827050 FALSE [6 4 193] Anthony Martinez USA 6 [["career" \\n \\n2018-2019\\n]] https://boxrec.com/media/images/thumb/c/cb/AnthonyMartinez.jpg/200px-AnthonyMartinez.jpg [[648 1311] [171 350]] 98 [78 198] Miami Flor',
3: 'nan',
4: 'nan'}}
Update:
I managed to get rid of the initial error by replacing `== False` with `is False`. However, the problem now is that I get NaN values for all rows in the new global_id column.
So this is what I am doing now:
def titles_split(df,col):
    return df[col].str.extractall('(\d{6})')

for i in range(1,75):
    if (df_split['titles'+str(i)] == 'nan') is False:
        df_split['global_id'] = titles_split(df_split,'titles'+str(i))
This is the output of the global_id column:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
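One thing worth noting about the updated code: `is False` is an identity check, and a Series object is never the `False` singleton, so the condition is always False and the loop body never runs. Even with the condition removed, each iteration would overwrite global_id with the result for the latest column. A minimal sketch of one alternative, assuming a small hypothetical frame shaped like the sample data: extract from every titles column at once, then keep the first match found in each row.

```python
import pandas as pd

# Small hypothetical frame shaped like the sample data
df_split = pd.DataFrame({
    'titles1': ['nan', 'nan', 'lightweight 870590 FALSE'],
    'titles2': ['nan', 'cruiserweight 746272 FALSE', 'nan'],
    'titles3': ['heavyweight 734308 FALSE', 'nan', 'nan'],
})

# Identity test between a Series and False: always False
cond = (df_split['titles1'] == 'nan') is False
print(cond)

title_cols = [c for c in df_split.columns if c.startswith('titles')]

# Extract the first 6-digit run from every titles column. The string
# 'nan' contains no digits, so those cells simply come back as missing.
extracted = df_split[title_cols].apply(
    lambda col: col.astype(str).str.extract(r'(\d{6})', expand=False)
)

# For each row, take the first non-missing match across the columns
df_split['global_id'] = extracted.bfill(axis=1).iloc[:, 0]
print(df_split['global_id'].tolist())
```

The pattern `\d{6}` also matches the first six digits of any longer number, so a stricter pattern may be needed if the real data contains such values.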
In titles_split(), use re.findall(r'\d{6}', test_string), where test_string is the value from df[col].str.
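A rough sketch of how the `re.findall` suggestion might look when applied per cell. The helper name `first_six_digits` and the sample column `col` are hypothetical, introduced only for illustration:

```python
import re
import pandas as pd

def first_six_digits(test_string):
    """Return the first 6-digit run in test_string, or None if there is none."""
    matches = re.findall(r'\d{6}', str(test_string))
    return matches[0] if matches else None

# Hypothetical column standing in for df[col]
col = pd.Series(['nan', ':[]}] 2/24/2018 heavyweight 827050 FALSE'])

# Apply the helper to each cell; cells without a match come back missing
ids = col.apply(first_six_digits)
```

This does the same job as `str.extract` but makes the per-cell logic explicit, which can be easier to adapt if several matches per cell need to be handled.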