I have multiple columns named titles1, titles2, and so on. I would like to extract a 6-digit number from each of these columns where one exists and place it in a new column named global_id. Some of the titles columns are empty, or rather contain the string 'nan'.

This is what I have written thus far:

def titles_split(df,col):
    df[col] = df[col].astype('str')
    return df[col].str.extract('(\d{6})')
for i in range(1,75):
    if (df_split['titles'+str(i)] == 'nan') == False:
        df_split['global_id'] = titles_split(df_split,'titles'+str(i))

So I would like to take the 6-digit number and place it in a column named global_id only if the column does not contain the string 'nan'.

However, this returns the following error message:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Here is a sample of my data:

        {'titles1': {0: 'nan',
  1: 'nan',
  2: 'nan',
  3: 'nan',
  4: ':[]}] 3/16/2019 lightweight 870590 FALSE nan Cristopher di Girolamo Italy 1 [["career"         \\n        \\n2019-2019\\n]] /build/images/main/avatar.jpeg [[1153 2] [21 29]] 98 nan Miami  Flor'},
 'titles2': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
 'titles3': {0: 'nan',
  1: ':[]}] 2/13/2016 cruiserweight 746272 FALSE nan Alvin Davie USA 3 [["career"         \\n        \\n2016-2019\\n]] /build/images/main/avatar.jpeg [[555 1140] [110 226]] 98 nan Miami  Flor',
  2: 'nan',
  3: 'nan',
  4: 'nan'},
 'titles4': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
 'titles5': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
 'titles6': {0: ':[]}] 10/10/2015 heavyweight 734308 FALSE [6 2 188] Joseph White USA 6 [["career"         \\n        \\n2015-2019\\n]] https://boxrec.com/media/images/thumb/9/9c/734308.jpeg/200px-734308.jpeg [[679 1311] [180 350]] 98 nan Miami  Flor',
  1: 'nan',
  2: ':[]}] 2/24/2018 heavyweight 827050 FALSE [6 4 193] Anthony Martinez USA 6 [["career"         \\n        \\n2018-2019\\n]] https://boxrec.com/media/images/thumb/c/cb/AnthonyMartinez.jpg/200px-AnthonyMartinez.jpg [[648 1311] [171 350]] 98 [78 198] Miami  Flor',
  3: 'nan',
  4: 'nan'}}

Update:

I managed to get rid of the initial error by replacing == with is; however, the problem now is that I get NaN values for all rows in the new global_id column.

This is what I am doing now:

def titles_split(df,col):
    return df[col].str.extractall('(\d{6})')
for i in range(1,75):
    if (df_split['titles'+str(i)] == 'nan') is False:
        df_split['global_id'] = titles_split(df_split,'titles'+str(i))

This is the output of the global_id column:

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
     ... 

  • Format your data properly; it's hard to tell which value belongs to which column. Also add the complete error message and the expected output. Commented Nov 18, 2019 at 9:42
  • You could use regex: pandas.Series.str.extract or pandas.Series.str.extractall. pandas.pydata.org/pandas-docs/stable/reference/api/… Commented Nov 18, 2019 at 9:44
  • What do you mean by the line "df_split['titles'+str(i)] == 'nan'"? Comparing a whole Series with 'nan' in an if statement is meaningless. Commented Nov 18, 2019 at 10:09
  • Try this regex method inside the function titles_split(): re.findall(r'\d{6}', test_string), where test_string is the value from df[col].str. Commented Nov 18, 2019 at 10:14
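To illustrate the point raised in the comments, here is a minimal sketch of why both versions of the `if` fail and one way to write the loop without it. The two-column frame is made-up illustration data, not the asker's real frame, and accumulating the result with `fillna` is only one possible approach.

```python
import pandas as pd

# Hypothetical miniature of df_split (two title columns instead of 74).
df_split = pd.DataFrame({
    "titles1": ["nan", "nan", "x 870590 y"],
    "titles2": ["a 746272 b", "nan", "nan"],
})

# Comparing a column to a string yields one boolean per row, not a single bool,
# so `if mask == False:` raises "The truth value of a Series is ambiguous".
mask = df_split["titles1"] == "nan"

# `is` tests object identity; a Series is never the singleton False,
# so this is always False and the body of the original `if` never runs.
never_true = mask is False

# No `if` is needed: str.extract already returns NaN for rows without a match.
global_id = None
for i in range(1, 3):
    found = df_split[f"titles{i}"].str.extract(r"(\d{6})", expand=False)
    # Keep ids found so far; fill the remaining gaps from the current column.
    global_id = found if global_id is None else global_id.fillna(found)

df_split["global_id"] = global_id
print(df_split["global_id"].tolist())
```

Note that `str.extractall`, used in the update, returns a frame indexed by (row, match number), so its result does not line up with the original row index when assigned to a single column.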

2 Answers

Using pandas str functions:

df['global_id'] = df.loc[:, df.columns].apply(str, axis=1).str.extract(r'.*(\d{6})')
df

                                                 titles1 titles2                                            titles3 titles4 titles5                                            titles6 global_id
0                                                nan     nan                                                nan     nan     nan  :[]}] 10/10/2015 heavyweight 734308 FALSE [6 2...    734308
1                                                nan     nan  :[]}] 2/13/2016 cruiserweight 746272 FALSE nan...     nan     nan                                                nan    746272
2                                                nan     nan                                                nan     nan     nan  :[]}] 2/24/2018 heavyweight 827050 FALSE [6 4 ...    827050
3                                                nan     nan                                                nan     nan     nan                                                nan       NaN
4  :[]}] 3/16/2019 lightweight 870590 FALSE nan C...     nan                                                nan     nan     nan                                                nan    870590
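One caveat with the one-liner above: `apply(str, axis=1)` searches the printed form of each row, which pandas may abbreviate according to its display options (long strings, or rows with many columns), and the greedy `.*` keeps the last 6-digit run rather than the first. A possible alternative, sketched here on made-up sample data, is to extract per column and take the first match in each row:

```python
import pandas as pd

# Hypothetical sample with the same layout as the question's data.
df = pd.DataFrame({
    "titles1": ["nan", "nan", "lightweight 870590 FALSE"],
    "titles2": ["heavyweight 734308 FALSE", "nan", "nan"],
})

# Extract the first 6-digit run from each cell, column by column.
# Cells without a match (including the 'nan' strings) become NaN.
ids = df.apply(lambda col: col.str.extract(r"(\d{6})", expand=False))

# Back-fill along each row so the first column holds the first id found.
df["global_id"] = ids.bfill(axis=1).iloc[:, 0]
print(df)
```

This works on the actual cell contents, so it does not depend on how pandas chooses to print a row.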

You can also use stack and merge:

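import pandas as pd

# d is the sample dict from the question
df = pd.DataFrame(d)

# Extract per cell, then back-fill across columns so the first column holds the first match per row
s = df.stack().str.extract(r"(\d{6})").unstack().bfill(axis=1).iloc[:, 0]

print(df.merge(s, how="left", left_index=True, right_index=True))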

Output:

                                                 titles1 titles2                                            titles3 titles4 titles5                                            titles6 (0, titles1)
0                                                nan     nan                                                nan     nan     nan  :[]}] 10/10/2015 heavyweight 734308 FALSE [6 2...       734308
1                                                nan     nan  :[]}] 2/13/2016 cruiserweight 746272 FALSE nan...     nan     nan                                                nan       746272
2                                                nan     nan                                                nan     nan     nan  :[]}] 2/24/2018 heavyweight 827050 FALSE [6 4 ...       827050
3                                                nan     nan                                                nan     nan     nan                                                nan          NaN
4  :[]}] 3/16/2019 lightweight 870590 FALSE nan C...     nan                                                nan     nan     nan                                                nan       870590
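The new column in the output above ends up with the tuple name `(0, titles1)`. If a column called global_id is wanted, as in the question, one option is to extract with `expand=False` and rename the resulting Series before joining. This is a sketch on made-up sample data, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample with the same layout as the question's data.
df = pd.DataFrame({
    "titles1": ["nan", "nan", "lightweight 870590 FALSE"],
    "titles2": ["cruiserweight 746272 FALSE", "nan", "nan"],
})

s = (
    df.stack()                                # Series indexed by (row, column)
      .str.extract(r"(\d{6})", expand=False)  # stays a Series, NaN where no match
      .unstack()                              # back to one column per titles column
      .bfill(axis=1)                          # pull the first match to the left
      .iloc[:, 0]
      .rename("global_id")                    # name the result explicitly
)

out = df.join(s)
print(out)
```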
