Extracting digits from string column

Question

I have multiple columns named titles (titles1, titles2, and so on). I would like to extract a 6-digit number from each of these columns where such a number exists and place it in a new column named global_id. Some titles columns are empty, or rather contain the string nan.

This is what I have written thus far:

def titles_split(df,col):
    df[col] = df[col].astype('str')
    return df[col].str.extract('(\d{6})')
for i in range(1,75):
    if (df_split['titles'+str(i)] == 'nan') == False:
        df_split['global_id'] = titles_split(df_split,'titles'+str(i))

So I would like to take the 6-digit number and place it in a column named global_id, but only if the source column does not contain the string nan.

However, this returns the following error message:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Here is a sample of my data:

        {'titles1': {0: 'nan',
  1: 'nan',
  2: 'nan',
  3: 'nan',
  4: ':[]}] 3/16/2019 lightweight 870590 FALSE nan Cristopher di Girolamo Italy 1 [["career"         \\n        \\n2019-2019\\n]] /build/images/main/avatar.jpeg [[1153 2] [21 29]] 98 nan Miami  Flor'},
 'titles2': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
 'titles3': {0: 'nan',
  1: ':[]}] 2/13/2016 cruiserweight 746272 FALSE nan Alvin Davie USA 3 [["career"         \\n        \\n2016-2019\\n]] /build/images/main/avatar.jpeg [[555 1140] [110 226]] 98 nan Miami  Flor',
  2: 'nan',
  3: 'nan',
  4: 'nan'},
 'titles4': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
 'titles5': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan'},
 'titles6': {0: ':[]}] 10/10/2015 heavyweight 734308 FALSE [6 2 188] Joseph White USA 6 [["career"         \\n        \\n2015-2019\\n]] https://boxrec.com/media/images/thumb/9/9c/734308.jpeg/200px-734308.jpeg [[679 1311] [180 350]] 98 nan Miami  Flor',
  1: 'nan',
  2: ':[]}] 2/24/2018 heavyweight 827050 FALSE [6 4 193] Anthony Martinez USA 6 [["career"         \\n        \\n2018-2019\\n]] https://boxrec.com/media/images/thumb/c/cb/AnthonyMartinez.jpg/200px-AnthonyMartinez.jpg [[648 1311] [171 350]] 98 [78 198] Miami  Flor',
  3: 'nan',
  4: 'nan'}}

Update:

I managed to get rid of the initial error by replacing == False with is False. However, the problem now is that I get NaN values for all rows in the new global_id column.

So this is what I am doing now

def titles_split(df,col):
    return df[col].str.extractall('(\d{6})')
for i in range(1,75):
    if (df_split['titles'+str(i)] == 'nan') is False:
        df_split['global_id'] = titles_split(df_split,'titles'+str(i))

This is the output of the global_id column:

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
     ...

Format your data properly; it's hard to tell which value belongs to which column. Also add the complete error message and the expected output.
– Sociopath, Commented Nov 18, 2019 at 9:42
You could use a regex with pandas.Series.str.extract or pandas.Series.str.extractall: pandas.pydata.org/pandas-docs/stable/reference/api/…
– Viach, Commented Nov 18, 2019 at 9:44
What do you mean by the line "df_split['titles'+str(i)] == 'nan'"? Comparing the whole Series with 'nan' in an if statement is meaningless.
– strnam, Commented Nov 18, 2019 at 10:09
Try this regex method inside the function titles_split(): re.findall(r'\d{6}', test_string), where test_string is a value from df[col].
– Subbu VidyaSekar, Commented Nov 18, 2019 at 10:14
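
To illustrate the point raised in the comments, here is a minimal sketch of what the comparison in the question actually produces. The two-row Series is a hypothetical stand-in for one titles column, not the asker's data. It shows that == returns one boolean per row, that using that result in an if raises the ValueError from the question, and that is False compares object identity and so is never true.

```python
import pandas as pd

# Hypothetical two-row column standing in for one of the titles columns.
col = pd.Series(["nan", "heavyweight 734308 FALSE"])

# Element-wise comparison: the result is a boolean Series, not a single bool.
mask = col == "nan"
print(mask.tolist())

# Using the Series as a condition is what raises the ValueError.
try:
    bool(mask)
    raised = False
except ValueError:
    raised = True

# `is False` tests object identity; a Series is never the False singleton,
# so the body of the `if` in the updated loop never runs.
identity_check = mask is False

# Reduce the mask explicitly when a single yes/no answer is wanted.
has_real_value = (~mask).any()
```

Reducing with .any() or .all() answers a question about the whole column. Choosing values row by row calls for a mask or a vectorised string method instead.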

Viach · Accepted Answer · 2019-11-18 10:14:28Z

4

Using pandas str functions:

df['global_id'] = df.loc[:, df.columns].apply(str, axis=1).str.extract(r'.*(\d{6})')
df

titles1     titles2     titles3     titles4     titles5     titles6     global_id
0   nan     nan     nan     nan     nan     :[]}] 10/10/2015 heavyweight 734308 FALSE [6 2...   734308
1   nan     nan     :[]}] 2/13/2016 cruiserweight 746272 FALSE nan...   nan     nan     nan     746272
2   nan     nan     nan     nan     nan     :[]}] 2/24/2018 heavyweight 827050 FALSE [6 4 ...   827050
3   nan     nan     nan     nan     nan     nan     NaN
4   :[]}] 3/16/2019 lightweight 870590 FALSE nan C...   nan     nan     nan     nan     nan     870590
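
The one-liner above converts each row to its printed form before matching, so the result may depend on how pandas displays long cells. A possible alternative, sketched here under the assumption that every relevant column name starts with titles, extracts from each column separately and keeps the first match per row. The shortened sample data and the names title_cols and ids are illustrative only.

```python
import pandas as pd

# Shortened stand-in for the sample data in the question.
df = pd.DataFrame({
    "titles1": ["nan", "nan", "lightweight 870590 FALSE"],
    "titles2": ["heavyweight 734308 FALSE", "nan", "nan"],
})

# Assumption: all columns to search share the "titles" prefix.
title_cols = [c for c in df.columns if c.startswith("titles")]

# Extract the first 6-digit run per cell; expand=False returns one Series
# per column, and cells without a match become NaN.
ids = df[title_cols].apply(lambda s: s.str.extract(r"(\d{6})", expand=False))

# Back-fill along each row so the first column holds the first match found.
df["global_id"] = ids.bfill(axis=1).iloc[:, 0]
print(df["global_id"].tolist())
```

Rows whose titles columns all hold the string nan end up with NaN in global_id, which matches the behaviour of row 3 in the output above.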


Henry Yik · Accepted Answer · 2019-11-18 10:17:52Z

You can also use stack and merge:

df = pd.DataFrame(d)

s = df.stack().str.extract(r"(\d{6})").unstack().bfill(axis=1).iloc[:, 0]

print (df.merge(s,how="left",left_index=True,right_index=True))

#

                                                 titles1 titles2                                            titles3 titles4 titles5                                            titles6 (0, titles1)
0                                                nan     nan                                                nan     nan     nan  :[]}] 10/10/2015 heavyweight 734308 FALSE [6 2...       734308
1                                                nan     nan  :[]}] 2/13/2016 cruiserweight 746272 FALSE nan...     nan     nan                                                nan       746272
2                                                nan     nan                                                nan     nan     nan  :[]}] 2/24/2018 heavyweight 827050 FALSE [6 4 ...       827050
3                                                nan     nan                                                nan     nan     nan                                                nan          NaN
4  :[]}] 3/16/2019 lightweight 870590 FALSE nan C...     nan                                                nan     nan     nan                                                nan       870590
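
In the output above the new column is labelled (0, titles1) rather than global_id. One way to get the requested name, sketched here with shortened stand-in data rather than the original dictionary d, is to extract with expand=False so the intermediate result stays a Series, rename it, and join it back.

```python
import pandas as pd

# Shortened stand-in for the dictionary d used in the answer.
df = pd.DataFrame({
    "titles1": ["nan", "lightweight 870590 FALSE"],
    "titles2": ["heavyweight 734308 FALSE", "nan"],
})

s = (
    df.stack()
      .str.extract(r"(\d{6})", expand=False)  # Series, so unstack keeps plain column names
      .unstack()
      .bfill(axis=1)                          # pull the first match to the left
      .iloc[:, 0]
      .rename("global_id")
)

# join aligns on the index, like the left merge in the answer.
out = df.join(s)
print(out["global_id"].tolist())
```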
