I am practicing pandas and am stuck on an exercise.

I have an Excel file with a column that stores two types of URLs.

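import pandas as pd

df = pd.DataFrame({'id': [None] * 4,  # placeholder values; an empty list would not match the length of 'url'
                   'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})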
| id | url |
| -------- | -------------- |
|          | 'www.something/12312'    |
|          | 'www.something/12343'    |
|          | 'www.somethingelse/42312'    |
|          | 'www.somethingelse/62343'    |

I am supposed to turn each URL into an id. Only the number from the URL should be part of the id, with a prefix that depends on the site. The new id column should look like this:

df = pd.DataFrame({'id': ['id_12312', 'id_12343', 'diffid_42312', 'diffid_62343'], 'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| id_12312    | 'www.something/12312'  |
| id_12343    | 'www.something/12343'    |
| diffid_42312    | 'www.somethingelse/42312'    | 
| diffid_62343    | 'www.somethingelse/62343'    | 

My problem is how to extract only the number and attach the prefix that matches the kind of URL. I have tried the replace and extract functions in pandas:

id_replaced = df.replace(regex={re.search('something', df['url']): 'id_' + str(re.search(r'\d+', i).group()), re.search('somethingelse', df['url']): 'diffid_' + str(re.search(r'\d+', i).group())})
        
df['id'] = df['url'].str.extract(re.search(r'\d+', df['url']).group())

However, both attempts throw TypeError: expected string or bytes-like object.
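
For context on the error: re.search operates on a single string, so passing it a whole pandas Series raises this TypeError. The following is a minimal sketch, using a shortened copy of the example data, that reproduces the error and shows the per-row equivalent through the .str accessor. The variable names raised and digits are illustrative only.

```python
import re

import pandas as pd

df = pd.DataFrame({'url': ['www.something/12312', 'www.somethingelse/42312']})

# re.search expects one string; a Series is not a string, so this raises TypeError.
try:
    re.search(r'\d+', df['url'])
    raised = False
except TypeError:
    raised = True

# The .str accessor applies the pattern to each row instead.
# expand=False returns a Series rather than a one-column DataFrame.
digits = df['url'].str.extract(r'(\d+)', expand=False)
print(raised)
print(digits.tolist())
```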

Sorry for the tables in codeblock. The page was screaming that I have code that is not properly formatted when it was not in a codeblock.

  • Please format your examples so they are reproducible: stackoverflow.com/questions/20109391/… Commented May 27, 2021 at 9:49
  • Formatting corrected Commented May 27, 2021 at 10:44
  • What exactly is diffid? When do you use id as the prefix and when do you use diffid? Commented May 27, 2021 at 11:43

1 Answer

Here is one solution, although I don't quite understand when you use the id prefix and when you use diffid:

>>> df.id = 'id_'+df.url.str.split('/', n=1, expand=True)[1]
>>> df
         id                      url
0  id_12312      www.something/12312
1  id_12343      www.something/12343
2  id_42312  www.somethingelse/42312
3  id_62343  www.somethingelse/62343

Or, using str.extract:

>>> df.id = 'id_' + df.url.str.extract(r'/(\d+)$')
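
As a side note on the str.extract variant: with a single capture group it returns a one-column DataFrame by default, and passing expand=False returns a Series instead, which can be assigned to a column directly. A minimal sketch on the question's data, using bracket assignment so it does not depend on an id column already existing:

```python
import pandas as pd

df = pd.DataFrame({'url': ['www.something/12312', 'www.something/12343',
                           'www.somethingelse/42312', 'www.somethingelse/62343']})

# Capture the digits after the last slash; expand=False yields a Series of strings.
numbers = df['url'].str.extract(r'/(\d+)$', expand=False)

# Prepend the fixed prefix; this does not yet distinguish between the two sites.
df['id'] = 'id_' + numbers
print(df)
```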

2 Comments

Thank you. The prefix is supposed to be different for a different web page, so when I have a webpage somethingelse the prefix is diffid_, but when I have webpage something the prefix is id_
Thanks, I managed to solve it for the prefix too, thanks to your help :)

df['id_num'] = df.url.str.extract(r'/(\d+)$').astype(str)
df['id_prefix'] = np.where((df['url'].str.contains('somethingelse')), 'diffid_', 'id_')
df['id'] = df['id_prefix'] + df['id_num']
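
The approach from this comment can be sketched as a self-contained script. It assumes the two site names from the question and tests for 'somethingelse' specifically, because 'something' is a substring of both URLs. The intermediate names numbers and prefixes are illustrative and replace the helper columns used in the comment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'url': ['www.something/12312', 'www.something/12343',
                           'www.somethingelse/42312', 'www.somethingelse/62343']})

# Digits after the last slash, as a Series of strings.
numbers = df['url'].str.extract(r'/(\d+)$', expand=False)

# Choose the prefix per row: 'diffid_' for somethingelse URLs, 'id_' otherwise.
is_other_site = df['url'].str.contains('somethingelse', regex=False)
prefixes = pd.Series(np.where(is_other_site, 'diffid_', 'id_'), index=df.index)

# Element-wise string concatenation of prefix and number.
df['id'] = prefixes + numbers
print(df[['id', 'url']])
```

Keeping the prefix and number in local variables avoids leaving the id_num and id_prefix helper columns in the DataFrame.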
