I'm scraping Reddit usernames using Python and I'm trying to extract the username from a URL. The URL looks like this:

https://www.reddit.com/user/ExampleUser

This is my code:

def extract_username(url):
    start = url.find('https://www.reddit.com/user/') + 28
    end = url.find('?', start)
    end2 = url.find("/", start)
    return url[start:end] and url[start:end2] and url[start:]

The first part works, but removing the question mark and forward slash doesn't. Maybe I'm using the "and" keyword wrong? As a result, I sometimes get something like this:

ExampleUser/
ExampleUser/comments/
ExampleUser/submitted/
ExampleUser/gilded/
ExampleUser?sort=hot
ExampleUser?sort=new
ExampleUser?sort=top
ExampleUser?sort=controversial

I know I can use the API, but I'd like to learn how to do it without. I've also heard about regular expressions, but aren't they pretty slow?

3 Answers

You could use the re module.

>>> s = "https://www.reddit.com/user/ExampleUser/comments/"
>>> import re
>>> re.search(r'https://www\.reddit\.com/user/([^/?]+)', s).group(1)
'ExampleUser'

[^/?]+ is a negated character class that matches one or more characters other than / or ?. The parentheses around it form a capturing group, which captures the matched characters. The captured text can then be retrieved with .group(1), or referred to by a back-reference such as \1, which points to group index 1.

You can also wrap this in a separate function:

>>> def extract_username(url):
...     return re.search(r'https://www\.reddit\.com/user/([^/?]+)', url).group(1)
... 
>>> extract_username('https://www.reddit.com/user/ExampleUser')
'ExampleUser'
>>> extract_username('https://www.reddit.com/user/ExampleUser/submitted/')
'ExampleUser'
>>> extract_username('https://www.reddit.com/user/ExampleUser?sort=controversial')
'ExampleUser'
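
One thing to watch: re.search returns None when the URL does not match, so calling .group(1) directly raises AttributeError. The following is a minimal sketch of a more defensive variant, not a definitive implementation. The names USER_URL and extract_username_or_none are made up for illustration, and returning None on a non-match is an assumption about what the caller wants.

```python
import re

# Hypothetical module-level pattern, compiled once. The dots are escaped
# so they only match literal dots in the host name.
USER_URL = re.compile(r'https://www\.reddit\.com/user/([^/?]+)')

def extract_username_or_none(url):
    """Return the username, or None if url is not a Reddit user URL."""
    match = USER_URL.search(url)
    # search() gives None on failure, so check before calling group().
    return match.group(1) if match else None
```

With this, a URL such as 'https://www.reddit.com/r/python' gives None instead of an exception.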

This removes anything that follows a '?' and then splits on '/', retrieving the fifth element (index 4), which is the username:

>>> s = 'https://www.reddit.com/user/ExampleUser?sort=new'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'

This also works on the other cases that you showed. For example:

>>> s = 'https://www.reddit.com/user/ExampleUser/comments/'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'
>>> s = 'https://www.reddit.com/user/ExampleUser'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'
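
The same split idea can also be sketched with the standard library's URL parser, which separates the query string from the path for you. This assumes Python 3 (urllib.parse); the function name username_from_path and the choice to return None for non-user URLs are illustrative assumptions, not part of the answer above.

```python
from urllib.parse import urlsplit

def username_from_path(url):
    """Return the username from a Reddit user URL, or None otherwise."""
    # urlsplit drops the query string and fragment from .path, so only
    # something like '/user/ExampleUser/comments/' is left to split.
    parts = urlsplit(url).path.split('/')
    # parts looks like ['', 'user', 'ExampleUser', ...]
    if len(parts) > 2 and parts[1] == 'user' and parts[2]:
        return parts[2]
    return None
```

Unlike indexing with [4], this does not depend on how many slashes precede the path, and it checks that the segment before the name really is 'user'.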

Just for kicks, here's an example using find. Basically, you want to cut at the earliest position where a delimiter is found, or at the end of the string if neither delimiter is found:

def extract_username(url):
    # Drop the fixed prefix; this assumes url starts with it.
    username = url[len('https://www.reddit.com/user/'):]
    # str.find returns -1 when the delimiter is absent, so keep only
    # non-negative positions and cut at the smallest one.
    end = min(i for i in (len(username),
                          username.find('/'),
                          username.find('?')) if i >= 0)
    return username[:end]

for url in ('https://www.reddit.com/user/ExampleUser',
            'https://www.reddit.com/user/ExampleUser/submitted/',
            'https://www.reddit.com/user/ExampleUser?sort=controversial'):
    print(extract_username(url))
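
As for the "and" in the question: in Python, a and b and c does not combine the slices. It evaluates to the first falsy operand, or to the last operand if all are truthy. It is also worth knowing that find returns -1 when nothing is found, which turns the slice into url[start:-1]. The snippet below is a small sketch that replays the question's expression on one of its example URLs; the variable names a, b, c and result are only for illustration.

```python
url = 'https://www.reddit.com/user/ExampleUser?sort=hot'
start = len('https://www.reddit.com/user/')  # 28, as in the question

a = url[start:url.find('?', start)]  # 'ExampleUser'
# There is no '/' after start, so find returns -1 and the slice
# becomes url[start:-1], which only drops the last character.
b = url[start:url.find('/', start)]  # 'ExampleUser?sort=ho'
c = url[start:]                      # 'ExampleUser?sort=hot'

# All three strings are non-empty (truthy), so `and` yields the last one.
result = a and b and c
print(result)  # ExampleUser?sort=hot
```

That is why the original function returns everything after the prefix, delimiters included.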
