I have a list of several thousand URL strings with different structures, and I am trying to use regex to extract specific information from them. The following example shows the structure of one kind of URL (there are many other records in this format; only the numbers change across the data):

url_id | url_text
15     | /course/123908/discussion_topics/394785/entries/980389/read

Using the re library in Python, I can find which URLs have this structure:

re.findall(r"/course/\d{6}/discussion_topics/\d{6}/entries/\d{6}/read", text) 

However, I also need to extract the '394785' and '980389' values and create a new matrix that may look like this:

url_id | topic_394785 | entry_980389 | {other items will be added as new column}
15     | 1            | 1            | 0       | 0     | 1    | it goes like this

Can someone help me extract this specific info? I know that the 'split' method of 'str' could be an option, but I wonder if there is a better solution.

Thanks!

  • If your string consists of a fixed number of fields, separated by /, then split() is the best solution. Commented Jan 17, 2017 at 12:38
  • Why not just use regex capturing groups? Commented Jan 17, 2017 at 12:39
  • Yes, you may use a regex with capturing groups with re.finditer so as to have access to the whole match. Commented Jan 17, 2017 at 12:43
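
For reference, the split() suggestion from the first comment could look roughly like the sketch below. The function name extract_ids_by_split is made up for illustration, and the sketch assumes the URL always has exactly the seven fields shown in the question. It only checks that the ID fields are numeric, not that they are six digits long.

```python
def extract_ids_by_split(url_text):
    """Return (topic_id, entry_id), or None if the URL has a different shape."""
    parts = url_text.strip("/").split("/")
    # Expected fields:
    # ['course', '<id>', 'discussion_topics', '<id>', 'entries', '<id>', 'read']
    if (len(parts) == 7
            and parts[0] == "course"
            and parts[2] == "discussion_topics"
            and parts[4] == "entries"
            and parts[6] == "read"
            and all(parts[i].isdigit() for i in (1, 3, 5))):
        return parts[3], parts[5]
    return None


print(extract_ids_by_split(
    "/course/123908/discussion_topics/394785/entries/980389/read"))
```

With URLs in several different structures, the explicit field checks get verbose quickly, which is one reason a regex with capturing groups may be easier to maintain here.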

1 Answer

Do you mean something like this?

import re

text = '/course/123908/discussion_topics/394785/entries/980389/read'
pattern = r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"

# finditer yields one match object per occurrence; the named groups hold the IDs
for match in re.finditer(pattern, text):
    topic, entry = match.group('topic'), match.group('entry')
    print('Topic ID={}, entry ID={}'.format(topic, entry))

Output

Topic ID=394785, entry ID=980389
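
The question also asks for a 0/1 matrix with one column per topic and entry. One possible way to build it from the extracted IDs, using only the standard library, is sketched below. The sample records and the build_indicator_rows helper are hypothetical, and re.fullmatch is used here (instead of finditer) on the assumption that each url_text holds exactly one URL.

```python
import re

PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})"
    r"/entries/(?P<entry>\d{6})/read"
)

# Hypothetical sample records: (url_id, url_text)
records = [
    (15, "/course/123908/discussion_topics/394785/entries/980389/read"),
    (16, "/course/123908/discussion_topics/394785/entries/111222/read"),
    (17, "/course/123908/files"),  # different structure, no match
]


def build_indicator_rows(records):
    """Return (header, matrix) with a 0/1 column per topic/entry seen."""
    present_by_id = {}  # url_id -> set of column names that apply
    columns = []        # column names in order of first appearance
    for url_id, url_text in records:
        present = present_by_id.setdefault(url_id, set())
        match = PATTERN.fullmatch(url_text)
        if match is None:
            continue  # the row stays all zeros
        names = ("topic_" + match.group("topic"),
                 "entry_" + match.group("entry"))
        for name in names:
            present.add(name)
            if name not in columns:
                columns.append(name)
    matrix = [
        [url_id] + [1 if col in present else 0 for col in columns]
        for url_id, present in present_by_id.items()
    ]
    return ["url_id"] + columns, matrix


header, matrix = build_indicator_rows(records)
print(header)
for row in matrix:
    print(row)
```

Columns for other URL structures could be added the same way, by matching further patterns and adding their column names to the same sets.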

3 Comments

Thanks for your answer! However, I have another question. Is it possible to apply this to a list without using loops?
Something like [text1, text2, text3,...]
@renakre I'm not sure. What's wrong with iterating through a list?
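
On the list question: some iteration has to happen somewhere, but it does not need to be an explicit for statement. A rough sketch with a list comprehension follows, assuming a hypothetical list named urls. The second variant joins the list into one string and relies on re.findall returning a tuple of groups per match.

```python
import re

PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})"
    r"/entries/(?P<entry>\d{6})/read"
)

# Hypothetical input list
urls = [
    "/course/123908/discussion_topics/394785/entries/980389/read",
    "/course/123908/files",
    "/course/123908/discussion_topics/555666/entries/777888/read",
]

# Variant 1: comprehension over match objects, skipping non-matching URLs
matches = (PATTERN.fullmatch(url) for url in urls)
ids = [(m.group("topic"), m.group("entry")) for m in matches if m is not None]

# Variant 2: join into one string; findall returns (topic, entry) tuples
ids_from_joined = PATTERN.findall("\n".join(urls))

print(ids)
print(ids_from_joined)
```

Variant 2 loses the link between each result and its position in the list, so the comprehension is probably the safer choice when the url_id has to be kept alongside the IDs.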
