Extracting specific information from a string list using regular expressions

Question

I have a list of several thousand URL strings with different structures, and I am trying to use regex to extract specific information from them. The following example shows one of these structures (many other records share this format; only the numbers change across the data):

url_id | url_text
15     | /course/123908/discussion_topics/394785/entries/980389/read

Using the re library in Python, I can find which URLs have this structure:

re.findall(r"/course/\d{6}/discussion_topics/\d{6}/entries/\d{6}/read", text)

However, I also need to extract the '394785' and '980389' values and create a new matrix that may look like this:

url_id | topic_394785 | entry_980389 | {other items will be added as new columns}
15     | 1            | 1            | 0       | 0     | 1    | it goes like this

Can someone help me extract this specific information? I know that the split method of str could be an option, but I wonder if there is a better solution.

Thanks!

If your string consists of a fixed number of fields, separated by /, then split() is the best solution.
– Tomalak, Commented Jan 17, 2017 at 12:38
Yes, you may use a regex with capturing groups with re.finditer so as to have access to the whole match.
– Wiktor Stribiżew, Commented Jan 17, 2017 at 12:43
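
To illustrate the split() suggestion in the comment above, here is a minimal sketch. It assumes every URL of interest has exactly the seven slash-separated fields shown in the question; the helper name parse_discussion_url is hypothetical, and unlike the \d{6} pattern it only checks that the ID fields are numeric, not that they have six digits.

```python
def parse_discussion_url(url):
    """Return (topic_id, entry_id) for a discussion-entry URL, else None."""
    parts = url.strip('/').split('/')
    # Expected layout: course/<id>/discussion_topics/<id>/entries/<id>/read
    if (len(parts) == 7
            and parts[0] == 'course'
            and parts[2] == 'discussion_topics'
            and parts[4] == 'entries'
            and parts[6] == 'read'
            and all(parts[i].isdigit() for i in (1, 3, 5))):
        return parts[3], parts[5]
    return None


print(parse_discussion_url('/course/123908/discussion_topics/394785/entries/980389/read'))
```

URLs with any other structure return None, so they can be filtered out in the same pass.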

Tagc · Accepted Answer · 2017-01-17 12:43:25Z

Do you mean something like this?

import re

text = '/course/123908/discussion_topics/394785/entries/980389/read'
pattern = r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"

for match in re.finditer(pattern, text):
    topic, entry = match.group('topic'), match.group('entry')
    print('Topic ID={}, entry ID={}'.format(topic, entry))

Output

Topic ID=394785, entry ID=980389
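
To get from the extracted IDs to the 0/1 matrix described in the question, something like the following might work. It is only a sketch: the records list is made-up sample data, build_indicator_rows is a hypothetical helper, and it assumes each url_text holds a single URL (hence fullmatch rather than finditer).

```python
import re

PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
)

# Hypothetical sample data: (url_id, url_text) pairs.
records = [
    (15, '/course/123908/discussion_topics/394785/entries/980389/read'),
    (16, '/course/123908/discussion_topics/394785/entries/111222/read'),
    (17, '/course/123908/files'),
]


def build_indicator_rows(records):
    """Return (column_names, {url_id: [0/1, ...]}) for the given records."""
    rows = {}
    columns = []
    for url_id, url_text in records:
        row = rows.setdefault(url_id, {})
        match = PATTERN.fullmatch(url_text)
        if match is None:
            continue  # URL has a different structure; its row stays all zeros
        for name in ('topic_' + match.group('topic'),
                     'entry_' + match.group('entry')):
            row[name] = 1
            if name not in columns:
                columns.append(name)
    # Fill in 0 for every column a record did not set.
    matrix = {url_id: [row.get(c, 0) for c in columns]
              for url_id, row in rows.items()}
    return columns, matrix


columns, matrix = build_indicator_rows(records)
print(columns)
for url_id, values in matrix.items():
    print(url_id, values)
```

Columns are created in the order the IDs are first seen, so new topic and entry columns are appended as more URLs are processed.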

3 Comments

renakre Over a year ago

Thanks for your answer! However, I have another question. Is it possible to apply this to a list without using loops?

renakre Over a year ago

Something like [text1, text2, text3,...]

Tagc Over a year ago

@renakre I'm not sure. What's wrong with iterating through a list?
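
On applying the pattern to a list like [text1, text2, text3, ...] without writing an explicit for loop: one possible approach is map() plus a list comprehension, sketched below. The urls list is hypothetical sample data, and the list is still iterated under the hood; the loop is just expressed more compactly.

```python
import re

PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
)

# Hypothetical sample list of URL strings.
urls = [
    '/course/123908/discussion_topics/394785/entries/980389/read',
    '/course/123908/files',
    '/course/123908/discussion_topics/394785/entries/111222/read',
]

# Apply the compiled pattern to every URL, then keep only the ones that matched.
matches = map(PATTERN.search, urls)
ids = [(m.group('topic'), m.group('entry')) for m in matches if m]

print(ids)
```

URLs that do not match the pattern are skipped rather than raising an error, because PATTERN.search returns None for them.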
