I have a list of several thousand URL strings in different structures, and I am trying to use regex to extract specific information from them. The following example URL gives an idea of one specific structure (note that there are many other records in this format; only the numbers change across the data):
url_id | url_text
15 | /course/123908/discussion_topics/394785/entries/980389/read
Using the re library in Python, I can find which URLs have this structure:
re.findall(r"/course/\d{6}/discussion_topics/\d{6}/entries/\d{6}/read", text)
However, I also need to extract the '394785' and '980389' values and create a new matrix that may look like this:
url_id | topic_394785 | entry_980389 | {other items will be added as new column}
15 | 1 | 1 | 0 | 0 | 1 | it goes like this
Can someone help me extract this specific info? I know that the 'split' method of 'str' could be an option, but I wonder if there is a better solution.
Thanks!
... /, then split() is the best solution.
... re.finditer so as to have access to the whole match.
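As a rough sketch of the two approaches mentioned here (split() and re.finditer), the snippet below pulls the topic and entry IDs out of a URL shaped like the one in the question. The function names, the named groups `topic` and `entry`, and the dict-of-dicts layout for the indicator rows are assumptions made for illustration, not something taken from the original post.

```python
import re

# Same pattern as in the question, with capture groups around the two IDs of interest.
URL_PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
)


def extract_ids_regex(text):
    """Return (whole_match, topic_id, entry_id) for every matching URL in text."""
    return [
        (m.group(0), m.group("topic"), m.group("entry"))
        for m in URL_PATTERN.finditer(text)
    ]


def extract_ids_split(url):
    """Return (topic_id, entry_id) by position; assumes the fixed layout shown above."""
    parts = url.strip("/").split("/")
    # parts: ['course', '123908', 'discussion_topics', '394785', 'entries', '980389', 'read']
    return parts[3], parts[5]


def build_indicator_rows(urls_by_id):
    """Map each url_id to a dict of indicator columns such as 'topic_394785': 1."""
    rows = {}
    for url_id, url in urls_by_id.items():
        row = rows.setdefault(url_id, {})
        for _, topic, entry in extract_ids_regex(url):
            row[f"topic_{topic}"] = 1
            row[f"entry_{entry}"] = 1
    return rows
```

The split() version is shorter but relies on every URL having exactly this layout, while the regex version skips anything that does not match. If pandas is available, the rows could probably be turned into the 0/1 matrix with something like `pandas.DataFrame.from_dict(rows, orient="index").fillna(0)`, so that columns missing for a given url_id become 0.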