Extracting RegEx pattern across list excluding other html code

Question

I've written a script that pulls the list of available report URL extensions from a page, so that the reports can then be used for text extraction.

I parse the page with BeautifulSoup and extract the area that holds the links to the latest reports, like this:

home = BeautifulSoup(home_url, 'html.parser')
container = home.find('div', attrs={'class': 'list'})
report_url_locations = list(x for x in container.findAll('a'))

This generates a list with each report and its unique HTML extension, which is updated each time a new report is uploaded, for example:

[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
 <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
 <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
 <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]

I've managed to write some code to strip out html junk and keep just the extension for the first element (i.e. first report).

latest_sitrep_location = str(report_url_locations[0])
latest_sitrep_htm_location = re.search(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm",latest_sitrep_location)

This gives me:

"2022-05/13/c_76843.htm"

But when I try to do this for every element of the list, it returns all the junk in between as well:

all_urls= re.findall(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", str(report_url_locations))
all_urls

['2022-05/13/c_76843.htm">May 16: Daily Report</a>, <a href="2022-05/12/c_76842.htm">May 15: Daily Report</a>, <a href="2022-05/11/c_76841.htm">May 14: Daily Report</a>, <a href="2022-05/10/c_76839.htm">May 13: Daily Report</a>]

But what I want is:

["2022-05/13/c_76843.htm","2022-05/12/c_76842.htm","2022-05/11/c_76841.htm","2022-05/10/c_76839.htm"]

Can somebody tell me what I need to include in my RegEx to ensure the other HTML is excluded? I'm fairly sure I need to convert every element in report_url_locations to a string, but I don't know how to do this en masse.

baduker · Accepted Answer · 2022-05-16 11:35:09Z

2

Why don't you just try this:

report_url_locations = [x["href"] for x in container.findAll('a')]

And then just print report_url_locations.

By the way, here's why you shouldn't be using regex to parse HTML.
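For reference, here is a minimal, self-contained sketch of that approach. The sample markup is modelled on the links in the question; the wrapping `<div class="list">` and the extra anchor without an `href` are assumptions added for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question. The wrapping <div class="list">
# and the last anchor (which has no href) are assumed for illustration.
html = """
<div class="list">
  <a href="2022-05/13/c_76843.htm">May 16: Daily report</a>
  <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>
  <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>
  <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>
  <a>No link here</a>
</div>
"""

home = BeautifulSoup(html, "html.parser")
container = home.find("div", attrs={"class": "list"})

# href=True keeps only anchors that actually carry an href attribute,
# so a["href"] cannot raise KeyError on a bare <a> tag.
report_url_locations = [a["href"] for a in container.find_all("a", href=True)]
print(report_url_locations)
```

Each tag behaves like a dictionary of its attributes, so there is no need to convert the tags to strings first.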

edited May 16, 2022 at 11:35

answered May 16, 2022 at 11:33


1 Comment

Pryore Over a year ago

That was just as smooth as butter, thank you :)

Nathan Furnal · Accepted Answer · 2022-05-16 11:38:06Z

1

Edit: Don't use regex for HTML parsing, you know the drill.

If you're set on using regex, though, you could use r'(?:href=)\"(.*?)\"'.


import re

text = """<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
 <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
 <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
 <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>
"""

re.findall(r'(?:href=)\"(.*?)\"', text)

Which outputs

['2022-05/13/c_76843.htm',
 '2022-05/12/c_76842.htm',
 '2022-05/11/c_76841.htm',
 '2022-05/10/c_76839.htm']
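As for why the pattern in the question over-matches: its `.+` is greedy and the dot is unescaped, so on a single-line string it runs on to the last "htm" it can find. The sketch below compares it with a stricter variant that escapes the dot; the `strict` pattern and the variable names are illustrative assumptions, and the input is rebuilt by hand to resemble `str(report_url_locations)`.

```python
import re

# Hand-built stand-in for str(report_url_locations): one line, comma-separated.
links = [
    '<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>',
    '<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>',
    '<a href="2022-05/11/c_76841.htm">May 14: Daily report</a>',
    '<a href="2022-05/10/c_76839.htm">May 13: Daily report</a>',
]
text = "[" + ", ".join(links) + "]"

# Pattern from the question: ".+" greedily consumes everything up to the
# final "htm" in the string, so all four links collapse into one match.
greedy = r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm"

# Stricter variant (an assumption, not from the thread): a literal "\."
# replaces ".+", so each match stops at its own ".htm".
strict = r"[0-9]+-[0-9]+/[0-9]+/c_[0-9]+\.htm"

greedy_matches = re.findall(greedy, text)
strict_matches = re.findall(strict, text)

print(len(greedy_matches))
print(strict_matches)
```

This only holds as long as every link follows the same date/c_number.htm shape; reading the `href` attribute from the parsed tags does not depend on that.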

edited May 16, 2022 at 11:38

answered May 16, 2022 at 11:35


3 Comments

baduker Over a year ago

Regex is a poor choice for HTML parsing. Read this thread on SO.

Nathan Furnal Over a year ago

I know I know, this is just for fun.

Nathan Furnal Over a year ago

I'll add a disclaimer
