
I've written a script to pull a list of the report URL extensions available on a page, for use in text extraction.

I used BeautifulSoup to parse the page and extract the area that references the latest reports, using this method:

home = BeautifulSoup(home_url, 'html.parser')
container = home.find('div', attrs={'class': 'list'})
report_url_locations = list(x for x in container.findAll('a'))

This generates a list of each report and its unique HTML extension, which is updated each time a new report is uploaded. For example:

[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
 <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
 <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
 <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]

I've managed to write some code that strips out the HTML junk and keeps just the extension for the first element (i.e. the first report).

latest_sitrep_location = str(report_url_locations[0])
latest_sitrep_htm_location = re.search(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm",latest_sitrep_location)

This gives me:

"2022-05/13/c_76843.htm"

But when I try to do this for every element of the list, it returns all the junk in between:

all_urls= re.findall(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", str(report_url_locations))
all_urls

['2022-05/13/c_76843.htm">May 16: Daily Report</a>, <a href="2022-05/12/c_76842.htm">May 15: Daily Report</a>, <a href="2022-05/11/c_76841.htm">May 14: Daily Report</a>, <a href="2022-05/10/c_76839.htm">May 13: Daily Report</a>]

But what I want is:

["2022-05/13/c_76843.htm","2022-05/12/c_76842.htm","2022-05/11/c_76841.htm","2022-05/10/c_76839.htm"]

Can somebody tell me what I need to change in my regex so that the other HTML is excluded? I'm fairly sure I need to convert every element in report_url_locations to a string, but I don't know how to do this en masse.

2 Answers

Why don't you just try this:

report_url_locations = [x["href"] for x in container.findAll('a')]

And then just print report_url_locations.
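
For anyone who wants to try this without the live page, here is a minimal, self-contained sketch of the same idea. The sample markup (including the wrapping div with class "list") is an assumption reconstructed from the question's output, and find_all with href=True is my own addition so that anchors without an href are skipped rather than raising a KeyError.

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the question's list; the wrapping div is an assumption.
html = """
<div class="list">
  <a href="2022-05/13/c_76843.htm">May 16: Daily report</a>
  <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>
  <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>
  <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>
</div>
"""

home = BeautifulSoup(html, "html.parser")
container = home.find("div", attrs={"class": "list"})

# Read the href attribute straight from each tag instead of
# converting the tags to strings and pattern-matching them.
# href=True skips any <a> tag that has no href attribute.
report_url_locations = [a["href"] for a in container.find_all("a", href=True)]

print(report_url_locations)
```

Because each element is a tag object, there is no need to convert anything to a string first; the attribute lookup does the work.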

By the way, you shouldn't be using regex to parse HTML.

1 Comment

That was just as smooth as butter, thank you :)

Edit: Don't use regex for HTML parsing, you know the drill.

If you're set on using regex though, you could use r'(?:href=)\"(.*?)\"'.


import re

text = """<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
 <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
 <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
 <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>
"""

re.findall(r'(?:href=)\"(.*?)\"', text)

Which outputs

['2022-05/13/c_76843.htm',
 '2022-05/12/c_76842.htm',
 '2022-05/11/c_76841.htm',
 '2022-05/10/c_76839.htm']
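
As for why the pattern in the question over-matched: as far as I can tell, the culprit is the unescaped ".+" before "htm". It is greedy, so on the single-line string produced by str() it runs all the way to the last "htm" it can find. The sketch below reproduces that and shows one possible tightening (escaping the dot); the sample string is an assumption built from the question's example output.

```python
import re

# str() of the result list as in the question: everything on one line.
text = ('[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>, '
        '<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>, '
        '<a href="2022-05/11/c_76841.htm">May 14: Daily report</a>, '
        '<a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]')

# Pattern from the question: the unescaped ".+" is greedy, so the
# single match stretches from the first URL to the last "htm".
greedy = re.findall(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", text)

# Escaping the dot makes it match only the literal "." before "htm",
# so each URL is matched on its own.
strict = re.findall(r"[0-9]+-[0-9]+/[0-9]+/c_[0-9]+\.htm", text)

print(len(greedy))
print(strict)
```

The stricter pattern is tied to this exact URL shape, so reading the href attribute with BeautifulSoup remains the more robust option.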

3 Comments

Regex is a poor choice for HTML parsing. Read this thread on SO.
I know I know, this is just for fun.
I'll add a disclaimer
