Use Regex with Python to get a specific part of the iframe src

Question

I am trying to capture the src of an iframe so that I can change it. I don't have direct access to the HTML; I get the HTML from an API.

Some example iframes are shown below:

<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">
<iframe allowfullscreen="" frameborder="0" height="276" mozallowfullscreen="" scrolling="no" src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/%20f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490"></iframe>

I have many other kinds of iframes; the only thing they have in common is this part of the src: https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302

I wrote the following code to find the matching elements:

import re
from bs4 import BeautifulSoup

# some code (page_html is the HTML string returned by the API)
regex_page_embed = r"http.?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/*"
soup = BeautifulSoup(page_html, 'html.parser')
page_elements = list(soup.children)
for element in page_elements:
    # search the serialized text of each top-level element
    s1 = re.search(regex_page_embed, str(element))
    if s1:
        print(s1)
        print(s1.group())

After that I have more code that changes the HTML through the API; I don't think it is necessary to include it here. But when I use:

print(s1)
print(s1.group())

I get the following result:

<_sre.SRE_Match object; span=(686, 771), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(126, 211), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(686, 771), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(227, 312), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/

I want to get the last part of the iframe src. In the example below:

<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">

f2c5f6ca3a4610c55d70cb211ef9d977 is the part that I want.

print(s1) and print(s1.group()) don't show the last part of the src. How can I get it?

Relevant read on parsing HTML content with regex: stackoverflow.com/a/1732454/9183344
– r.ook, Commented Mar 26, 2019 at 19:34
I'd just use bs4 to parse the iframe and then extract the src text content and go from there...
– r.ook, Commented Mar 26, 2019 at 19:36
I tried bs4 first to get the content, but I saw that I get more results with regex than with bs4. I investigated why this happens and found that some iframes are inserted into the page using JavaScript document.write. Only the regex was able to find those; bs4 can't find them.
– fabiobh, Commented Mar 26, 2019 at 19:41
Ah right, since it's dynamic content you should be using a different module like selenium or requests-html. I'm actually surprised you are able to get the iframe in the bs4 extracted content at all.
– r.ook, Commented Mar 26, 2019 at 19:46
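
Whichever tool ends up locating the iframe, once a src value is in hand its last path segment can be taken with the standard library instead of a regex. This is a minimal sketch, assuming the wanted ID is always the final path segment of the URL; last_path_segment is a hypothetical helper name, not something from the question.

```python
from urllib.parse import urlsplit, unquote


def last_path_segment(src):
    """Return the final path segment of a URL, percent-decoded and stripped."""
    path = urlsplit(src).path                       # "/pApiv2/embed/<id>/<id>"
    segment = path.rstrip("/").rsplit("/", 1)[-1]   # text after the last slash
    return unquote(segment).strip()                 # "%20f2c5..." becomes "f2c5..."


src_plain = "https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977"
src_encoded = "https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/%20f2c5f6ca3a4610c55d70cb211ef9d977"

print(last_path_segment(src_plain))
print(last_path_segment(src_encoded))
```

Decoding and stripping handles the second example in the question, where the ID is preceded by an encoded space (%20).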

Pushpesh Kumar Rajwanshi · Accepted Answer · 2019-03-26 19:47:38Z

1

A better regex, which captures the whole URL while allowing any optional content between the <iframe tag and the src attribute, is this:

<iframe .*?\bsrc="(https?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/[^"]+)

Match using this regex and capture your URL from group 1.
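
For context on why the pattern in the question stopped at the slash: in a regex, /* means "zero or more / characters"; it is not a wildcard for the rest of the URL. The sketch below illustrates the difference using a shortened, made-up URL (example.com and the abc123/def456 segments are placeholders, not values from the question).

```python
import re

url = "https://example.com/embed/abc123/def456"  # made-up, shortened URL

# "/*" repeats the slash itself, so the match ends right after "abc123/".
m_question = re.search(r"https?://example\.com/embed/abc123/*", url)
print(m_question.group())

# '[^"]+' consumes one or more characters that are not a double quote,
# and the parentheses make that text available as group 1.
m_fixed = re.search(r'https?://example\.com/embed/abc123/([^"]+)', url)
print(m_fixed.group(1))
```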

Here is your updated Python code,

import re
from bs4 import BeautifulSoup

# page_html is the HTML string returned by the API
regex_page_embed = r'<iframe .*?\bsrc="(https?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/[^"]+)'
soup = BeautifulSoup(page_html, 'html.parser')
page_elements = list(soup.children)
for element in page_elements:
    s1 = re.search(regex_page_embed, str(element))
    if s1:
        print(s1.group(1))  # extract the URL using the first group
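
If only the trailing ID is wanted rather than the whole URL, the capture group can be narrowed to the text after the common prefix. The following is a sketch under two assumptions that are not in the answer above: it scans the raw HTML string directly (the asker's comment notes that some iframes only appear inside document.write calls), and it decodes the %20 seen in the second example. ID_PATTERN and extract_ids are hypothetical names.

```python
import re
from urllib.parse import unquote

# Group 1 holds only the text after the common prefix, up to the closing quote.
ID_PATTERN = re.compile(
    r'https?://fast\.player\.liquidplatform\.com/pApiv2/embed/'
    r'e50a2b66dc19adc532f288eb4bf2d302/'
    r'([^"]+)'
)


def extract_ids(page_html):
    """Return the decoded trailing segment of every matching embed URL."""
    return [unquote(m.group(1)).strip() for m in ID_PATTERN.finditer(page_html)]


sample_html = '''
<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">
<iframe allowfullscreen="" frameborder="0" height="276" src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/%20f2c5f6ca3a4610c55d70cb211ef9d977" width="490"></iframe>
'''

print(extract_ids(sample_html))
```

Using finditer over the whole string returns every occurrence, whereas re.search on each top-level element reports at most one match per element.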


Russ Brown · Accepted Answer · 2019-03-26 19:02:29Z

1

Use r'<iframe src="[^"]*/([^"]+)"' as the pattern for your search.

Example:

>>> import re
>>> text = """<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">"""
>>> pat = r'<iframe src="[^"]*/([^"]+)"'
>>> search = re.search(pat, text)
>>> search[1]
'f2c5f6ca3a4610c55d70cb211ef9d977'



fabiobh Over a year ago

I have edited my question to include a second iframe example. I forgot to mention that the HTML includes other types of iframes. Your answer would be correct if all iframes looked like the first example. My page has other iframes that are completely different from the two examples I provided; the only common part is the iframe src content.
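
The limitation described in this comment can be reproduced with the two iframes from the question: the pattern in the answer requires src to be the first attribute. The sketch below compares it with a loosened variant that allows other attributes before src. The loosened pattern is an assumption for illustration, not part of the answer, and it still expects src to sit inside an <iframe ...> tag with a double-quoted value.

```python
import re

# Pattern from the answer: src must directly follow "<iframe ".
strict = re.compile(r'<iframe src="[^"]*/([^"]+)"')

# Loosened variant (assumption): skip any attributes that precede src.
loose = re.compile(r'<iframe\b[^>]*?\bsrc="[^"]*/([^"]+)"')

first = '<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">'
second = '<iframe allowfullscreen="" frameborder="0" height="276" mozallowfullscreen="" scrolling="no" src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/%20f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490"></iframe>'

print(strict.search(second))          # no match: src is not the first attribute
print(loose.search(first).group(1))
print(loose.search(second).group(1))  # still carries the leading %20
```

The second result keeps its percent-encoded space, so it would need decoding and stripping before being compared with the first.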
