I am trying to capture the src content of an iframe that I want to change. I don't have direct access to the HTML; I get it from an API.

Some example iframes are shown below:

<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">
<iframe allowfullscreen="" frameborder="0" height="276" mozallowfullscreen="" scrolling="no" src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/%20f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490"></iframe>

I have many other types of iframes; the only part they have in common is this part of the src content: https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302

I wrote the following code to find an element:

import re
from bs4 import BeautifulSoup

# some code (page_html is the HTML returned by the API)
regex_page_embed = r"http.?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/*"
soup = BeautifulSoup(page_html, 'html.parser')
page_elements = list(soup.children)
for element in page_elements:
    s1 = re.search(regex_page_embed, str(element))
    if s1:
        print(s1)
        print(s1.group())

After that I have more code that effectively changes the HTML through the API; I don't think it is necessary to include it here. But when I use:

print(s1)
print(s1.group())

I get the following result:

<_sre.SRE_Match object; span=(686, 771), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(126, 211), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(686, 771), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(227, 312), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/

I want to get the last part of the iframe src content. In the example below

<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">

The f2c5f6ca3a4610c55d70cb211ef9d977 is the part that I want.

print(s1) and print(s1.group()) don't show the last part of the src content. How can I get that last part?

  • In the regex, change the star at the end to (.*?)(?=\"). Commented Mar 26, 2019 at 18:49
  • Relevant read on parsing html content with regex: stackoverflow.com/a/1732454/9183344 Commented Mar 26, 2019 at 19:34
  • I'd just use bs4 to parse the iframe and then extract the src text content and go from there... Commented Mar 26, 2019 at 19:36
  • I tried bs4 first to get the content, but I get more results with regex than with bs4. I investigated why and found that some iframes are inserted into the page using JavaScript document.write. Only regex was able to find those; bs4 can't find them. Commented Mar 26, 2019 at 19:41
  • Ah right, since it's dynamic content you should be using a different module like selenium or requests-html. I'm actually surprised you are able to get the iframe in the bs4-extracted content at all. Commented Mar 26, 2019 at 19:46
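
To illustrate the first comment above: in a regex, a trailing `/*` means "zero or more slash characters", not a wildcard, which would explain why the match stops at the final slash. A minimal sketch, assuming a shortened sample string; the names `PREFIX`, `sample`, `original` and `fixed` are illustrative and not from the question:

```python
import re

# Common prefix from the question, ending in the final slash.
PREFIX = (r"http.?://fast\.player\.liquidplatform\.com/pApiv2/embed/"
          r"e50a2b66dc19adc532f288eb4bf2d302/")

sample = ('<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/'
          'e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" '
          'width="490">')

# Original pattern: "/*" only repeats the slash, so the ID is never consumed.
original = re.search(PREFIX + "*", sample)
print(original.group())

# Suggested change: lazily capture everything up to the closing double quote.
fixed = re.search(PREFIX + r'(.*?)(?=")', sample)
print(fixed.group(1))
```

The first print ends at the slash after the common prefix; the second prints only the trailing ID.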

2 Answers

A better regex for capturing the whole URL, allowing any optional content between the <iframe tag and the src attribute, is this:

<iframe .*?\bsrc="(https?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/[^"]+)

Match using this regex and capture your URL from group 1.

Here is your updated Python code,

regex_page_embed = r'<iframe .*?\bsrc="(https?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/[^"]+)'
soup = BeautifulSoup(page_html, 'html.parser')
page_elements = list(soup.children)
for element in page_elements:
    s1 = re.search(regex_page_embed, str(element))
    if s1:
        print(s1.group(1))  # extract the URL using the first group
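
If only the trailing ID is needed, one option is to wrap the last path segment in a second capture group and decode it, since the second example iframe in the question carries a stray `%20` before the ID. This is a sketch under those assumptions: `EMBED_RE` and `extract_ids` are hypothetical names, the function scans a raw HTML string rather than BeautifulSoup elements, and it assumes no attribute value before src contains a `>` character.

```python
import re
from urllib.parse import unquote

# Group 1 is the full URL, group 2 the part after the common prefix.
EMBED_RE = re.compile(
    r'<iframe\b[^>]*?\bsrc="'
    r'(https?://fast\.player\.liquidplatform\.com/pApiv2/embed/'
    r'e50a2b66dc19adc532f288eb4bf2d302/([^"]+))"'
)

def extract_ids(html):
    """Return the last path segment of every matching iframe src."""
    ids = []
    for match in EMBED_RE.finditer(html):
        # Decode percent-escapes such as %20, then drop surrounding whitespace.
        ids.append(unquote(match.group(2)).strip())
    return ids

sample = ('<iframe allowfullscreen="" frameborder="0" height="276" '
          'src="https://fast.player.liquidplatform.com/pApiv2/embed/'
          'e50a2b66dc19adc532f288eb4bf2d302/%20f2c5f6ca3a4610c55d70cb211ef9d977" '
          'width="490"></iframe>')
print(extract_ids(sample))  # ['f2c5f6ca3a4610c55d70cb211ef9d977']
```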

Use r'<iframe src="[^"]*/([^"]+)"' as the pattern for your search.

Example:

>>> import re
>>> text = """<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">"""
>>> pat = r'<iframe src="[^"]*/([^"]+)"'
>>> search = re.search(pat, text)
>>> search[1]
'f2c5f6ca3a4610c55d70cb211ef9d977'

1 Comment

I have edited my question to include a second iframe example. I forgot to mention that the HTML contains other types of iframes. Your answer would be correct if all iframes followed the first example. My page has other iframes that are completely different from the two examples I provided; the only common part is the iframe src content.
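
Given that the only stable part is the src prefix, a pattern anchored on that prefix alone, without assuming `<iframe` or any attribute order, may be more robust. It should also catch iframes embedded in `document.write` strings, where the quoting can differ. A sketch, not a definitive implementation: `ID_RE` and `find_embed_ids` are illustrative names, and the assumption is that the ID runs until a quote, whitespace, angle bracket or backslash.

```python
import re
from urllib.parse import unquote

# Anchor only on the common URL prefix; capture what follows it.
ID_RE = re.compile(
    r"https?://fast\.player\.liquidplatform\.com/pApiv2/embed/"
    r"e50a2b66dc19adc532f288eb4bf2d302/([^\"'\s<>\\]+)"
)

def find_embed_ids(page_html):
    """Return the unique trailing IDs, in order of first appearance."""
    seen = []
    for raw in ID_RE.findall(page_html):
        embed_id = unquote(raw).strip()  # handles a stray %20 before the ID
        if embed_id and embed_id not in seen:
            seen.append(embed_id)
    return seen
```

Calling `find_embed_ids(page_html)` on the raw HTML string from the API avoids depending on how BeautifulSoup splits the page into elements.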