background:
By default the website shows only a few names, and there is a "moreBtn" element that expands the full list.

code idea:
Create an HTMLSession, render the page with a script that clicks "moreBtn", then parse and extract the HTML data with BeautifulSoup.

problem:
The code only works once in a while. I don't see any error and the response status is still 200, but nothing further happens after clicking "moreBtn".

questions:
Based on this behaviour, it looks like the server is blocking the interaction, but I don't see any message or error saying that access was blocked (the site works fine when I open it in a real browser). I tried to catch a denied/blocked response but failed to capture anything.

  1. What is wrong here, and is there any advice on how to check whether something is being blocked?

  2. I know Playwright or Selenium will do the job, but why does "requests-html" fail to render consistently?

Thank you so much.

import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = "https://www.yuantaetfs.com/tradeInfo/pcf/0050"

def get_ETFPCF_yuanta_2():
    try:
        session = HTMLSession()
        response = session.get(url)

        print(response.status_code)
        script = """
        document.querySelector('div.moreBtn').click()
        """
        response.html.render(sleep=8,wait=8, script=script, timeout=8, keep_page=True)
        
        # Parse HTML content
        soup = BeautifulSoup(response.html.html, 'html.parser')

        response.html.session.close()
        # Find the table body (tbody)
        tbody = soup.find('div', class_='tbody')

        if tbody:
            # Extract all rows from the table body
            rows = tbody.find_all('div', class_='tr')

            # Process each row
            for row in rows:
                # Extract all cells (div elements with class "td")
                cells = row.find_all('div', class_='td')
                row_data = [cell.find('span', class_='').text for cell in cells]
                print(row_data)
        else:
            print("No tbody element found on the page.")

    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == '__main__':
    get_ETFPCF_yuanta_2()
  • Some servers block access without giving any information, to make your life harder. Pages are created for humans, and they may not give you any explanation when you run a script. Commented Apr 24 at 1:42
  • I've tried this with requests (without verifying the SSL certificate) in conjunction with BeautifulSoup and get consistent responses. Obviously, as BeautifulSoup is just a parser, you won't be able to emulate clicking the More button. Commented Apr 24 at 7:14
  • When I click the button in a real browser, I don't see any request for new data (tool: DevTools in Chrome/Firefox, tab: Network). It seems the (new) values are somewhere in a <script> with window.__NUXT__=..., so maybe you could try to extract the data from that part of the HTML. This may need more complex code with text.find(), slicing, or maybe regex. Or maybe you could run some JavaScript code to extract it. Commented Apr 24 at 9:51
  • Use the browser dev tools (Network + Console tabs) to compare what happens in a real browser vs requests-html. Commented Apr 24 at 10:23
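
The window.__NUXT__ idea from the comments could be sketched roughly as follows. This is only a minimal illustration using the standard library: extract_nuxt_source and sample_html are hypothetical names, and the sample payload is made up to stand in for the downloaded page. On the real site the assigned expression may be a JavaScript function call rather than plain JSON, so it may still need to be evaluated or parsed further.

```python
import re

# Captures the body of each <script> element in the page.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
# Matches the assignment itself, allowing optional spaces around "=".
ASSIGN_RE = re.compile(r"window\.__NUXT__\s*=\s*")


def extract_nuxt_source(html):
    """Return the JavaScript expression assigned to window.__NUXT__, or None."""
    for body in SCRIPT_RE.findall(html):
        match = ASSIGN_RE.search(body)
        if match:
            # Keep everything after the assignment and drop a trailing semicolon.
            return body[match.end():].strip().rstrip(";")
    return None


# Hypothetical sample standing in for response.text; not the real page content.
sample_html = """
<html><body>
<div class="tbody"></div>
<script>window.__NUXT__={"rows":[{"code":"1101","shares":3659}]};</script>
</body></html>
"""

nuxt_source = extract_nuxt_source(sample_html)
print(nuxt_source)
```

If this returns None for the real page, the data is probably not embedded this way and the Network tab comparison is the next thing to check.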

1 Answer

To debug, I added this line after line 31:

print(cells)

Output:

[<div class="td" data-v-818b5120=""><span class="d-md-none" data-v-818b5120="">股票代碼</span> <span data-v-818b5120="">1101</span></div>,
<div class="td" data-v-818b5120=""><span class="d-md-none" data-v-818b5120="">股票名稱</span> <span data-v-818b5120="">台泥</span></div>,
<div class="td" data-v-818b5120=""><span class="d-md-none" data-v-818b5120="">是否為現金替代</span> <span data-v-818b5120="">N</span></div>,
<div class="td" data-v-818b5120=""><span class="d-md-none" data-v-818b5120="">可否參予最小實物申購</span> <span data-v-818b5120="">Y</span></div>,
<div class="td" data-v-818b5120=""><span class="d-md-none" data-v-818b5120="">股數</span> <span data-v-818b5120="">3659</span></div>]

And so on. So we can see that each cell holds two spans: a label span with class "d-md-none" and a data span with no class.

I'm not sure which span you wanted the data from, but if it's the non-data span, just change the span class in your code from '' to 'd-md-none'.

If it's the data span, you could modify your JavaScript to find it by the attribute 'data-v-818b5120' and then print its value.
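
As an alternative on the Python side, here is a minimal sketch that picks the data span by position instead of by class. In the debug output above both spans carry the 'data-v-818b5120' attribute, so this assumes the data span is always the last span in a cell. row_values and sample_row are hypothetical names, and the sample is two cells copied from that output.

```python
from bs4 import BeautifulSoup

# Two cells copied from the debug output above, wrapped in one row.
sample_row = """
<div class="tr">
<div class="td" data-v-818b5120=""><span class="d-md-none" data-v-818b5120="">股票代碼</span> <span data-v-818b5120="">1101</span></div>
<div class="td" data-v-818b5120=""><span class="d-md-none" data-v-818b5120="">股數</span> <span data-v-818b5120="">3659</span></div>
</div>
"""


def row_values(row):
    """Return the text of the last span in each cell of a row."""
    values = []
    for cell in row.find_all("div", class_="td"):
        spans = cell.find_all("span")
        if spans:
            # Assumption: the label span comes first and the data span last.
            values.append(spans[-1].get_text(strip=True))
    return values


soup = BeautifulSoup(sample_row, "html.parser")
row = soup.find("div", class_="tr")
values = row_values(row)
print(values)
```

Taking spans[0] instead would return the label text, which is the "d-md-none" span.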
