background:
By default the website only shows a few names, and there is a "moreBtn" element that loads the full list.
code idea:
Create an HTMLSession, render the page with a script that clicks the "moreBtn", then parse and extract the data with BeautifulSoup.
problem:
Apparently the code only works once in a while. I don't see any error and the response status is still 200, but nothing happens after clicking "moreBtn".
questions:
Based on the behaviour it looks like the server blocked the interaction, but I don't see any message or error about being blocked (it works fine if I open a real browser). I tried to catch a denied/blocked response but failed to capture anything.
What is wrong here, and is there any advice on how to check whether something is being blocked?
I know Playwright or Selenium would do the job, but why does "requests-html" fail to render consistently?
Thank you so much.
from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = "https://www.yuantaetfs.com/tradeInfo/pcf/0050"

def get_ETFPCF_yuanta_2():
    try:
        session = HTMLSession()
        response = session.get(url)
        print(response.status_code)
        script = """
            document.querySelector('div.moreBtn').click()
        """
        response.html.render(sleep=8, wait=8, script=script, timeout=8, keep_page=True)
        # Parse the rendered HTML content
        soup = BeautifulSoup(response.html.html, 'html.parser')
        response.html.session.close()
        # Find the table body (a div with class "tbody")
        tbody = soup.find('div', class_='tbody')
        if tbody:
            # Extract all rows from the table body
            rows = tbody.find_all('div', class_='tr')
            # Process each row
            for row in rows:
                # Extract all cells (divs with class "td")
                cells = row.find_all('div', class_='td')
                row_data = [cell.find('span', class_='').text for cell in cells]
                print(row_data)
        else:
            print("No tbody element found on the page.")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == '__main__':
    get_ETFPCF_yuanta_2()
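For reference, the only check I could think of is to compare the number of rows before and after rendering; if the count doesn't change, the click presumably never took effect. This helper (`count_rows` is just a name I made up for debugging) is a sketch of that idea:

```python
from bs4 import BeautifulSoup

def count_rows(html: str) -> int:
    # Count the <div class="tr"> rows inside the <div class="tbody"> container
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("div", class_="tbody")
    return len(tbody.find_all("div", class_="tr")) if tbody else 0

# Usage idea: before = count_rows(response.html.html), then render,
# then after = count_rows(response.html.html) and compare the two.
```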
The page embeds its data in an inline script (`<script>window.__NUXT__=...`), so maybe you could try to extract the data from this part of the HTML instead. But this may need more complex code with text.find(), slicing, maybe regex. Or maybe you could run some JavaScript code which could extract it.
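A minimal sketch of the regex variant, assuming the payload sits in a single inline `<script>` tag ending with `</script>` (the function name `extract_nuxt_payload` and the exact script layout are my assumptions, not verified against the live page):

```python
import re

def extract_nuxt_payload(html: str):
    # Capture everything between "window.__NUXT__=" and the closing </script>
    # (assumes the assignment appears once in the raw, unrendered HTML)
    match = re.search(r"window\.__NUXT__\s*=\s*(.*?)</script>", html, re.DOTALL)
    return match.group(1).strip() if match else None

# Usage idea: payload = extract_nuxt_payload(response.text) on the plain
# session.get(url) response, with no render() step needed at all.
```

If this works, it would sidestep the flaky Chromium render entirely, since the data is already in the initial HTML.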