
Disclaimer: This is my first foray into web scraping

I have a list of URLs corresponding to search results, e.g.,

http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662

I'm trying to use Selenium to access the HTML of the result as follows:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())

However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:

  1. http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662

  2. https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662

  3. https://www.vinelink.com/#/searchResults/1

Does anyone have a tip on how to access the final search results data?

Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
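A quick way to confirm what the rendered DOM actually contains is to search the soup for the element directly. A minimal sketch, assuming (per the Inspect panel) that the payload lives in a `<search-result>` tag; the static HTML string here is a stand-in for `driver.page_source`:

```python
from bs4 import BeautifulSoup

# Static snippet standing in for driver.page_source, to illustrate the check
html = '<search-result data=\'{"oid": "2662"}\'></search-result>'
soup = BeautifulSoup(html, 'html.parser')

results = soup.find_all('search-result')
if not results:
    print('No <search-result> tags -- the page has not fully rendered yet')
else:
    print('Found', len(results), 'result(s);', results[0]['data'])
```

If `find_all` comes back empty on the real page source, the scripts have not run yet and the scrape happened too early.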

I am able to see the information I need when I Inspect the page in the browser (screenshot of the element in DevTools omitted).

  • What are the components that you are trying to access (but not available)? Selenium should load all JavaScript before returning the HTML object to be parsed by BeautifulSoup. Commented Nov 23, 2018 at 4:30
  • Hi Joseph, I'm trying to access <search-result> tags from the final destination page. Per my question, if I enter one of the original URLs into my Chrome address bar, the page loads sequentially and the URL changes twice before landing on '/#/searchResults/1' (the same URL no matter which offender is searched). Any idea how to ensure Selenium does not pull data from the first URL in the series of redirects? Commented Nov 23, 2018 at 5:32
  • When I try to connect to the link provided, I get redirected to an unauthorized page (vinelink.com/#/unauthorized). From my experience and testing, lines after driver.get(url) are only executed after the browser has finished loading; Selenium is designed to emulate the browsing experience of a human user. Can you confirm that the HTML you receive from driver.page_source is different from what you see when browsing yourself? Commented Nov 23, 2018 at 6:33
  • Try calling soup.find_all("search-result") to confirm that you are not getting the data you need Commented Nov 23, 2018 at 6:35
  • Hi Joseph, based on what you've written, I found that I could actually bypass the driver.page_source call, and instead insert driver.implicitly_wait(5) before beginning to scrape data (this allows sufficient time for the browsing emulation to reach the destination page). Thank you very much! I now have a new problem (reCAPTCHA prevents me from collecting data from more than a few of the URLs in my list), but I will create a separate question for this! Commented Nov 23, 2018 at 7:12
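The fix described in the last comment boils down to telling Selenium to wait before reading `page_source`. A minimal sketch of that loop, assuming a `detail_urls` list and an already-created `driver` (`implicitly_wait` makes the driver poll for up to the given number of seconds when locating elements, which the asker reports was enough for the redirects to settle):

```python
def scrape_pages(driver, detail_urls, wait_seconds=5):
    """Load each URL, giving the redirects time to settle before scraping."""
    pages = []
    # Poll for up to wait_seconds when locating elements, so the
    # page_source reads below see the final, rendered page.
    driver.implicitly_wait(wait_seconds)
    for url in detail_urls:
        driver.get(url)
        pages.append(driver.page_source)
    return pages
```

With a real browser this would be `driver = webdriver.Chrome()` followed by `scrape_pages(driver, detail_urls)`. A `WebDriverWait` on the specific `search-result` element would be a more targeted alternative to the blanket implicit wait.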

1 Answer


Once you have your soup variable with the HTML, the data can be pulled from the data attribute of the <search-result> tag:

import json

# The <search-result> element stores its payload in a 'data' attribute
data = soup.find('search-result')['data']
print(data)

Output (a JSON string; once parsed, you can treat it like a dict):

{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}

Next:

info = json.loads(data)

print(info['first_name'], info['last_name'])

# This prints the first and last name, but you can get other fields with
# keys like 'date_of_birth' or 'siteId', or assign them to variables.
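Putting the parsing step together with the sample payload above, here is a small self-contained sketch (the JSON is trimmed to a few of the fields shown in the output):

```python
import json

# A trimmed version of the 'data' attribute shown above
data = ('{"first_name": "WESLEY", "last_name": "ADAMS", '
        '"date_of_birth": "1965-11-21", "siteId": 34003, '
        '"custody_status_description": "In Custody"}')

info = json.loads(data)

print(info['first_name'], info['last_name'])   # WESLEY ADAMS
print(info['date_of_birth'])                   # 1965-11-21
print(info['custody_status_description'])      # In Custody
```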

2 Comments

Thanks for the suggestion! I think because of something related to the dynamic URL creation process for the search results, I am not ending up with the correct HTML for the eventual destination, so the soup variable does not include the correct HTML and I get the following error: TypeError: 'NoneType' object is not subscriptable. Is there some way to get Selenium to walk through the URL change process and pull the correct page?
I have actually now solved things per Joseph Choi's suggestion in the comments; I merely inserted a driver.implicitly_wait(5) after loading the original URL, and then the further Selenium commands do not begin until the final destination of the redirect has been reached.
