
Disclaimer: This is my first foray into web scraping

I have a list of URLs corresponding to search results, e.g.,

http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662

I'm trying to use Selenium to access the HTML of the result as follows:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())

However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:

  1. http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662

  2. https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662

  3. https://www.vinelink.com/#/searchResults/1

Does anyone have a tip on how to access the final search results data?

Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
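A quick way to confirm what the rendered DOM actually contains is to search the soup for the element directly. A minimal sketch, assuming (per the Inspect panel) that the payload lives in a `<search-result>` tag; the static HTML string here is a stand-in for `driver.page_source`:

```python
from bs4 import BeautifulSoup

# Static snippet standing in for driver.page_source, to illustrate the check
html = '<search-result data=\'{"oid": "2662"}\'></search-result>'
soup = BeautifulSoup(html, 'html.parser')

results = soup.find_all('search-result')
if not results:
    print('No <search-result> tags -- the page has not fully rendered yet')
else:
    print('Found', len(results), 'result(s);', results[0]['data'])
```

If `find_all` comes back empty on the real page source, the scripts have not run yet and the scrape happened too early.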

I am able to see the information I need when I Inspect the page in the browser (screenshot of the element in DevTools omitted).

  • What are the components that you are trying to access (but not available)? Selenium should load all JavaScript before returning the HTML object to be parsed by BeautifulSoup. Commented Nov 23, 2018 at 4:30
  • Hi Joseph, I'm trying to access <search-result> tags from the final destination page. Per my question, if I enter one of the original URLs into my Chrome address bar, the page loads sequentially and the URL changes twice before landing on '/#/searchResults/1' (the same URL no matter which offender is searched). Any idea how to ensure Selenium does not pull data from the first URL in the series of redirects? Commented Nov 23, 2018 at 5:32
  • When I try to connect to the link provided, I get redirected to an unauthorized page (vinelink.com/#/unauthorized). From my experience and testing, lines after driver.get(url) are only executed after the browser has finished loading; Selenium is designed to emulate the browsing experience of a human user. Can you confirm that the HTML you receive from driver.page_source is different from what you see when browsing yourself? Commented Nov 23, 2018 at 6:33
  • Try calling soup.find_all("search-result") to confirm that you are not getting the data you need Commented Nov 23, 2018 at 6:35
  • Hi Joseph, based on what you've written, I found that I could actually bypass the driver.page_source call, and instead insert driver.implicitly_wait(5) before beginning to scrape data (this allows sufficient time for the browsing emulation to reach the destination page). Thank you very much! I now have a new problem (reCAPTCHA prevents me from collecting data from more than a few of the URLs in my list), but I will create a separate question for this! Commented Nov 23, 2018 at 7:12
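The fix described in the last comment boils down to telling Selenium to wait before reading `page_source`. A minimal sketch of that loop, assuming a `detail_urls` list and an already-created `driver` (`implicitly_wait` makes the driver poll for up to the given number of seconds when locating elements, which the asker reports was enough for the redirects to settle):

```python
def scrape_pages(driver, detail_urls, wait_seconds=5):
    """Load each URL, giving the redirects time to settle before scraping."""
    pages = []
    # Poll for up to wait_seconds when locating elements, so the
    # page_source reads below see the final, rendered page.
    driver.implicitly_wait(wait_seconds)
    for url in detail_urls:
        driver.get(url)
        pages.append(driver.page_source)
    return pages
```

With a real browser this would be `driver = webdriver.Chrome()` followed by `scrape_pages(driver, detail_urls)`. A `WebDriverWait` on the specific `search-result` element would be a more targeted alternative to the blanket implicit wait.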

1 Answer


Once you have your soup variable with the HTML, the data can be pulled from the data attribute of the <search-result> tag:

import json

# The <search-result> element stores its payload in a 'data' attribute
data = soup.find('search-result')['data']
print(data)

Output (a JSON string; once parsed, you can treat it like a dict):

{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}

Next:

info = json.loads(data)

print(info['first_name'], info['last_name'])

# This prints the first and last name, but you can get other fields with
# keys like 'date_of_birth' or 'siteId', or assign them to variables.
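Putting the parsing step together with the sample payload above, here is a small self-contained sketch (the JSON is trimmed to a few of the fields shown in the output):

```python
import json

# A trimmed version of the 'data' attribute shown above
data = ('{"first_name": "WESLEY", "last_name": "ADAMS", '
        '"date_of_birth": "1965-11-21", "siteId": 34003, '
        '"custody_status_description": "In Custody"}')

info = json.loads(data)

print(info['first_name'], info['last_name'])   # WESLEY ADAMS
print(info['date_of_birth'])                   # 1965-11-21
print(info['custody_status_description'])      # In Custody
```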

2 Comments

Thanks for the suggestion! I think because of something related to the dynamic URL creation process for the search results, I am not ending up with the correct HTML for the eventual destination, so the soup variable does not include the correct HTML and I get the following error: TypeError: 'NoneType' object is not subscriptable. Is there some way to get Selenium to walk through the URL change process and pull the correct page?
I have actually now solved things per Joseph Choi's suggestion in the comments; I merely inserted a driver.implicitly_wait(5) after loading the original URL, and then the further Selenium commands do not begin until the final destination of the redirect has been reached.
