The following is a simple script I wrote in Python to scrape specific information from numerically ascending URLs. It works well, and I can see the results in Python's IDLE.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Check every page ID from 35 through 345.
for i in range(35, 346):
    url = 'https://www.example.com/ID=' + str(i)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    # find() returns None when the string is not on the page.
    information1 = soup.find(text='sam')
    information2 = soup.find(text='john')
    print(information1, information2, i)
The results look like this:
None None 35
None None 36
sam None 37
None john 38
None None 39
....
None None 345
This is what I need, but I would like to improve the code so that execution stops at "None john 38", once everything I am looking for has been found. That would avoid the 300-plus unnecessary additional lines.
There are two things you should know. First, information1 and information2 will never be on the same web page. They will always be on separate URLs. Second, information1 appeared before information2 in the results above, but the reverse is also possible if I change the strings to something else I'm looking for.
So the solution needs to account for the fact that information1 and information2 will appear in different rows of the results, and that either one could appear first.
I'm really struggling to write the "if" logic for the conditions above. I'd appreciate any help. Thank you.
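One possible way to express the stopping rule described above is to keep a set of the strings that have not turned up yet and leave the loop as soon as that set is empty. The sketch below is only an illustration under a few assumptions: `scan_until_all_found`, `search_site`, `search_fake`, and `fake_pages` are hypothetical names that do not appear in the original code, and the fake pages stand in for the real site so the logic can be tried offline. `string=` is the newer name (Beautiful Soup 4.4.0 and later) for the `text=` argument used in the question.

```python
from urllib.request import urlopen


def scan_until_all_found(targets, page_ids, search_page):
    """Scan pages in order and stop once every target has been seen once.

    search_page(page_id, targets) must return one hit (or None) per target.
    Returns the rows that would be printed: (hit1, hit2, ..., page_id).
    """
    remaining = set(targets)  # targets not seen on any page so far
    rows = []
    for page_id in page_ids:
        hits = search_page(page_id, targets)
        rows.append((*hits, page_id))
        # Remember every target that turned up on this page.
        remaining -= {t for t, hit in zip(targets, hits) if hit is not None}
        if not remaining:  # everything found, in whatever order
            break
    return rows


def search_site(page_id, targets):
    """Search one real page, the same way the loop in the question does."""
    # Imported lazily so the offline demo below runs without bs4 installed.
    from bs4 import BeautifulSoup
    html = urlopen('https://www.example.com/ID=' + str(page_id))
    soup = BeautifulSoup(html, "html.parser")
    return [soup.find(string=t) for t in targets]


# Hypothetical stand-in for the site: page 37 holds 'sam', page 38 holds 'john'.
fake_pages = {37: 'sam', 38: 'john'}


def search_fake(page_id, targets):
    """Offline replacement for search_site, backed by fake_pages."""
    return [t if fake_pages.get(page_id) == t else None for t in targets]


for row in scan_until_all_found(['sam', 'john'], range(35, 346), search_fake):
    print(*row)
# Prints:
# None None 35
# None None 36
# sam None 37
# None john 38
```

To run it against the real pages, pass `search_site` instead of `search_fake`. Because the set does not care about order, the loop stops at the same kind of row whether 'sam' or 'john' turns up first. If one of the strings never appears, it simply runs through the whole range as before.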