
The following is a simple script I wrote in Python to scrape specific information from numerically ascending URLs. It works great, and I can see the results in Python IDLE.

from urllib.request import urlopen
from bs4 import BeautifulSoup

for i in range(35, 345):
    url = 'https://www.example.com/ID=' + str(i)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    information1 = soup.find(text='sam')
    information2 = soup.find(text='john')
    print(information1, information2, i)

So the results look like this:

None None 35
None None 36
None sam 37
john None 38
None None 39
....
None None 344

Now this is great and is exactly what I need, but I would like to improve my code by having execution stop at "john None 38", once everything I need has been found, so there won't be 300-plus unnecessary extra lines.

Now there are two things you should know. First, information1 and information2 will never be on the same webpage; they will always be on separate URLs. Second, information1 appeared before information2 in the output above, but the reverse could happen if I searched for different strings.

So the solution needs to account for the fact that information1 and information2 will appear in the results on different rows, and that either one could turn up first.

I'm really struggling to form the "if" logic for the conditions mentioned above. I'd appreciate any help. Thank you.

2 Answers

Initialize both results to None before the loop, only assign each one the first time it is found, and break once both are set:

# Default to None
information1 = None
information2 = None
for i in range(35, 345):
    ...  # the URL-building, fetching and parsing lines from the question go here
    # If already set, don't override
    information1 = information1 or soup.find(text='sam')
    # Same here
    information2 = information2 or soup.find(text='john')
    if information1 and information2:
        # We have both information1 and information2, so break out of the for loop
        break
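
This works because "or" short-circuits in Python: once information1 already holds a match, the soup.find call on the right-hand side is skipped and the stored value is kept, so each result is effectively written only once before the loop breaks.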

You can store your trackers outside the loop so they persist between iterations:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Trackers that persist across iterations
info1 = None
info2 = None

for i in range(35, 345):
    url = 'https://www.example.com/ID=' + str(i)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    information1 = soup.find(text='sam')
    information2 = soup.find(text='john')

    # Keep the first match and never overwrite it on later pages
    if information1 is not None and info1 is None:
        info1 = information1

    if information2 is not None and info2 is None:
        info2 = information2

    # Stop as soon as both strings have been found, in either order
    if info1 and info2:
        break

print('Information 1: {}'.format(info1))
print('Information 2: {}'.format(info2))
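
Not part of either answer, but worth noting: if any ID in the range has no page, urlopen raises an HTTPError and the loop stops with a traceback. Below is a minimal sketch of the same early-exit loop that skips missing pages; the try/except is an assumption about how you might want to handle them, not something from the original code.

import urllib.error
from urllib.request import urlopen
from bs4 import BeautifulSoup

info1 = None
info2 = None

for i in range(35, 345):
    url = 'https://www.example.com/ID=' + str(i)
    try:
        html = urlopen(url)
    except urllib.error.HTTPError:
        # Treat a missing page like a page without a match and move on
        continue
    soup = BeautifulSoup(html, "html.parser")
    info1 = info1 or soup.find(text='sam')
    info2 = info2 or soup.find(text='john')
    if info1 and info2:
        # Both strings found, possibly on different pages and in either order
        break

print('Information 1: {}'.format(info1))
print('Information 2: {}'.format(info2))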
