I would like to use Python to scrape all links on the Civil Procedure URL of the Montana Code Annotated, as well as all pages linked from that page, and eventually capture the substantive text at the final links. The problem is that the base URL links to Chapter pages, each Chapter page links to Part pages, and each Part page links to the sections whose text I want. So this is a "three deep" URL structure, and the URL naming convention does not use a simple sequential ending like 1, 2, 3, 4, etc.
I am new to Python, so I broke this down into steps.
FIRST, I used this to extract the text from a single URL with substantive text (i.e., three levels deep):
import requests
from bs4 import BeautifulSoup

url = 'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# the section text lives in the div with classes "mca-content mca-toc"
href_elem = soup.find('div', class_='mca-content mca-toc')
with open("Rsync_Test.txt", "w") as f:
    print(href_elem.text, "PAGE_END", file=f)
# the with block closes the file on exit, so no separate f.close() is needed
SECOND, I created a list of URLs and exported it to a .txt file:
import os
import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("http://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html")
soup = BeautifulSoup(html_page, "html.parser")
url_base = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/"

# drop the first two characters of each relative href before joining it to the base URL
for link in soup.findAll('a'):
    print(url_base + link.get('href')[2:])

os.chdir("/home/rsync/Downloads/")
with open("All_URLs.txt", "w") as f:
    for link in soup.findAll('a'):
        print(url_base + link.get('href')[2:], file=f)
# again, the with block closes the file, so no f.close() is needed
THIRD, I tried to scrape the text from the resulting URL list:
import os
import requests
from bs4 import BeautifulSoup
url_lst = [ 'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
]
for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    href_elem = soup.find('div', class_='mca-content mca-toc')
for link in url_lst:
    with open("Rsync_Test.txt", "w") as f:
        print(href_elem.text, "PAGE_END", file=f)
    f.close()
My plan was to put it all together into a single script (after figuring out how to extract the URLs that sit three levels below the base URL). But the third script does not print a separate page of text for each URL; it ends up writing only the text from the last URL. Any tips on how to fix the third script so it scrapes and prints the text from all 16 URLs produced by the second script would be welcome, as would ideas on how to pull this together into something less convoluted.
Why do you loop over url_lst twice? The first loop fetches each URL but keeps only the last result, and the second loop then reopens the same file in write mode on every pass, so only the last page's text ever survives.
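A minimal sketch of a single-loop version, assuming the same div class and output file as the third script (only the first two of the 16 section URLs are shown here; the full list goes in url_lst):

import requests
from bs4 import BeautifulSoup

# first two of the 16 section URLs from the second step; extend with the rest
url_lst = [
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
]

with open("Rsync_Test.txt", "w") as f:  # open the output file once
    for link in url_lst:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        href_elem = soup.find('div', class_='mca-content mca-toc')
        # write this page's text plus a marker before moving on to the next URL
        print(href_elem.text, "PAGE_END", file=f)

Because the fetch, parse, and write all happen inside one loop and the file is opened a single time, each page's text is written in order instead of the last page overwriting everything.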