I am trying to use Beautiful Soup to scrape images from multiple URLs, and then write each URL and its images to a single file. The file would look like:
TEXT OF URL_1
img_1 (ACTUAL IMAGE SHOWN)
img_2 (ACTUAL IMAGE SHOWN)
TEXT OF URL_2
img_1 (ACTUAL IMAGE SHOWN)
The first few lines of my output file right now look like:
Company : Firehydrant URL : https://www.firehydrant.io/âPNG
IHDRLf9ÃŒ∫ pHYsöúYiTXtXML:com.adobe.xmp<?xpacket begin="Ôªø" id="W5M0MpCehiHzreSzNTczkc9d"?> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c148 79.164036, 2019/08/13-01:06:57
...
How can I view my file with the images actually showing instead of the binary? Or is there a different way of doing this? Sorry if this is a really stupid question!!
Here is my current code for one website:
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# file_name is assumed to be defined earlier in the script
with open(file_name, 'wb') as img_file:
    option = webdriver.ChromeOptions()
    option.add_argument("--incognito")
    browser = webdriver.Chrome(executable_path='./chromedriver', chrome_options=option)
    url = 'https://www.firehydrant.io/'
    browser.get(url)
    timeout = 10
    WebDriverWait(browser, timeout)  # no .until(...) condition, so this does not actually wait
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    images = soup.find_all("img")
    found_first_image = False
    for image in images:
        src = image['src']
        if not found_first_image:  # add the text for the company/URL once
            found_first_image = True
            string = ("URL : " + url).encode('utf-8')
            img_file.write(string)
        # resolve relative paths, then remove everything after the '?' if there is one in the src
        src = urljoin(url, src)
        if "?" in src:
            pos = src.index("?")
            src = src[:pos]
        parsed = urlparse(src)
        if parsed.netloc and parsed.scheme:
            # download the image and write its raw bytes to the file
            response = requests.get(src)
            img_file.write(response.content)
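
One direction that might get closer to the layout described above is to write an HTML file instead of raw bytes. A file that mixes UTF-8 text with PNG data has no format a viewer can render, whereas a browser will display `<img>` tags. The following is only a minimal sketch under assumptions: `clean_image_url`, `build_report`, the `report.html` name, and the sample `src` values are hypothetical and not part of my script. In practice the `src` values would come from `soup.find_all("img")`.

```python
from html import escape
from urllib.parse import urljoin, urlsplit, urlunsplit


def clean_image_url(page_url, src):
    """Resolve src against the page URL and drop any query string or fragment."""
    parts = urlsplit(urljoin(page_url, src))
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None  # skips data: URIs and anything else that cannot be linked
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def build_report(pages):
    """Build an HTML page; pages maps each page URL to the raw src values found on it."""
    lines = ["<!DOCTYPE html>", "<html>", "<body>"]
    for page_url, sources in pages.items():
        # One heading per scraped page, followed by that page's images.
        lines.append("<h2>URL : {}</h2>".format(escape(page_url)))
        for src in sources:
            image_url = clean_image_url(page_url, src)
            if image_url is not None:
                lines.append('<img src="{}">'.format(escape(image_url)))
    lines += ["</body>", "</html>"]
    return "\n".join(lines)
```

A possible usage, again with made-up values: call `build_report({"https://www.firehydrant.io/": ["/logo.png?v=2"]})`, write the returned string to `report.html` with `open("report.html", "w", encoding="utf-8")`, and open that file in a browser. This sketch links to the remote images rather than storing them. Saving each `response.content` to its own image file and pointing the `<img>` tags at those local files would be another option if the report needs to work offline. Note that stripping the query string mirrors my code above, and some image URLs may stop working without it.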