0

I am trying to use beautiful soup to scrape images from multiple URLs, and then write the URLs and the images to a file. The file format would look like:

TEXT OF URL_1

img_1 (ACTUAL IMAGE SHOWN)

img_2 (ACTUAL IMAGE SHOWN)

TEXT OF URL_2

img_1 (ACTUAL IMAGE SHOWN)

The first few lines of my output file right now looks like:

Company : Firehydrant     URL : https://www.firehydrant.io/âPNG


IHDRLf9ÃŒ∫   pHYsöúYiTXtXML:com.adobe.xmp<?xpacket begin="Ôªø" id="W5M0MpCehiHzreSzNTczkc9d"?> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c148 79.164036, 2019/08/13-01:06:57  

...

How can I view my file with the images actually showing instead of the binary? Or is there a different way of doing this? Sorry if this is a really stupid question!!

Here is my code right now for 1 website:


with open(file_name, 'wb') as img_file:

    option = webdriver.ChromeOptions()
    option.add_argument(" — incognito")
    browser = webdriver.Chrome(executable_path='./chromedriver', chrome_options=option)

    url = 'https://www.firehydrant.io/'

    browser.get(url)
    timeout = 10
    WebDriverWait(browser, timeout)

    soup = BeautifulSoup(browser.page_source, 'html.parser')
    images = soup.find_all("img")

    found_first_image = False
    for image in images:
        src = image['src']
        if(found_first_image == False): # ADD THE TEXT FOR THE COMPANY/URL
            found_first_image = True
            string = ("URL : " + url).encode('utf-8') 
            img_file.write(string)

        # removing everything after the '?' if there is one in the src tag
        src = urljoin(url, src)
        if("?" in src):
            pos = src.index("?")
            src = src[:pos]
        parsed = urlparse(src)
        if(bool(parsed.netloc) and bool(parsed.scheme)): # download the image and write it to the file
            response = requests.get(src)
            URLFile.write(response.content)
4
  • 1
    How do you want to actually view the file? The file might be better structured as json with base64 encoding for the images embedded in a json value or store the files separately and zip all the files up for distribution Commented May 29, 2020 at 8:42
  • 1
    The way you need to think about this: files don't actually contain images or text or anything else. They contain bytes - raw data. It's up to the program you use to view the file, to decide what that data actually means, and display something appropriate. When you view, for example, a .html file in a text editor, it looks different from what you see in a web browser - because those differing interpretations are forced onto the same data by those programs. Commented May 29, 2020 at 8:46
  • I was hoping to just view it in text edit - there are thousands of websites being crawled, and I don't want to store all of the images in various folders/directory and click through one-by-one. If I don't write the text I am printing to my file (the url/company name) and just write the images and then open it in text edit, I can see the first image visibly but not any of the rest. And when I keep the text mixed with the images, I see all of the text and images as symbols when opening in text edit. Commented May 30, 2020 at 6:00
  • @jbug123 have you found a solution? Commented Nov 4, 2022 at 12:13

1 Answer 1

1

Store your images as image files outside rather than putting them here as text. Just keep a unique id for the image when you write it and put that unique id in your text file instead of the image. For saving image, cv2.imread can be used and for generating unique number use:

import uuid

uuid.uuid1().hex
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you Abhishek, though I would prefer to not save each image as a file. Is there a workaround to my original question/code?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.