
I'm trying to scrape articles from a website and would like to get the src of each article's image. I've made a few attempts, but my code can't seem to fetch all of the src values.

I am using Selenium 3.141.0 with Python 3.7. There are 4 things that I want to get: src of images, link to full article, headline, article snippet. I can scrape the rest successfully but not the src. I want to dump all of these details into a pandas dataframe.

This is the HTML of the page I'm trying to scrape.

<article class="w29" data-minarticles="1.00">
    <a href="something.html">
        <figure class="left ">
            <span class="img-a is-loaded">
                <img alt="stock image" title="stock image" width="245" height="135" src="pic.JPG" class="">
                <noscript>
                  "<img src="pic.JPG" alt="stock image" title="stock image" width="245" height="135" />"
                </noscript>
             </span>
          </figure>
        <h2>
            <span>
            Article Title
            </span>
        </h2>
        <p>
          "Article snippet"
        </p>
      </a>
      ::after
</article>
<article class="w29" data-minarticles="1.00">
    <a href="something2.html">
        <figure class="left ">
            <span class="img-a is-loaded">
                <img alt="stock image2" title="stock image2" width="245" height="135" src="pic2.JPG" class="">
                <noscript>
                  "<img src="pic2.JPG" alt="stock image2" title="stock image2" width="245" height="135" />"
                </noscript>
             </span>
          </figure>
        <h2>
            <span>
            Article Title 2
            </span>
        </h2>
        <p>
          "Article snippet 2"
        </p>
      </a>
</article>
<article class="w29" data-minarticles="1.00">
    <a href="something3.html">
        <figure class="left ">
            <span class="img-a is-loaded">
                <img alt="stock image3" title="stock image3" width="245" height="135" src="pic3.JPG" class="">
                <noscript>
                  "<img src="pic3.JPG" alt="stock image3" title="stock image3" width="245" height="135" />"
                </noscript>
             </span>
          </figure>
        <h2>
            <span>
            Article Title 3
            </span>
        </h2>
        <p>
          "Article snippet 3"
        </p>
      </a>
</article>

And this is my code:

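import pandas as pd  # needed for pd.DataFrame below

# driver is an already-created Selenium WebDriver; url is the page to scrape
driver.get(url)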

# get sub posts
sub_posts = driver.find_elements_by_class_name("w29")

# get details
sub_list = []
for post in sub_posts:
    # Get the link to the full article
    sub_source = post.find_element_by_tag_name('a').get_attribute('href')
    # Get the src of the post's image
    sub_photo = post.find_element_by_tag_name('img').get_attribute('src')
    # Get headline
    sub_headline = post.find_element_by_tag_name('h2').text
    # Get article snippet
    sub_snippet = post.find_element_by_tag_name('p').text
    sub_list.append([sub_photo, sub_source, sub_headline, sub_snippet])

post_df = pd.DataFrame(sub_list, columns=["photo", "source", "headline", "snippet"])

Here is what I have tried and the result I got in the dataframe for each attempt. Only the line of code that gets the src of the image changes:

Attempt 1

sub_photo = post.find_element_by_tag_name('img').get_attribute('src')

Result of Attempt 1

For whatever reason, it scraped the first src and returned None for the rest of the articles.

photo      source           headline         snippet
pic.JPG    something.html   Article Title    Article Snippet
None       something2.html  Article Title 2  Article Snippet 2
None       something3.html  Article Title 3  Article Snippet 3

Attempt 2

sub_photo = post.find_element_by_xpath('//*[@id="content"]/div[6]/div[1]/div[2]/article/a/figure/span/img').get_attribute('src')

Result of Attempt 2

It scraped the first src and returned that same first src for the rest of the articles.

photo      source           headline         snippet
pic.JPG    something.html   Article Title    Article Snippet
pic.JPG    something2.html  Article Title 2  Article Snippet 2
pic.JPG    something3.html  Article Title 3  Article Snippet 3

Attempt 3

sub_photo = post.find_element_by_css_selector('a>figure>span>img').get_attribute('innerHTML')

Result of Attempt 3

It scraped the first innerHTML and returned that same first innerHTML for the rest of the articles.

photo       source           headline         snippet
\n<img...   something.html   Article Title    Article Snippet
\n<img...   something2.html  Article Title 2  Article Snippet 2
\n<img...   something3.html  Article Title 3  Article Snippet 3

This is what I'm looking for:

photo      source           headline         snippet
pic.JPG    something.html   Article Title    Article Snippet
pic2.JPG   something2.html  Article Title 2  Article Snippet 2
pic3.JPG   something3.html  Article Title 3  Article Snippet 3

I would appreciate it if someone could point me in the right direction. Thank you.

3
  • Do you need to scroll the page down to see all the images? Commented Feb 9, 2019 at 8:31
  • Is there a URL we can use? Commented Feb 9, 2019 at 8:33
  • @JaSON yes I have to scroll down Commented Feb 9, 2019 at 8:33

2 Answers

2

Initially only a couple of images are rendered, so you can either scroll the page to the bottom before extracting all the @src values, or you can extract @src (for visible images) OR @data-src (for hidden images):

img = post.find_element_by_tag_name('img')  # look the element up once
sub_photo = img.get_attribute('src') or img.get_attribute('data-src')

This returns the value of @src if it is set. If @src is None (or any other falsy value, such as an empty string), Python's or operator falls through and returns the value of @data-src instead.
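The other option mentioned above, scrolling, could be sketched roughly as follows. This is only an illustration: image_src_after_scroll and its pause parameter are hypothetical names, and it assumes the page fills in src once an image is scrolled into view, which has not been verified for this site. It uses find_element_by_tag_name to match the Selenium 3.141.0 API from the question and keeps the data-src fallback in case the image still has not loaded.

```python
import time


def image_src_after_scroll(driver, post, pause=0.5):
    """Scroll a post's image into view, wait briefly, then read its URL.

    Hypothetical helper: 'pause' is a guess at how long the page's
    lazy loader needs; tune it for the real site.
    """
    img = post.find_element_by_tag_name('img')
    # Ask the browser to bring the image into the viewport so that a
    # lazy loader (assumed to exist on the page) can populate src.
    driver.execute_script("arguments[0].scrollIntoView();", img)
    time.sleep(pause)
    # Fall back to data-src if src is still empty after the wait.
    return img.get_attribute('src') or img.get_attribute('data-src')
```

Inside the loop from the question this would be called as sub_photo = image_src_after_scroll(driver, post). A fixed sleep is the simplest thing that could work; an explicit wait on the attribute would be more robust but is more code.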


1 Comment

Thanks @JaSON. I didn't know about @data-src!
1

For the first post the image URL is in the src attribute, but for the remaining posts it is in data-src instead (as the output of your code shows). See the following, for example:

for post in sub_posts:
    ele = post.find_element_by_tag_name('img')
    # Prefer data-src when present; otherwise fall back to src
    data_src = ele.get_attribute('data-src')
    val = data_src if data_src is not None else ele.get_attribute('src')
    print(val)
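One detail worth seeing side by side: the loop above checks data-src first, while the or one-liner from the other answer checks src first. The sketch below compares the two orders using a stand-in object instead of a real WebElement. StubImage, src_first and data_src_first are hypothetical names, and the placeholder.gif case is an assumption about how some lazy-loading pages behave, not something observed on this site.

```python
class StubImage:
    """Minimal stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, attrs):
        self._attrs = attrs

    def get_attribute(self, name):
        # Like Selenium, return None when the attribute is missing.
        return self._attrs.get(name)


def src_first(img):
    # Order used by the 'or' one-liner: src, then data-src.
    return img.get_attribute('src') or img.get_attribute('data-src')


def data_src_first(img):
    # Order used by the ternary version: data-src, then src.
    data_src = img.get_attribute('data-src')
    return data_src if data_src is not None else img.get_attribute('src')


# Case from the question: src is missing, data-src holds the URL.
hidden = StubImage({'data-src': 'pic2.JPG'})
print(src_first(hidden), data_src_first(hidden))   # pic2.JPG pic2.JPG

# Assumed case: src holds a placeholder until the real image loads.
lazy = StubImage({'src': 'placeholder.gif', 'data-src': 'pic2.JPG'})
print(src_first(lazy), data_src_first(lazy))       # placeholder.gif pic2.JPG
```

For the page in the question both orders give the same result, since src is None for the images that have not loaded; the order only matters if a page puts a placeholder in src.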

3 Comments

Actually, the ternary conditional operator implementation is just a more complicated version of my code with the or operator
@JaSON I am new to Python and wasn't aware I could shortcut with or. +1 for you
In the expression a = 0 or None or [] or "" or 1, the first value that evaluates to True is assigned (so a will be 1)
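A short, self-contained illustration of that behaviour, for anyone who wants to try it in an interpreter. The variable names are arbitrary; nothing here depends on Selenium.

```python
# 'or' returns the first truthy operand and stops evaluating there.
a = 0 or None or [] or "" or 1
print(a)        # 1

# If every operand is falsy, the last operand is returned as-is.
b = 0 or None or [] or ""
print(repr(b))  # ''

# This is why src-or-data-src works: None is falsy, so the second
# operand is used; anything after the first truthy value is ignored.
c = None or "pic2.JPG" or "never reached"
print(c)        # pic2.JPG
```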
