I'm trying to scrape articles from a website and would like to get the src of each article's image. I've made a few attempts, but my code doesn't fetch all of the src values.
I am using Selenium 3.141.0 with Python 3.7. There are four things I want to get for each article: the src of the image, the link to the full article, the headline, and the article snippet. I can scrape the rest successfully, but not the src. I want to put all of these details into a pandas DataFrame.
This is the HTML of the page I'm trying to scrape:
<article class="w29" data-minarticles="1.00">
  <a href="something.html">
    <figure class="left ">
      <span class="img-a is-loaded">
        <img alt="stock image" title="stock image" width="245" height="135" src="pic.JPG" class="">
        <noscript>
          "<img src="pic.JPG" alt="stock image" title="stock image" width="245" height="135" />"
        </noscript>
      </span>
    </figure>
    <h2>
      <span>
        Article Title
      </span>
    </h2>
    <p>
      "Article snippet"
    </p>
  </a>
  ::after
</article>
<article class="w29" data-minarticles="1.00">
  <a href="something2.html">
    <figure class="left ">
      <span class="img-a is-loaded">
        <img alt="stock image2" title="stock image2" width="245" height="135" src="pic2.JPG" class="">
        <noscript>
          "<img src="pic2.JPG" alt="stock image2" title="stock image2" width="245" height="135" />"
        </noscript>
      </span>
    </figure>
    <h2>
      <span>
        Article Title 2
      </span>
    </h2>
    <p>
      "Article snippet 2"
    </p>
  </a>
</article>
<article class="w29" data-minarticles="1.00">
  <a href="something3.html">
    <figure class="left ">
      <span class="img-a is-loaded">
        <img alt="stock image3" title="stock image3" width="245" height="135" src="pic3.JPG" class="">
        <noscript>
          "<img src="pic3.JPG" alt="stock image3" title="stock image3" width="245" height="135" />"
        </noscript>
      </span>
    </figure>
    <h2>
      <span>
        Article Title 3
      </span>
    </h2>
    <p>
      "Article snippet 3"
    </p>
  </a>
</article>
And this is my code:
import pandas as pd

# driver is an already-created Selenium webdriver instance; url is the page address
driver.get(url)

# get sub posts
sub_posts = driver.find_elements_by_class_name("w29")

# get details
sub_list = []
for post in sub_posts:
    # Get the link to the full article
    sub_source = post.find_element_by_tag_name('a').get_attribute('href')
    # Get the src of the post's image
    sub_photo = post.find_element_by_tag_name('img').get_attribute('src')
    # Get headline
    sub_headline = post.find_element_by_tag_name('h2').text
    # Get article snippet
    sub_snippet = post.find_element_by_tag_name('p').text
    sub_list.append([sub_photo, sub_source, sub_headline, sub_snippet])

post_df = pd.DataFrame(sub_list, columns=["photo", "source", "headline", "snippet"])
Below is what I have tried and the result I got in the DataFrame for each attempt. Only the line that gets the src of the post changes between attempts:
Attempt 1
sub_photo = post.find_element_by_tag_name('img').get_attribute('src')
Result of Attempt 1
For whatever reason, it scraped the first src and returned None for the rest of the articles.
photo     source           headline         snippet
pic.JPG   something.html   Article Title    Article Snippet
None      something2.html  Article Title 2  Article Snippet 2
None      something3.html  Article Title 3  Article Snippet 3
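One possible cause of the None values, which is only an assumption and is not confirmed by the markup shown, is that the images are lazy-loaded and their src is filled in only once they scroll into view. If that is the case, the <noscript> fallback still carries the image URL as text. A minimal sketch of pulling the first src out of such a fragment with the standard library follows; the names FirstImgSrc and first_img_src are hypothetical helpers, not part of Selenium:

```python
from html.parser import HTMLParser


class FirstImgSrc(HTMLParser):
    """Records the src attribute of the first <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(fragment):
    """Return the src of the first <img> in an HTML fragment, or None."""
    parser = FirstImgSrc()
    parser.feed(fragment)
    parser.close()
    return parser.src


# Fragment copied from the second article's <noscript> block
fragment = '<img src="pic2.JPG" alt="stock image2" title="stock image2" width="245" height="135" />'
print(first_img_src(fragment))
```

Inside the loop, the fragment could come from something like post.find_element_by_tag_name('noscript').get_attribute('innerHTML'), which should return the fallback markup as text when JavaScript is enabled. That behaviour would need to be checked on the actual page.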
Attempt 2
sub_photo = post.find_element_by_xpath('//*[@id="content"]/div[6]/div[1]/div[2]/article/a/figure/span/img').get_attribute('src')
Result of Attempt 2
It scraped the first src and returned that same first src for the rest of the articles.
photo     source           headline         snippet
pic.JPG   something.html   Article Title    Article Snippet
pic.JPG   something2.html  Article Title 2  Article Snippet 2
pic.JPG   something3.html  Article Title 3  Article Snippet 3
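A likely explanation for this result, based on how XPath works rather than on anything confirmed on the actual site: an expression that starts with // is evaluated from the document root, even when it is called on an element, so every post finds the same first image. The sketch below illustrates the difference between a root-scoped and an element-scoped search using the standard library's ElementTree on a simplified, well-formed stand-in for the markup (the PAGE string is hypothetical). It does not use Selenium.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page markup (hypothetical).
PAGE = """
<div id="content">
  <article class="w29"><a href="something.html"><figure><span><img src="pic.JPG" /></span></figure></a></article>
  <article class="w29"><a href="something2.html"><figure><span><img src="pic2.JPG" /></span></figure></a></article>
  <article class="w29"><a href="something3.html"><figure><span><img src="pic3.JPG" /></span></figure></a></article>
</div>
"""

root = ET.fromstring(PAGE)
articles = root.findall(".//article")

# Searching from the document root inside the loop ignores the current article,
# so every iteration finds the same first <img>.
from_root = [root.find(".//img").get("src") for _ in articles]

# Searching relative to each article returns that article's own <img>.
from_article = [article.find(".//img").get("src") for article in articles]

print(from_root)
print(from_article)
```

In Selenium, the analogous change would be to start the XPath with a dot so that it is relative to the current post, for example post.find_element_by_xpath('.//figure/span/img').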
Attempt 3
sub_photo = post.find_element_by_css_selector('a>figure>span>img').get_attribute('innerHTML')
Result of Attempt 3
It scraped the first innerHTML and returned that same first innerHTML for the rest of the articles.
photo      source           headline         snippet
\n<img...  something.html   Article Title    Article Snippet
\n<img...  something2.html  Article Title 2  Article Snippet 2
\n<img...  something3.html  Article Title 3  Article Snippet 3
This is what I'm looking for:
photo     source           headline         snippet
pic.JPG   something.html   Article Title    Article Snippet
pic2.JPG  something2.html  Article Title 2  Article Snippet 2
pic3.JPG  something3.html  Article Title 3  Article Snippet 3
I would appreciate it if someone could point me in the right direction. Thank you.