Python Web Scraping - Two Different Parent Class Names, Different Structures but same Child Class Names

Question

I'm trying to scrape the news articles from prnewswire.com. Each article is stored in a div with the class "row".

The problem for me is that some article previews have an image beside their title and description. Therefore, inside each "row" div, the child div has either the class "card" (with image) or "col-sm-12 card" (without image).

My current code is the following:

import requests
from bs4 import BeautifulSoup
import pandas

headers = {
    'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) ' +
        'Version/14.0.1 Safari/605.1.15'
}

articlelist = []


def getarticles(page):
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    prnewswire_articles = soup.find_all('div', {'class': 'col-sm-12 card'})

    for item in prnewswire_articles:
        prnewswire_article = {
            'page': page,
            'article_title': item.find('h3').text,
            'article_link': 'https://www.prnewswire.com/' +
                            item.find('a')['href'],
            'article_description': item.find('p').text,
        }
        articlelist.append(prnewswire_article)
    return

for x in range(1, 3):
    getarticles(x)

df = pandas.DataFrame(articlelist)
print(df.head())
print(len(df))
df.to_excel('PRNewsWire.xlsx', index=False)
print('Finished.')

I have discovered the following: In the line where I declare "prnewswire_articles" and look for a div with a certain class name, I get the results I want with the class "col-sm-12 card". But "card" or "row" doesn't work.

I noticed that the HTML structure of the "card" divs is different from that of the "col-sm-12 card" divs, but both contain one "h3" element (the article's title), one "a" element with an href, and one "p" element.

This is the error message I get when using "row" or "card" as class name:

Traceback (most recent call last):
  File "/Users/myname/PycharmProjects/projectname/prnewswire.py", line 33, in <module>
    getarticles(x)
  File "/Users/myname/PycharmProjects/projectname/prnewswire.py", line 24, in getarticles
    'article_title': item.find('h3').text,
AttributeError: 'NoneType' object has no attribute 'text'
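
The error can be reproduced offline. The following is a minimal sketch on a made-up HTML fragment (an assumption, not the real prnewswire.com markup) in which the image variant keeps its "h3" outside the "card" div. Searching for the class "card" matches every div whose class list contains "card", so both variants come back, and find('h3') returns None for the one that has no heading inside it. Searching for "col-sm-12 card" matches only the exact class string.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="card-list">
  <div class="row">
    <div class="col-sm-12 card"><h3>Title without image</h3><p>Text A</p></div>
  </div>
  <div class="row">
    <div class="card"><img src="x.png"></div>
    <div class="col-sm-8"><h3>Title with image</h3><p>Text B</p></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact class string: only the variant without an image matches.
exact = soup.find_all("div", {"class": "col-sm-12 card"})

# Single class name: any div that has "card" among its classes matches.
any_card = soup.find_all("div", {"class": "card"})

# find() returns None when no <h3> exists inside the matched div,
# and calling .text on None raises the AttributeError from the traceback.
missing = [item.find("h3") is None for item in any_card]

print(len(exact), len(any_card), missing)
```

Checking the result of find() for None before reading .text is one way to see which matched elements lack the expected child.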

I've searched for a whole day and didn't find anything. I only recently started learning Python, so I'm sorry if this is a stupid mistake, but I've run out of ideas. I would really appreciate any help! :)

FYI "scrapping" means throwing away, as you do with rubbish. The correct term is scraping.
– DisappointedByUnaccountableMod, Commented Mar 8, 2021 at 15:41
Oh, I overlooked that, sorry. It was already changed when I wanted to edit it. English is only my second language, so I didn't know better :)
– Niklas Klotz, Commented Mar 8, 2021 at 15:49

Andrej Kesely · Accepted Answer · 2021-03-08 15:33:33Z

You can select all .row elements that are direct children of the .card-list element (using a CSS selector). I also changed how the article title is extracted (it takes just the text following the <small> element):

import pandas
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) ' +
        'Version/14.0.1 Safari/605.1.15'
}

articlelist = []


def getarticles(page):
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    prnewswire_articles = soup.select('.card-list > .row')    # <-- select all rows that are under class "card-list"

    for item in prnewswire_articles:
        prnewswire_article = {
            'page': page,
            'article_title': item.select_one('h3 small').find_next_sibling(text=True).strip(),   # <--- select text that is after <small> element
            'article_link': 'https://www.prnewswire.com/' +
                            item.find('a')['href'],
            'article_description': item.find('p').get_text(strip=True, separator='\n'),
        }
        articlelist.append(prnewswire_article)

for x in range(1, 3):
    getarticles(x)

df = pandas.DataFrame(articlelist)
print(df)

Prints:

...

186     2  Wugen Announces Exclusive Partnership Agreemen...  https://www.prnewswire.com//news-releases/wuge...  Wugen Inc., a clinical-stage biotechnology com...
187     2  Upstryve Initiates Mentor Network for Trade St...  https://www.prnewswire.com//news-releases/upst...  Upstryve Inc (Upstryve) www.upstryve.com. Upst...
188     2  Inkling Simplifies Integration to Learning and...  https://www.prnewswire.com//news-releases/inkl...  Inkling, a global leader in digital learning p...
189     2  CommerceHub to Participate in Bank of America'...  https://www.prnewswire.com//news-releases/comm...  CommerceHub, a leading provider of ecommerce s...
190     2        Instrument Promotes Kara Place to President  https://www.prnewswire.com//news-releases/inst...  Instrument, a digitally focused, creative agen...
191     2   Regent Properties Announces Executive Promotions  https://www.prnewswire.com//news-releases/rege...  Regent Properties ("Regent"), a real estate in...
192     2  PowerPay Hits $1 Billion in Home Renovations L...  https://www.prnewswire.com//news-releases/powe...  PowerPay, the nation's fastest-growing home im...
193     2  GoldenTree Announces Closing of $698 Million C...  https://www.prnewswire.com//news-releases/gold...  GoldenTree Loan Management II ("GLM II") and i...
194     2  The Real-Time Moving Show on the Screen "Showt...  https://www.prnewswire.com//news-releases/the-...  EnableWow (www.showtap.com) launched a new pre...
195     2  Black Knight: Lock Activity Suggests Q1 2021 R...  https://www.prnewswire.com//news-releases/blac...  Today, the Data & Analytics division of Black ...
196     2  LG Innotek Joins Hands with Microsoft to Proli...  https://www.prnewswire.com//news-releases/lg-i...  LG Innotek (CEO Cheoldong Jeong) announced on ...
197     2  MicroWorkers Integrates Ontology's ONTO Wallet...  https://www.prnewswire.com//news-releases/micr...  To bridge micro workers globally, Ontology and...
198     2     MemVerge Introduces M3 Channel Partner Program  https://www.prnewswire.com//news-releases/memv...  MemVerge™, the pioneers of Big Memory software...
199     2  Innovative Deals Spur the Growth of New Sports...  https://www.prnewswire.com//news-releases/inno...  Last year has shaped up to be crucial for the ...
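
The selector approach can also be tried without hitting the site. This is a minimal sketch, assuming a made-up HTML fragment (the real markup may differ) and a hypothetical helper named parse_rows. It limits the search to rows directly under .card-list, so an unrelated "row" elsewhere on the page is ignored, takes the title from the text after <small>, and builds the link with urllib.parse.urljoin, which avoids the double slash visible in the links above when the href already starts with "/".

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.prnewswire.com/"

# Hypothetical, simplified markup: both card variants plus an unrelated
# "row" outside the article list.
SAMPLE_HTML = """
<div class="row"><p>Site navigation, no heading here</p></div>
<div class="card-list">
  <div class="row">
    <div class="col-sm-12 card">
      <a href="/news-releases/a.html"><h3><small>09:00 ET</small> Title A</h3></a>
      <p>Description A</p>
    </div>
  </div>
  <div class="row">
    <div class="card"><img src="x.png"></div>
    <div class="col-sm-8">
      <a href="/news-releases/b.html"><h3><small>09:05 ET</small> Title B</h3></a>
      <p>Description B</p>
    </div>
  </div>
</div>
"""


def parse_rows(html):
    """Return one dict per article row found directly under .card-list."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for row in soup.select(".card-list > .row"):
        small = row.select_one("h3 small")
        link = row.find("a", href=True)
        desc = row.find("p")
        # Skip rows that lack any expected element instead of raising.
        if small is None or link is None or desc is None:
            continue
        title_node = small.find_next_sibling(string=True)
        if title_node is None:
            continue
        articles.append({
            "article_title": title_node.strip(),
            # urljoin handles the leading "/" of the href correctly.
            "article_link": urljoin(BASE_URL, link["href"]),
            "article_description": desc.get_text(strip=True),
        })
    return articles


articles = parse_rows(SAMPLE_HTML)
for article in articles:
    print(article)
```

The None checks are a design choice for this sketch: a row that does not match the expected layout is skipped silently, which may or may not be what you want when the site changes its markup.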

Can't say more other than thank you. It worked. You can't believe how much time and energy you've saved me <3
