I'm trying to scrape the news articles from prnewswire.com. Each article is stored in a div with the class "row".

The problem is that some article previews have an image beside their title and description. Therefore, the div nested under a "row" div has either the class name "card" (with image) or "col-sm-12 card" (without image).

My current code is the following:

import requests
from bs4 import BeautifulSoup
import pandas

headers = {
    'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) ' +
        'Version/14.0.1 Safari/605.1.15'
}

articlelist = []


def getarticles(page):
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    prnewswire_articles = soup.find_all('div', {'class': 'col-sm-12 card'})

    for item in prnewswire_articles:
        prnewswire_article = {
            'page': page,
            'article_title': item.find('h3').text,
            'article_link': 'https://www.prnewswire.com/' +
                            item.find('a')['href'],
            'article_description': item.find('p').text,
        }
        articlelist.append(prnewswire_article)
    return

for x in range(1, 3):
    getarticles(x)

df = pandas.DataFrame(articlelist)
print(df.head())
print(len(df))
df.to_excel('PRNewsWire.xlsx', index=False)
print('Finished.')

I have discovered the following: in the line where I assign "prnewswire_articles" and search for a div with a certain class name, I get the results I want with the class "col-sm-12 card". Using "card" or "row" instead raises an error.

I noticed that the HTML structure of "card" divs is different from that of "col-sm-12 card" divs, but both contain one "h3" element (the article's title), one "a" element with an "href", and one "p" element.

This is the error message I get when using "row" or "card" as class name:

Traceback (most recent call last):
  File "/Users/myname/PycharmProjects/projectname/prnewswire.py", line 33, in <module>
    getarticles(x)
  File "/Users/myname/PycharmProjects/projectname/prnewswire.py", line 24, in getarticles
    'article_title': item.find('h3').text,
AttributeError: 'NoneType' object has no attribute 'text'

I've searched for a whole day and didn't find anything. I only recently started learning Python, so I'm sorry if this is a stupid mistake, but I've run out of ideas. I would really appreciate any help! :)

  • FYI "scrapping" means throwing away, as you do with rubbish. The correct term is scraping. Commented Mar 8, 2021 at 15:41
  • Oh, I overlooked that, sorry. It was already changed when I wanted to edit it. English is only my second language, so I didn't know any better :) Commented Mar 8, 2021 at 15:49

1 Answer

You can select all .row elements that are direct children of the .card-list element (using a CSS selector). I also changed how the article title is extracted: it now takes only the text that follows the <small> element:

import pandas
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) ' +
        'Version/14.0.1 Safari/605.1.15'
}

articlelist = []


def getarticles(page):
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    prnewswire_articles = soup.select('.card-list > .row')    # <-- select all rows that are under class "card-list"

    for item in prnewswire_articles:
        prnewswire_article = {
            'page': page,
            'article_title': item.select_one('h3 small').find_next_sibling(text=True).strip(),   # <--- select text that is after <small> element
            'article_link': 'https://www.prnewswire.com/' +
                            item.find('a')['href'],
            'article_description': item.find('p').get_text(strip=True, separator='\n'),
        }
        articlelist.append(prnewswire_article)

for x in range(1, 3):
    getarticles(x)

df = pandas.DataFrame(articlelist)
print(df)

Prints:

...

186     2  Wugen Announces Exclusive Partnership Agreemen...  https://www.prnewswire.com//news-releases/wuge...  Wugen Inc., a clinical-stage biotechnology com...
187     2  Upstryve Initiates Mentor Network for Trade St...  https://www.prnewswire.com//news-releases/upst...  Upstryve Inc (Upstryve) www.upstryve.com. Upst...
188     2  Inkling Simplifies Integration to Learning and...  https://www.prnewswire.com//news-releases/inkl...  Inkling, a global leader in digital learning p...
189     2  CommerceHub to Participate in Bank of America'...  https://www.prnewswire.com//news-releases/comm...  CommerceHub, a leading provider of ecommerce s...
190     2        Instrument Promotes Kara Place to President  https://www.prnewswire.com//news-releases/inst...  Instrument, a digitally focused, creative agen...
191     2   Regent Properties Announces Executive Promotions  https://www.prnewswire.com//news-releases/rege...  Regent Properties ("Regent"), a real estate in...
192     2  PowerPay Hits $1 Billion in Home Renovations L...  https://www.prnewswire.com//news-releases/powe...  PowerPay, the nation's fastest-growing home im...
193     2  GoldenTree Announces Closing of $698 Million C...  https://www.prnewswire.com//news-releases/gold...  GoldenTree Loan Management II ("GLM II") and i...
194     2  The Real-Time Moving Show on the Screen "Showt...  https://www.prnewswire.com//news-releases/the-...  EnableWow (www.showtap.com) launched a new pre...
195     2  Black Knight: Lock Activity Suggests Q1 2021 R...  https://www.prnewswire.com//news-releases/blac...  Today, the Data & Analytics division of Black ...
196     2  LG Innotek Joins Hands with Microsoft to Proli...  https://www.prnewswire.com//news-releases/lg-i...  LG Innotek (CEO Cheoldong Jeong) announced on ...
197     2  MicroWorkers Integrates Ontology's ONTO Wallet...  https://www.prnewswire.com//news-releases/micr...  To bridge micro workers globally, Ontology and...
198     2     MemVerge Introduces M3 Channel Partner Program  https://www.prnewswire.com//news-releases/memv...  MemVerge™, the pioneers of Big Memory software...
199     2  Innovative Deals Spur the Growth of New Sports...  https://www.prnewswire.com//news-releases/inno...  Last year has shaped up to be crucial for the ...
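
To see why "card" and "row" raised the AttributeError while "col-sm-12 card" did not, here is a small offline sketch. The markup in SAMPLE_HTML is hypothetical, loosely modelled on the structure described in the question, so the real page may differ. The helper name parse_rows and its base_url parameter are also made up for illustration. BeautifulSoup treats class as a multi-valued attribute: searching for "card" matches every div that has card among its classes, including wrappers that contain no h3, whereas "col-sm-12 card" only matches that exact attribute string. The sketch also guards against find() returning None and uses urljoin, which should avoid the double slash visible in the links above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real site may differ.
SAMPLE_HTML = """
<div class="card-list">
  <div class="row">
    <div class="col-sm-12 card">
      <a href="/news-releases/first.html"><h3><small>08:00 ET</small> First title</h3></a>
      <p>First description</p>
    </div>
  </div>
  <div class="row">
    <div class="col-sm-3 card"><img src="logo.png"></div>
    <div class="col-sm-9 card">
      <a href="/news-releases/second.html"><h3><small>09:00 ET</small> Second title</h3></a>
      <p>Second description</p>
    </div>
  </div>
</div>
<div class="row"><span>A row outside the article list, without h3</span></div>
"""


def parse_rows(html, base_url="https://www.prnewswire.com/"):
    """Return one dict per article row, skipping rows that lack a title or link."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    # Only rows that are direct children of .card-list are article previews.
    for row in soup.select(".card-list > .row"):
        heading = row.find("h3")
        link = row.find("a", href=True)
        summary = row.find("p")
        if heading is None or link is None:
            continue  # find() returns None when nothing matches
        small = heading.find("small")
        if small is not None:
            small.extract()  # drop the timestamp so only the title text remains
        articles.append({
            "article_title": heading.get_text(strip=True),
            "article_link": urljoin(base_url, link["href"]),
            "article_description": summary.get_text(strip=True) if summary else "",
        })
    return articles


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Matches any div that has "card" among its classes.
any_card = soup.find_all("div", {"class": "card"})
# Matches only divs whose class attribute is exactly this string.
exact_card = soup.find_all("div", {"class": "col-sm-12 card"})
# Divs that match "card" but have no h3; these trigger the AttributeError.
cards_without_h3 = [div for div in any_card if div.find("h3") is None]

for article in parse_rows(SAMPLE_HTML):
    print(article)
```

In this sample, the image wrapper matches "card" and the trailing div matches "row", yet neither contains an h3. That is the situation in which item.find('h3').text fails, and restricting the search to '.card-list > .row' avoids it.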

1 Comment

Can't say more other than thank you. It worked. You can't believe how much time and energy you've saved me <3
