
I am trying to scrape the author name and author URL from the following webpage:

https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive

and I am using the following code:

    author_flag = 0
    divs = soup.find_all('h2')
    for div in divs:
        author = div.find('a')
        if(author is not None):
            author_art.append(author.text)
            author_url.append('https://medium.com'+ author.get('href'))
            aurhor_flag = 1
            break
        if(author_flag==0):
            author_art.append('Author information missing')
            author_url.append('Author Url information missing')

Can anyone take a look at what I am doing wrong here? This code is not picking up anything; it just returns blank lists.

Full code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re 

data = pd.read_csv('url_technology.csv')


author_art = []
author_url = []


for i in range(1): 
    try:   
    
        author_flag = 0
        divs = soup.find_all('meta')
        for div in divs:
            author = div.find('span')
            if(author is not None):
                author_art.append(author.text)
                author_url.append('https://medium.com'+author.get('href'))
                aurhor_flag = 1
                break
            if(author_flag==0):
                author_art.append('Author information missing')
                author_url.append('Author Url information missing')


    except:  
        print('no data found')
    
author_art = pd.DataFrame(title)
author_url = pd.DataFrame(url)


res = pd.concat([author_art, author_art] , axis=1)
res.columns = ['Author_Art', 'Author_url']
res.to_csv('combined1.csv')
print('File created successfully')

https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive---------0-----------------------
https://medium.com/job-advice-for-software-engineers/what-i-want-and-dont-want-to-see-on-your-software-engineering-resume-cbc07913f7f6?source=tag_archive---------1-----------------------
https://itnext.io/load-testing-using-apache-jmeter-af189dd6f805?source=tag_archive---------2-----------------------
https://medium.com/s/story/black-mirror-bandersnatch-a-study-guide-c46dfe9156d?source=tag_archive---------3-----------------------
https://medium.com/fast-company/the-worst-design-crimes-of-2018-56f32b027bb7?source=tag_archive---------4-----------------------
https://towardsdatascience.com/make-your-pictures-beautiful-with-a-touch-of-machine-learning-magic-31672daa3032?source=tag_archive---------5-----------------------
https://medium.com/hackernoon/the-state-of-ruby-2019-is-it-dying-509160a4fb92?source=tag_archive---------6-----------------------

  • Can you provide more complete code, preferably a minimal reproducible example? You've already presumably made requests and parsed the content, but seeing how you've done so can illuminate the issue. Commented Jun 22, 2021 at 16:50
  • Sure, I added the full code; I really appreciate your help. Commented Jun 22, 2021 at 16:56
  • In the full code I have a loop which I ran just one time. I have 5000 webpages which I have to scrape into an Excel sheet. Commented Jun 22, 2021 at 16:58

2 Answers

1

One possibility for getting the author name and author URL is to parse the LD+JSON data embedded within the page:

import json
import requests
from bs4 import BeautifulSoup

url = "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one('[type="application/ld+json"]').contents[0])

# uncomment this to print all LD+JSON data:
# print(json.dumps(data, indent=4))

print("Author:", data["author"]["name"])
print("URL:", data["author"]["url"])

Prints:

Author: Eric Elliott
URL: https://medium.com/@_ericelliott
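The same parsing logic can be checked offline against a minimal HTML snippet, without hitting Medium at all (the author values below are made up purely to exercise the code):

```python
import json
from bs4 import BeautifulSoup

# Minimal stand-in for a fetched Medium page; the author name and URL
# here are invented, just to exercise the LD+JSON parsing offline.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Article",
 "author": {"@type": "Person",
            "name": "Jane Doe",
            "url": "https://medium.com/@janedoe"}}
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.select_one('[type="application/ld+json"]')
data = json.loads(tag.contents[0])

name, author_url = data["author"]["name"], data["author"]["url"]
print("Author:", name)    # Author: Jane Doe
print("URL:", author_url) # URL: https://medium.com/@janedoe
```

If `select_one` returns `None` for a given page, that page simply doesn't embed LD+JSON data, which is why this approach works for some sites and not others.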

EDIT: A function that returns Author Name/URL:

import json
import requests
from bs4 import BeautifulSoup


def get_author_name_url(medium_url):
    soup = BeautifulSoup(requests.get(medium_url).content, "html.parser")
    data = json.loads(
        soup.select_one('[type="application/ld+json"]').contents[0]
    )
    return data["author"]["name"], data["author"]["url"]


url_list = [
    "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20",
]

for url in url_list:
    name, url = get_author_name_url(url)
    print("Author:", name)
    print("URL:", url)
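To cover many pages and end up with a CSV, as asked in the comments, the function can be wrapped in a loop that records a placeholder on failure and writes everything out with pandas. A minimal sketch; the stub below stands in for the real `get_author_name_url` (which does a network request), just so the bookkeeping can be shown:

```python
import pandas as pd

# Stub standing in for the real get_author_name_url(url) above.
# It "fails" for one URL to show how missing pages are recorded.
# Replace it with the real function when running against Medium.
def get_author_name_url(url):
    if "bad" in url:
        raise ValueError("page not reachable")
    return "Jane Doe", "https://medium.com/@janedoe"  # made-up values

url_list = [
    "https://medium.com/some-story",
    "https://medium.com/bad-story",
]

rows = []
for url in url_list:
    try:
        name, author_url = get_author_name_url(url)
    except Exception:
        # Same placeholders the question's code used for missing data
        name = "Author information missing"
        author_url = "Author Url information missing"
    rows.append({"Author_Art": name, "Author_url": author_url})

res = pd.DataFrame(rows)
res.to_csv("combined1.csv", index=False)
print(res)
```

Building a list of row dicts and constructing one DataFrame at the end avoids the mismatched-length lists that the original code could produce when one of the two `append` calls was skipped.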

5 Comments

In the full code above I have a loop which I ran only one time (for testing). I have 5000 webpages which I have to scrape into an Excel sheet. How do I put this into the loop above and into a DataFrame that I can load into an Excel sheet?
@Qasim0787 Yes, make a function from my script that returns Name/URL and put it inside a loop.
It is working for this specific URL but not for others.
@Qasim0787 Yes, every server/site is a little bit different. There isn't one universal code that works for all sites.
Can you please check if there is some similarity among them, or whether we can make a generic one? I have added a few URLs above as reference.
0

I've launched a Python package called medium-apis to do such tasks.

  1. Install medium-apis:
pip install medium-apis
  2. Get your RapidAPI key. See how

  3. Run the code:

from medium_apis import Medium

medium = Medium('YOUR_RAPIDAPI_KEY')

def get_author(url):
  url_without_parameters = url.split('?')[0]
  article_id = url_without_parameters.split('-')[-1]

  article = medium.article(article_id=article_id)
  author = article.author

  author.save_info()

  return author

urls = [
  "https://nishu-jain.medium.com/medium-apis-documentation-3384e2d08667",
]

for url in urls:
  author = get_author(url)
  print('Author: ', author.fullname)
  print('Profile URL: ', f'https://medium.com/@{author.username}')


Github repo: https://github.com/weeping-angel/medium-apis

Comments
