
I am trying to scrape the author name and author URL from the following webpage:

https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive

and I am using the following code:

    author_flag = 0
    divs = soup.find_all('h2')
    for div in divs:
        author = div.find('a')
        if(author is not None):
            author_art.append(author.text)
            author_url.append('https://medium.com'+ author.get('href'))
            aurhor_flag = 1
            break
        if(author_flag==0):
            author_art.append('Author information missing')
            author_url.append('Author Url information missing')

Can anyone take a look at what I am doing wrong here? This code is not picking up anything; it just returns blank lists.

Full code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re 

data = pd.read_csv('url_technology.csv')


author_art = []
author_url = []


for i in range(1): 
    try:   
    
        author_flag = 0
        divs = soup.find_all('meta')
        for div in divs:
            author = div.find('span')
            if(author is not None):
                author_art.append(author.text)
                author_url.append('https://medium.com'+author.get('href'))
                aurhor_flag = 1
                break
            if(author_flag==0):
                author_art.append('Author information missing')
                author_url.append('Author Url information missing')


    except:  
        print('no data found')
    
author_art = pd.DataFrame(title)
author_url = pd.DataFrame(url)


res = pd.concat([author_art, author_art] , axis=1)
res.columns = ['Author_Art', 'Author_url']
res.to_csv('combined1.csv')
print('File created successfully')

https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive---------0-----------------------
https://medium.com/job-advice-for-software-engineers/what-i-want-and-dont-want-to-see-on-your-software-engineering-resume-cbc07913f7f6?source=tag_archive---------1-----------------------
https://itnext.io/load-testing-using-apache-jmeter-af189dd6f805?source=tag_archive---------2-----------------------
https://medium.com/s/story/black-mirror-bandersnatch-a-study-guide-c46dfe9156d?source=tag_archive---------3-----------------------
https://medium.com/fast-company/the-worst-design-crimes-of-2018-56f32b027bb7?source=tag_archive---------4-----------------------
https://towardsdatascience.com/make-your-pictures-beautiful-with-a-touch-of-machine-learning-magic-31672daa3032?source=tag_archive---------5-----------------------
https://medium.com/hackernoon/the-state-of-ruby-2019-is-it-dying-509160a4fb92?source=tag_archive---------6-----------------------

  • Can you provide more complete code, preferably a minimal reproducible example? You've already presumably made requests and parsed the content, but seeing how you've done so can illuminate the issue. Commented Jun 22, 2021 at 16:50
  • Sure, I added the full code; I really appreciate your help. Commented Jun 22, 2021 at 16:56
  • In the full code I have a loop which I ran just one time. I have 5000 webpages which I have to scrape into an Excel sheet. Commented Jun 22, 2021 at 16:58

2 Answers

1

One possibility for getting the author name and author URL is to parse the LD+JSON data embedded within the page:

import json
import requests
from bs4 import BeautifulSoup

url = "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one('[type="application/ld+json"]').contents[0])

# uncomment this to print all LD+JSON data:
# print(json.dumps(data, indent=4))

print("Author:", data["author"]["name"])
print("URL:", data["author"]["url"])

Prints:

Author: Eric Elliott
URL: https://medium.com/@_ericelliott
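The same parsing logic can be checked offline against a minimal HTML snippet, without hitting Medium at all (the author values below are made up purely to exercise the code):

```python
import json
from bs4 import BeautifulSoup

# Minimal stand-in for a fetched Medium page; the author name and URL
# here are invented, just to exercise the LD+JSON parsing offline.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Article",
 "author": {"@type": "Person",
            "name": "Jane Doe",
            "url": "https://medium.com/@janedoe"}}
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.select_one('[type="application/ld+json"]')
data = json.loads(tag.contents[0])

name, author_url = data["author"]["name"], data["author"]["url"]
print("Author:", name)    # Author: Jane Doe
print("URL:", author_url) # URL: https://medium.com/@janedoe
```

If `select_one` returns `None` for a given page, that page simply doesn't embed LD+JSON data, which is why this approach works for some sites and not others.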

EDIT: A function that returns Author Name/URL:

import json
import requests
from bs4 import BeautifulSoup


def get_author_name_url(medium_url):
    soup = BeautifulSoup(requests.get(medium_url).content, "html.parser")
    data = json.loads(
        soup.select_one('[type="application/ld+json"]').contents[0]
    )
    return data["author"]["name"], data["author"]["url"]


url_list = [
    "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20",
]

for url in url_list:
    name, url = get_author_name_url(url)
    print("Author:", name)
    print("URL:", url)
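To cover many pages and end up with a CSV, as asked in the comments, the function can be wrapped in a loop that records a placeholder on failure and writes everything out with pandas. A minimal sketch; the stub below stands in for the real `get_author_name_url` (which does a network request), just so the bookkeeping can be shown:

```python
import pandas as pd

# Stub standing in for the real get_author_name_url(url) above.
# It "fails" for one URL to show how missing pages are recorded.
# Replace it with the real function when running against Medium.
def get_author_name_url(url):
    if "bad" in url:
        raise ValueError("page not reachable")
    return "Jane Doe", "https://medium.com/@janedoe"  # made-up values

url_list = [
    "https://medium.com/some-story",
    "https://medium.com/bad-story",
]

rows = []
for url in url_list:
    try:
        name, author_url = get_author_name_url(url)
    except Exception:
        # Same placeholders the question's code used for missing data
        name = "Author information missing"
        author_url = "Author Url information missing"
    rows.append({"Author_Art": name, "Author_url": author_url})

res = pd.DataFrame(rows)
res.to_csv("combined1.csv", index=False)
print(res)
```

Building a list of row dicts and constructing one DataFrame at the end avoids the mismatched-length lists that the original code could produce when one of the two `append` calls was skipped.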

5 Comments

In the full code above I have a loop which I ran only one time (for testing). I have 5000 webpages which I have to scrape into an Excel sheet. How do I put this into the loop above and into a DataFrame that I can load into an Excel sheet?
@Qasim0787 Yes, make a function from my script that returns Name/URL and put it inside a loop.
It is working for this specific URL but not for others.
@Qasim0787 Yes, every server/site is a little bit different. There isn't one universal code that works for all sites.
Can you please check if there is some similarity among them, or whether we can make a generic one? I have added a few URLs above as reference.
0

I've launched a Python package called medium-apis to do such tasks.

  1. Install medium-apis:
pip install medium-apis
  2. Get your RapidAPI key. See how

  3. Run the code:

from medium_apis import Medium

medium = Medium('YOUR_RAPIDAPI_KEY')

def get_author(url):
  url_without_parameters = url.split('?')[0]
  article_id = url_without_parameters.split('-')[-1]

  article = medium.article(article_id=article_id)
  author = article.author

  author.save_info()

  return author

urls = [
  "https://nishu-jain.medium.com/medium-apis-documentation-3384e2d08667",
]

for url in urls:
  author = get_author(url)
  print('Author: ', author.fullname)
  print('Profile URL: ', f'https://medium.com/@{author.username}')


Github repo: https://github.com/weeping-angel/medium-apis

Comments
