1

I am trying to parse some links from this site https://news.ycombinator.com/

I want to select a specific table

document.querySelector("#hnmain > tbody > tr:nth-child(3) > td > table")

I know there are CSS selector limitations in bs4. But the problem is that I can't even select something as simple as #hnmain > tbody: soup.select('#hnmain > tbody') returns an empty list.

With the code below I'm unable to select tbody, whereas the JavaScript selector above works in the browser.

from bs4 import BeautifulSoup
import requests
print("-"*100)
print("Hackernews parser")
print("-"*100)
url="https://news.ycombinator.com/"
res=requests.get(url)
html=res.content
soup=BeautifulSoup(html)
table=soup.select('#hnmain > tbody')
print(table)

OUT:

soup=BeautifulSoup(html)
[]


3 Answers

2

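Neither BeautifulSoup nor a plain curl request returns a tbody tag for this page: the raw HTML contains none, and the browser inserts tbody when it builds the DOM, which is why the selector works in the browser console. This means

soup.select('tbody')

returns an empty list, and that is also why your selector returns an empty list.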

To extract just the links you are looking for, use

soup.select("a.storylink")

It will get the links that you want from the site.
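The empty result can be reproduced offline. Below is a minimal sketch, assuming a made-up, simplified snippet (not the real Hacker News markup) and the built-in html.parser, which keeps the markup as written and does not add tbody. It shows that the same path matches once tbody is dropped from the selector.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: like the raw page source, it has no <tbody>.
html = """
<table id="hnmain">
  <tr><td>header</td></tr>
  <tr><td>spacer</td></tr>
  <tr><td><table class="inner"><tr><td>stories</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# There is no tbody element in the parsed tree, so nothing matches.
with_tbody = soup.select("#hnmain > tbody")

# Dropping tbody from the selector reaches the nested table directly.
without_tbody = soup.select("#hnmain > tr:nth-child(3) > td > table")

print(with_tbody)
print(len(without_tbody), without_tbody[0]["class"])
```

Selectors copied from the browser's developer tools describe the browser's DOM, so they may need tbody removed before they work against the raw response.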


1 Comment

Thank you! Your approach is good for getting the main story links, but I need to fetch more attributes from each post, like upvote_count, comment_count, url_to_right_of_main_url, posted_ago, where these items are span/a elements without specific classes. Please help!
2

Instead of going through the body and table, why not go directly to the links? I tested this and it worked well:

links = soup.find_all('a', {'class': 'storylink'})

If you want the table, since there is only one per page you don't need to go through the other elements either - you can go straight to it.

table = soup.select('table')
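As a rough illustration of the two lookup styles, here is a sketch on made-up markup (the class name "stories" and the URLs are assumptions for the example, not taken from the real page). select takes a CSS selector string, find_all takes a tag name plus an attribute filter, and select('table') returns a list of every table rather than a single element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a nested table, for illustration only.
html = """
<table id="outer">
  <tr><td>
    <table class="stories">
      <tr><td><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
      <tr><td><a href="user?id=someone">someone</a></td></tr>
      <tr><td><a class="storylink" href="https://example.com/b">Story B</a></td></tr>
    </table>
  </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector form: tag name and class in one string.
by_selector = [a["href"] for a in soup.select("a.storylink")]

# find_all form: tag name plus an attribute dictionary.
by_filter = [a["href"] for a in soup.find_all("a", {"class": "storylink"})]

# select() always returns a list; select_one() returns the first match or None.
all_tables = soup.select("table")
stories_table = soup.select_one("table.stories")

print(by_selector)
print(len(all_tables), stories_table["class"])
```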

4 Comments

There's a total of 3 tables on that page.
Oh yeah, my bad. Well, then you can select based on class, ID or something else, but I wouldn't go through the element hierarchy if I were you.
Your approach is good for getting the main story links, but I need to fetch more attributes from each post, like upvote_count, comment_count, site_url, where these items are span/a elements without specific classes. Please help.
There are <td> elements with the class 'subtext' which hold a lot of the post info. The span elements also have IDs depending on the information stored. Look through the HTML carefully and find patterns for the information you want to grab.
1

Data is arranged in groups of 3 rows, where the third row is an empty row used for spacing. Loop over the top rows and use next_sibling to grab the associated second row at each point. Requires bs4 4.7.1+.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://news.ycombinator.com/')
soup = bs(r.content, 'lxml')
# The first row of each group of three carries the class "athing".
top_rows = soup.select('.athing')

for row in top_rows:
    # First row: title text, story URL and the site link next to the title.
    title = row.select_one('.storylink')
    print(title.text)
    print(title['href'])
    print('https://news.ycombinator.com/' + row.select_one('.sitebit a')['href'])
    # Second row: score, user, age and the link that is the sixth child of its parent.
    next_row = row.next_sibling
    print(next_row.select_one('.score').text)
    print(next_row.select_one('.hnuser').text)
    print(next_row.select_one('.age a').text)
    print(next_row.select_one('a:nth-child(6)').text)
    print(100*'-')
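The row-pairing idea can also be tried without a network request. The sketch below is an illustration on made-up markup, not the live page: the helper text_or_none is hypothetical, and find_next_sibling('tr') is swapped in for next_sibling so that whitespace between rows is skipped. It also returns None instead of raising when a post has no score or user.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: groups of three rows (title row, detail row, spacer).
html = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
  <tr><td class="subtext"><span class="score">10 points</span> <a class="hnuser">alice</a></td></tr>
  <tr class="spacer"></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/job">Job post</a></td></tr>
  <tr><td class="subtext"></td></tr>
  <tr class="spacer"></tr>
</table>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    found = parent.select_one(selector) if parent is not None else None
    return found.get_text(strip=True) if found is not None else None


soup = BeautifulSoup(html, "html.parser")
posts = []
for row in soup.select("tr.athing"):
    # The detail row is the next <tr> sibling of the title row.
    detail = row.find_next_sibling("tr")
    posts.append({
        "title": text_or_none(row, ".storylink"),
        "score": text_or_none(detail, ".score"),
        "user": text_or_none(detail, ".hnuser"),
    })

print(posts)
```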
