BeautifulSoup soup.select() returning an empty list for a CSS selector

Question

I am trying to parse some links from this site https://news.ycombinator.com/

I want to select a specific table

document.querySelector("#hnmain > tbody > tr:nth-child(3) > td > table")

I know there are CSS selector limitations in bs4. But the problem is that I can't even select something as simple as #hnmain > tbody: soup.select('#hnmain > tbody') returns an empty list.

With the code below I'm unable to select tbody, whereas the JavaScript selector above works in the browser.

from bs4 import BeautifulSoup
import requests
print("-"*100)
print("Hackernews parser")
print("-"*100)
url="https://news.ycombinator.com/"
res=requests.get(url)
html=res.content
soup=BeautifulSoup(html)
table=soup.select('#hnmain > tbody')
print(table)

OUT:

soup=BeautifulSoup(html)
[]

Rabin Adhikari · Accepted Answer · 2019-10-20 05:09:43Z

2

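The raw HTML returned to BeautifulSoup (or to curl) does not contain a tbody tag. Browsers insert tbody when they build the DOM, which is why the selector works in the browser console but not here. It means

soup.select('tbody')

returns an empty list. That is the same reason your selector returns an empty list.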
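To make the point above concrete, here is a minimal offline sketch. It assumes a trimmed-down, hypothetical table shaped like the served page (the inner `itemlist` class and the cell text are made up for illustration) and uses the built-in `html.parser`, which does not add `tbody`. Dropping `tbody` from the selector is enough to reach the nested table.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the served HTML: rows sit directly
# under <table>, with no <tbody> wrapper.
html = """
<table id="hnmain">
  <tr><td>header</td></tr>
  <tr><td>spacer</td></tr>
  <tr><td><table class="itemlist"><tr class="athing"><td>story</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The browser-style selector finds nothing, because no <tbody> exists here.
with_tbody = soup.select("#hnmain > tbody")

# The same path without the tbody step reaches the nested table.
without_tbody = soup.select("#hnmain > tr:nth-child(3) > td > table")

print(with_tbody)
print(len(without_tbody))
```

Selectors copied from the browser's dev tools often need this adjustment, since they describe the DOM after the browser has normalised the markup.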

To extract just the links you are looking for, use

soup.select("a.storylink")

This returns the story links from the page.


1 Comment

Adil Saju Over a year ago

Thank you! Your approach is good for getting the main story links, but I need to fetch more attributes from each post, like upvote_count, comment_count, url_to_right_of_main_url and posted_ago, where these items are span/a elements without specific classes. Please help!

Joseph Rajchwald · Accepted Answer · 2019-10-20 04:46:08Z

2

Instead of going through the body and table, why not go directly to the links? I tested this and it worked well:

links=soup.find_all('a',{'class':'storylink'})

If you want the table, since there is only one per page you don't need to go through the other elements either - you can go straight to it.

table = soup.select('table')

4 Comments

Adil Saju Over a year ago

theres total of 3 tables in that page

Joseph Rajchwald Over a year ago

Oh yeah, my bad. Well then you can parse based on class, ID or something else, but I wouldn't go through the element hierarchy if I were you.

Adil Saju Over a year ago

Your approach is good for getting the main story links, but I need to fetch more attributes from each post, like upvote_count, comment_count and site_url, where these items are span/a elements without specific classes. Please help.

Joseph Rajchwald Over a year ago

There are <td> elements with the class 'subtext' which hold a lot of the post info. The span elements also have IDs depending on the information stored. Look through the HTML carefully and find patterns for the information you want to grab.
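As a rough illustration of that suggestion, the sketch below reads a single metadata cell. The HTML is a hypothetical sample built from the class names mentioned in this thread (`subtext`, `score`, `hnuser`, `age`); the real page may differ. Taking the last link in the cell as the comments link is an assumption, not something the page guarantees.

```python
from bs4 import BeautifulSoup

# Hypothetical sample of one metadata cell; the real markup may differ.
html = """
<table><tr><td class="subtext">
  <span class="score" id="score_1">42 points</span> by
  <a class="hnuser" href="user?id=alice">alice</a>
  <span class="age"><a href="item?id=1">3 hours ago</a></span> |
  <a href="item?id=1">17 comments</a>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.select_one("td.subtext")

# Elements that carry a class can be selected directly inside the cell.
score = cell.select_one(".score").get_text(strip=True)
user = cell.select_one(".hnuser").get_text(strip=True)
age = cell.select_one(".age").get_text(strip=True)

# The comments link has no class, so this sketch takes the last link
# in the cell (an assumption about the layout).
comments = cell.select("a")[-1].get_text(strip=True)

print(score, user, age, comments, sep=" | ")
```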

QHarr · Accepted Answer · 2019-10-20 08:31:07Z

Data is arranged in groups of 3 rows, where the third row is an empty row used for spacing. Loop over the top rows and use next_sibling to grab the associated second row at each point. Requires bs4 4.7.1+.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://news.ycombinator.com/')
soup = bs(r.content, 'lxml')
top_rows = soup.select('.athing')

for row in top_rows:
    title = row.select_one('.storylink')
    print(title.text)
    print(title['href'])
    print('https://news.ycombinator.com/' + row.select_one('.sitebit a')['href'])
    next_row = row.next_sibling
    print(next_row.select_one('.score').text)
    print(next_row.select_one('.hnuser').text)
    print(next_row.select_one('.age a').text)
    print(next_row.select_one('a:nth-child(6)').text)
    print(100*'-')
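A slightly more defensive variant of the loop above, as a sketch only: `next_sibling` can return a whitespace text node when there are newlines between rows, and some rows may lack a score, so this version uses `find_next_sibling('tr')` and returns `None` for missing fields. The sample HTML and the `text_or_none` helper are hypothetical, and `html.parser` is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical rows in the 3-row grouping described above.
html = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
  <tr><td class="subtext"><span class="score">10 points</span></td></tr>
  <tr class="spacer"></tr>
  <tr class="athing"><td><a class="storylink" href="item?id=2">Job post</a></td></tr>
  <tr><td class="subtext"><span class="age">1 hour ago</span></td></tr>
  <tr class="spacer"></tr>
</table>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if absent."""
    if parent is None:
        return None
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(html, "html.parser")
stories = []
for row in soup.select("tr.athing"):
    # find_next_sibling('tr') skips text nodes between the rows.
    detail = row.find_next_sibling("tr")
    stories.append({
        "title": text_or_none(row, ".storylink"),
        "score": text_or_none(detail, ".score"),
        "age": text_or_none(detail, ".age"),
    })

for story in stories:
    print(story)
```

Collecting dictionaries instead of printing inside the loop makes it easier to pass the result to something like a CSV writer later.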
