Extract specific html tag using python

Question

For this page, https://patents.google.com/patent/WO2012061469A3/en?oq=medicinal+chemistry, I want to print the patent citations, giving the publication number and title of each.

I then want to use pandas to put the publication number in one column and the title in another. So far I have used Beautiful Soup to parse the HTML file. I have selected the backwardReferences HTML tag, and under it I want to print the publication number and title of each citation. I am showing a single example here, but I have a folder full of HTML files that I will process later.

x = soup.select('tr[itemprop="backwardReferences"]')
# This returns every title in the document, not only those under the patent citations.
y = soup.select('td[itemprop="title"]')
print(x)
print(y)

QHarr · Accepted Answer · 2021-04-25 07:59:50Z


You can use the following CSS selector combination. select_one ensures that only the first matching table is used. If you are worried about the table order changing, you can add :not to exclude the other table, based on the heading text of the second (Non-Patent Citations) table, with:

pd.read_html(str(soup.select('section:has(h2:contains("Patent Citations"):not(:contains("Non-Patent Citations"))) > table')))

Note:

Although the rendered webpage displays 2 results for Patent Citations, only 1 is present in this table in the page source, and therefore in the content returned by requests.
I have used pandas, since you said you will be using it anyway, to generate the tabular output and subset the specific columns.
You can use pd.concat() to combine the DataFrames produced in a loop over multiple files into a single, final df.

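from io import StringIO

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://patents.google.com/patent/WO2012061469A3/en?oq=medicinal+chemistry')
soup = bs(r.content, 'lxml')
# select_one returns the first table that is a direct child of the "Patent Citations" section.
# Newer Soup Sieve releases spell :contains as :-soup-contains.
table = soup.select_one('section:has(h2:contains("Patent Citations")) > table')
# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO.
df = pd.read_html(StringIO(str(table)))[0]
print(df.loc[:, ['Publication number', 'Title']])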
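The pd.concat() note above could be sketched roughly as follows for a folder of saved HTML files. This is only an illustration under stated assumptions: the helper names citations_from_html and citations_from_folder, the source_file column, and the *.html glob pattern are hypothetical, and the selector uses the newer :-soup-contains spelling, which needs Soup Sieve 2.1 or later.

```python
from io import StringIO
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup

# Same idea as the selector in the answer, with the newer pseudo-class name.
SELECTOR = 'section:has(h2:-soup-contains("Patent Citations")) > table'


def citations_from_html(html):
    """Return the Patent Citations table of one page, or None if it is absent."""
    soup = BeautifulSoup(html, "lxml")
    table = soup.select_one(SELECTOR)
    if table is None:
        return None
    df = pd.read_html(StringIO(str(table)))[0]
    return df.loc[:, ["Publication number", "Title"]]


def citations_from_folder(folder):
    """Combine the citation tables of every *.html file in a folder (hypothetical layout)."""
    frames = []
    for path in sorted(Path(folder).glob("*.html")):
        df = citations_from_html(path.read_text(encoding="utf-8"))
        if df is None:
            continue  # this file has no Patent Citations table
        df.insert(0, "source_file", path.name)  # remember where each row came from
        frames.append(df)
    if not frames:
        return pd.DataFrame(columns=["source_file", "Publication number", "Title"])
    return pd.concat(frames, ignore_index=True)
```

Calling citations_from_folder("patents") would then give one DataFrame with a row per citation across all files in a folder named patents (a placeholder name). Files without the table are skipped rather than raising an error.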

edited Apr 25, 2021 at 7:59

answered Apr 25, 2021 at 7:54

QHarr


6 Comments

astronaut Over a year ago

Yes, it works. Can you explain what it means when you write table? I understand that you select the h2 header from the HTML tags. What does > table mean? And why does the code not pick up the second citation that I see at the URL?

astronaut Over a year ago

If I check this link, for example: patents.google.com/patent/US4458945?oq=US4458945A, I think the HTML tags are different. I just can't get it to print the correct patent citations (16; the family-to-family citations don't get printed). I tried section:has(tr:contains("backwardReferences"))>table and it doesn't work. How do I use the tag tr[itemprop="backwardReferences"]? I think that may give all the citations.

astronaut Over a year ago

stackoverflow.com/questions/67270690/…

QHarr Over a year ago

The > table means that table is a direct child of whatever is on the left of >. In the HTML returned by requests, there is only 1 citation in the table rather than two. I assume JavaScript running on the webpage alters how this looks when viewing the page in a browser. It doesn't run when using requests.
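To make the difference concrete, here is a small self-contained illustration of the child combinator (>) versus the plain descendant combinator (a space). The HTML snippet and its id values are made up for the example and are not taken from the patent page.

```python
from bs4 import BeautifulSoup

# Made-up markup: one table directly under <section>, one nested inside a <div>.
html = """
<section>
  <table id="direct"></table>
  <div><table id="nested"></table></div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# "section > table": only tables whose parent element is the <section>.
child_ids = [t["id"] for t in soup.select("section > table")]

# "section table": any table inside the <section>, at any depth.
descendant_ids = [t["id"] for t in soup.select("section table")]

print(child_ids)       # ['direct']
print(descendant_ids)  # ['direct', 'nested']
```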

QHarr Over a year ago

Your question asked for patent citations so I only returned those. Are you now saying you want all citations?
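If every row tagged as a backward reference is wanted, one possible approach is to select the tr[itemprop="backwardReferences"] rows from the question and then look up the title only inside each row, so that titles elsewhere in the document are not picked up. The sketch below is not from the answer: the function name backward_references is hypothetical, and the itemprop="publicationNumber" attribute is an assumption about the page markup that should be checked against the page source. Family-to-family citations may carry a different itemprop value, in which case the row selector would need adjusting.

```python
import pandas as pd
from bs4 import BeautifulSoup


def backward_references(html):
    """Collect publication number and title from each backwardReferences row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select('tr[itemprop="backwardReferences"]'):
        # Assumption: the number sits in an element with itemprop="publicationNumber".
        number = tr.select_one('[itemprop="publicationNumber"]')
        # Searching from the row, not from soup, keeps the match scoped to this citation.
        title = tr.select_one('td[itemprop="title"]')
        rows.append({
            "Publication number": number.get_text(strip=True) if number else None,
            "Title": title.get_text(strip=True) if title else None,
        })
    return pd.DataFrame(rows, columns=["Publication number", "Title"])
```

Because this reads the rows directly, it does not depend on the section heading text or on the table being a direct child of the section.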
