
For the page https://patents.google.com/patent/WO2012061469A3/en?oq=medicinal+chemistry I want my code to print the patent citations, giving the publication number and title of each one.

I then want to use pandas to put the publication numbers in one column and the titles in another. So far I have used Beautiful Soup to parse the HTML file. I have selected the backwardReferences rows, and within those I want to print the publication number and title of each citation. I am showing a single example here, but I have a folder full of HTML files that I will process later.

x = soup.select('tr[itemprop="backwardReferences"]')
y = soup.select('td[itemprop="title"]')  # this line gives all the titles in the document not particularly under the patent citations
print(y)

1 Answer

You can use the CSS selector combination shown in the code below. select_one ensures that only the first matching table is returned. If you are worried about the table order changing, you can add :not to exclude the second table (the Non-Patent Citations table) based on its heading text:

pd.read_html(str(soup.select('section:has(h2:contains("Patent Citations"):not(:contains("Non-Patent Citations"))) > table')))

Note:

  1. Whilst the rendered webpage visually displays 2 results for Patent Citations, only 1 is present in this table in the page source, and therefore in the content returned by requests.
  2. I have used pandas, since you said you will be importing it anyway, to generate the tabular output and to subset specific columns.
  3. You can use pd.concat() to combine the DataFrames produced in a loop over multiple files into a single, final df.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
 
r = requests.get('https://patents.google.com/patent/WO2012061469A3/en?oq=medicinal+chemistry')
soup = bs(r.content, 'lxml')
df = pd.read_html(str(soup.select_one('section:has(h2:contains("Patent Citations")) > table')))[0]
print(df.loc[: , ['Publication number', 'Title']])
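
Note 3 could look roughly like the sketch below. It is a minimal sketch under some assumptions, not something tested against the live site: read_citations and read_folder are hypothetical helper names, the saved files are assumed to be UTF-8 and to end in .html, and the table cells are read with Beautiful Soup directly instead of pd.read_html. The selector is the :not variant from above, written with :-soup-contains, which newer soupsieve releases prefer over :contains.

```python
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup

# The :not variant of the selector, using :-soup-contains
# (the current soupsieve name for :contains).
SELECTOR = (
    'section:has(h2:-soup-contains("Patent Citations")'
    ':not(:-soup-contains("Non-Patent Citations"))) > table'
)


def read_citations(html: str) -> pd.DataFrame:
    """Return the Patent Citations table of one page as a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(SELECTOR)
    if table is None:  # page without a Patent Citations section
        return pd.DataFrame()
    headers = [th.get_text(" ", strip=True) for th in table.select("th")]
    rows = []
    for tr in table.select("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.select("td")]
        # Keep only data rows whose cell count matches the header row.
        if cells and len(cells) == len(headers):
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


def read_folder(folder: str) -> pd.DataFrame:
    """Combine the citation tables of every .html file in `folder`."""
    frames = []
    for path in sorted(Path(folder).glob("*.html")):
        df = read_citations(path.read_text(encoding="utf-8"))
        if not df.empty:
            # Record which file each row came from.
            frames.append(df.assign(source=path.name))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling read_folder("patents") (the folder name is a placeholder) should then give one DataFrame, and df[["Publication number", "Title"]] picks the two columns, assuming the saved pages use those header names.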

6 Comments

Yes, it works. Can you explain what it means when you write table? I understand that you select the h2 header from the HTML tags. But what does > table mean? And why does the code not pick up the second citation that I see on the page at that URL?
If I check this link, for example: patents.google.com/patent/US4458945?oq=US4458945A, I think the HTML tags are different. I just can't get it to print the correct patent citations (there are 16; the family-to-family citations don't get printed). I tried section:has(tr:contains("backwardReferences"))>table and it doesn't work. How do I use the tag tr itemprop="backwardReferences"? I think that may give all the citations.
The > table means that table is a direct child of whatever is on the left of the >. In the HTML returned by requests, there is only 1 citation in the table rather than two. I assume JavaScript running on the webpage alters how this looks when you view the page in a browser. That script doesn't run when using requests.
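
To make the difference concrete, here is a small sketch on made-up markup (the ids t1 and t2 are invented for illustration and do not come from the patent page). It compares the child combinator with a plain descendant selector:

```python
from bs4 import BeautifulSoup

# Made-up markup: one table directly inside a section, one nested in a div.
html = """
<section><h2>Patent Citations</h2><table id="t1"></table></section>
<section><h2>Patent Citations</h2><div><table id="t2"></table></div></section>
"""
soup = BeautifulSoup(html, "html.parser")

# "section > table": only tables whose parent is the section itself.
child_ids = [t["id"] for t in soup.select("section > table")]

# "section table": any table anywhere below a section.
descendant_ids = [t["id"] for t in soup.select("section table")]

print(child_ids)       # ['t1']
print(descendant_ids)  # ['t1', 't2']
```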
Your question asked for patent citations so I only returned those. Are you now saying you want all citations?
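
On the tr itemprop="backwardReferences" idea from the comments: a minimal sketch, assuming each citation row is a tr with that itemprop and that the number and title sit inside it under itemprop="publicationNumber" and itemprop="title". The publicationNumber name, the sample markup, and the rows_by_itemprop helper are assumptions for illustration, so check them against the actual page source. The point is to look up the title within each row, rather than across the whole document as in the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may nest these differently.
html = """
<table>
  <tr itemprop="backwardReferences">
    <td><span itemprop="publicationNumber">US1111111A</span></td>
    <td itemprop="title">First cited patent</td>
  </tr>
  <tr itemprop="backwardReferences">
    <td><span itemprop="publicationNumber">US2222222A</span></td>
    <td itemprop="title">Second cited patent</td>
  </tr>
  <tr itemprop="someOtherRow">
    <td itemprop="title">Title that should not be picked up</td>
  </tr>
</table>
"""


def rows_by_itemprop(soup: BeautifulSoup, itemprop: str) -> pd.DataFrame:
    """Collect number and title from every tr with the given itemprop."""
    records = []
    for tr in soup.select(f'tr[itemprop="{itemprop}"]'):
        # Search inside this row only, so unrelated titles are ignored.
        number = tr.select_one('[itemprop="publicationNumber"]')
        title = tr.select_one('[itemprop="title"]')
        records.append({
            "Publication number": number.get_text(strip=True) if number else None,
            "Title": title.get_text(strip=True) if title else None,
        })
    return pd.DataFrame(records, columns=["Publication number", "Title"])


soup = BeautifulSoup(html, "html.parser")
df = rows_by_itemprop(soup, "backwardReferences")
print(df)
```

If other kinds of citation rows use a different itemprop value, the same helper can be called with that value and the results combined with pd.concat().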