
So I've run into an issue where I've been parsing an XML file like so:

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_string, "lxml")
pub_ref = soup.findAll("publication-reference")

with open('./output.csv', 'a', newline='') as f:
    writer = csv.writer(f, dialect='excel')

    for info in pub_ref:
        assign = soup.findAll("assignee")
        pat_cite = soup.findAll("patcit")

        for item1 in assign:
            if item1.find("orgname"):
                org_name = item1.find("orgname").text

        for item2 in pat_cite:
            if item2.find("name"):
                name = item2.find("name").text

        for inv_name, pat_num, cpc_num, class_num, subclass_num, date_num, country, city, state in zip(
                soup.findAll("invention-title"), soup.findAll("doc-number"),
                soup.findAll("section"), soup.findAll("class"), soup.findAll("subclass"),
                soup.findAll("date"), soup.findAll("country"), soup.findAll("city"),
                soup.findAll("state")):
            writer.writerow([inv_name.text, pat_num.text, org_name, cpc_num.text,
                             class_num.text, subclass_num.text, date_num.text,
                             country.text, city.text, state.text, name])

Previously I only needed a few elements (the .text entries at the end), but I now have about 10 more parent elements with over 30 more child elements to parse, so explicitly listing them all like this won't scale. I also have repeats in the data, which look like this:

<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<patcit num="00002">
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>

I would like to be able to parse repeated child elements (such as patcit) into my CSV file as columns, like so:

invention name  country  city  ....  patcit name1  patcit date1 ....
                (blank)              patcit name2  patcit date2 ....
                (blank)              patcit name3  patcit date3 ....

And so on. Because each invention has more than one citation or reference, most of the other information should appear only once, in the first row for that invention.
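To illustrate, here is a rough sketch of the kind of loop I have in mind. It assumes xml_string holds one patent document; the text_or_blank helper and the three invention-level columns are just placeholders for my real tags:

import csv
from bs4 import BeautifulSoup

def text_or_blank(parent, tag):
    # hypothetical helper: return the tag's text if present, else an empty string
    found = parent.find(tag)
    return found.text.strip() if found else ""

soup = BeautifulSoup(xml_string, "lxml")   # xml_string holds one patent document

# invention-level fields, written only on the first citation row
base_row = [text_or_blank(soup, "invention-title"),
            text_or_blank(soup, "country"),
            text_or_blank(soup, "city")]

with open("./output.csv", "a", newline="") as f:
    writer = csv.writer(f, dialect="excel")
    for i, patcit in enumerate(soup.findAll("patcit")):
        # leave the invention-level columns blank after the first citation row
        prefix = base_row if i == 0 else [""] * len(base_row)
        writer.writerow(prefix + [text_or_blank(patcit, "name"),
                                  text_or_blank(patcit, "date")])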

  • I don't see a publication-reference, assignee, or orgname element in your sample. If you don't mind, can you share a minimal sample of the XML document that we can work with? Commented Nov 19, 2017 at 13:26
  • Yes, Oluwafemi, I did not include the entire XML file in this sample because it is simply too large; those elements are definitely there. I have added more of the file, but I'm only having difficulties with the portion I have already shown; the other elements parse easily. Commented Nov 21, 2017 at 0:17

1 Answer

Try the script below. I suppose this is what you wanted.

from bs4 import BeautifulSoup

xml_content='''
<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<patcit num="00002">
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
'''
soup = BeautifulSoup(xml_content, "lxml")
# grab every <patcit> whose num attribute starts with "000"
for item in soup.select("patcit[num^=000]"):
    name = item.select("name")[0].text
    date = item.select("date")[0].text
    kind = item.select("kind")[0].text
    doc_number = item.select("doc-number")[0].text
    country = item.select("country")[0].text
    print(name, date, kind, doc_number, country)

Results:

Haskell 19260600 A 1589850 US
Orme, Jr. 19421100 S D134414 US
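If you'd rather write those repeated citations to a CSV than print them, here is a rough sketch of how the same loop could feed writer.writerow (assuming Python 3; the file name and column order are just examples):

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_content, "lxml")

with open("citations.csv", "a", newline="") as f:   # hypothetical output file
    writer = csv.writer(f, dialect="excel")
    for item in soup.select("patcit[num^=000]"):
        # one CSV row per <patcit>, so repeated citations become repeated rows
        writer.writerow([item.select_one("name").text,
                         item.select_one("date").text,
                         item.select_one("kind").text,
                         item.select_one("doc-number").text,
                         item.select_one("country").text])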

This solution is for the link you provided later:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/")
soup = BeautifulSoup(res.text,"lxml")
# take the second table on the page and print each row's cell text
table = soup.select("table")[1]
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("td")])
    print(data)

5 Comments

I tried running it along with the other tags I want, but your method only captured the first entry for each case, stored it, and kept reusing that same value through all my loops. So I devised my own method, which works now, but it only grabs one of the many citations in each case. I'd like to store the multiple citations in each entry (like Haskell and Orme, Jr.) down a column in Excel format (which I do with a writerow call).
Without seeing the real website it's hard to give a solution that won't go wrong. If possible, paste the website address here and I'll take a look.
bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017 Any of these XMLs will suffice as they're all in the exact same format as shown above.
Hmm... well, this solution just prints the data on the actual site as a table, but I'm parsing the XML files on the site into a CSV file read through Excel. The problem is finding a way to plug all the repeated child nodes (like patcit above) into a column.
I've made an edit to the end of my original post that might help explain what I want; I feel some miscommunication has happened on my end.
