Extract information from html list to pandas df/list/dict (python 3.0)

Question

I have the source code of a website that contains several lists, and I would like to extract the information in these lists into a usable format in Python.

For example, here is the first entry of a list of countries:

<ul class='checklist__list'>

    <li class=' checklist__item' id='checklist__item--country-111'>
      <label class='checklist__label ripple-animation'>
        <input  class="checklist__input js-checklist__input idb-on-change" type="checkbox" id="111" name="country" value="111">
          Germany
        </input>
      </label>
    </li>

Say I am interested in the country id (here: 111) and the matching country name (here: Germany), and would like to have both in a usable format in Python, for example a pandas DataFrame or a dictionary.

Does anyone know an easy way to do that? The original list contains >100 countries.

Thank you very much for suggestions!

Looks like beautiful soup would be the easy thing here

– SuperStew, Commented May 2, 2018 at 14:56

gaw89 · Accepted Answer · 2018-05-02 15:19:24Z


You can solve this problem easily with BeautifulSoup. Given the markup you've posted in your question, this code snippet should extract the id and label:

from bs4 import BeautifulSoup as bs
html = """<ul class='checklist__list'>
            <li class=' checklist__item' id='checklist__item--country-111'>
              <label class='checklist__label ripple-animation'>
              <input  class="checklist__input js-checklist__input idb-on-change" type="checkbox" id="111" name="country" value="111">
                Germany
              </input>
              </label>
            </li>"""

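soup = bs(html, "html.parser")  # name the parser explicitly so bs4 does not have to guess one
label = soup.find("label").text  # still contains surrounding spaces and newlines
country_id = soup.find("input").get("value")  # avoids shadowing the built-in id()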

You will have to clean the label as there are some extraneous spaces and newline characters in the output, but you should be able to extend this example however you need for further processing of these items.

To process multiple list items that all have the same markup format as above, you can use this snippet:

lis = soup.find_all("li")  # Returns a list of all <li> elements in the markup.
for li in lis:
    li_label = li.find("label").text
    li_id = li.find("input").get("id")
    print(li_label, li_id)
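
Since the question asks for a dictionary or DataFrame rather than printed output, the loop above could be extended along these lines. This is only a sketch: the second list entry (France, 112) is a hypothetical item added for illustration, and the names `records` and `countries` are my own, not part of the original markup.

```python
from bs4 import BeautifulSoup

# Sample markup: the first <li> is from the question, the second is a
# hypothetical entry added only to show that the loop handles several items.
html = """<ul class='checklist__list'>
  <li class=' checklist__item' id='checklist__item--country-111'>
    <label class='checklist__label ripple-animation'>
      <input class="checklist__input" type="checkbox" id="111" name="country" value="111">
        Germany
      </input>
    </label>
  </li>
  <li class=' checklist__item' id='checklist__item--country-112'>
    <label class='checklist__label ripple-animation'>
      <input class="checklist__input" type="checkbox" id="112" name="country" value="112">
        France
      </input>
    </label>
  </li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for li in soup.find_all("li"):
    box = li.find("input")
    label = li.find("label")
    if box is None or label is None:
        continue  # skip list items that do not follow the expected markup
    # Collapse the surrounding whitespace and newlines into single spaces.
    name = " ".join(label.get_text().split())
    records.append({"country_id": box.get("value"), "country": name})

# Map each id (kept as a string, as it appears in the attribute) to its name.
countries = {r["country_id"]: r["country"] for r in records}
print(countries)
```

If pandas is installed, passing the list of dictionaries to `pd.DataFrame(records)` should give a two-column frame with `country_id` and `country`.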


1 Comment

Nini Over a year ago

Thanks, that worked nicely! I removed the spaces and line breaks with the following line: li_country = ' '.join(li.find("label").text.split())
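
To illustrate why the split/join approach in this comment works, here is a small standalone sketch. The `raw_label` string is an assumed approximation of what `.text` returns for the label above, and the two-word name is a hypothetical case rather than data from the question.

```python
# Assumed shape of the raw label text: the name padded with spaces and newlines.
raw_label = "\n        \n          Germany\n        \n      "

# str.split() with no arguments splits on any run of whitespace and drops
# empty strings, so joining the pieces with one space yields a clean name.
clean = " ".join(raw_label.split())
print(repr(clean))

# Hypothetical multi-word name with a line break in the middle: strip() alone
# would keep the inner newline, while split/join normalizes it to one space.
raw_two_words = "  United\n   Kingdom  "
clean_two_words = " ".join(raw_two_words.split())
print(repr(clean_two_words))
```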
