I have the source code of a website that contains several lists. I would like to extract the information from these lists into a usable format in Python.

For example, see the first entry of a list of countries below:

<ul class='checklist__list'>

    <li class=' checklist__item' id='checklist__item--country-111'>
      <label class='checklist__label ripple-animation'>
        <input  class="checklist__input js-checklist__input idb-on-change" type="checkbox" id="111" name="country" value="111">
          Germany
        </input>
      </label>
    </li>

Say I am interested in the country id (here: 111) and the matching country name (here: Germany), and would like to have them in a usable format in Python, for example a pandas DataFrame or a dictionary.

Does anyone know an easy way to do that? The original list contains >100 countries.

Thank you very much for any suggestions!

Looks like Beautiful Soup would be the easy thing here. Commented May 2, 2018 at 14:56

1 Answer

You can solve this problem easily with BeautifulSoup. Given the markup you've posted in your question, this code snippet should extract the id and label:

from bs4 import BeautifulSoup as bs
html = """<ul class='checklist__list'>
            <li class=' checklist__item' id='checklist__item--country-111'>
              <label class='checklist__label ripple-animation'>
              <input  class="checklist__input js-checklist__input idb-on-change" type="checkbox" id="111" name="country" value="111">
                Germany
              </input>
              </label>
            </li>"""

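# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = bs(html, "html.parser")
label = soup.find("label").text  # still contains the surrounding whitespace
country_id = soup.find("input").get("value")  # avoids shadowing the built-in id()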

You will have to clean the label as there are some extraneous spaces and newline characters in the output, but you should be able to extend this example however you need for further processing of these items.
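As a minimal sketch of that cleaning step, the raw label text can be split on whitespace and rejoined with single spaces. The helper name `normalize_label` and the sample strings are hypothetical and only serve the illustration; the first string mimics the kind of padding the markup above produces.

```python
def normalize_label(text: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    return " ".join(text.split())


# Hypothetical raw values, similar to what .text returns for the <label> tag.
raw_label = "\n                Germany\n              \n"
print(repr(normalize_label(raw_label)))

# Unlike str.strip(), this also collapses whitespace inside the name.
print(repr(normalize_label("  United \n   Kingdom ")))
```

Plain `strip()` would be enough for single-word names; the split-and-join form is a slightly safer default when a name might be broken across lines in the markup.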

To process multiple list items that all have the same markup format as above, you can use this snippet:

lis = soup.find_all("li")  # This returns a list of all list items in the markup.
for li in lis:
    li_label = li.find("label").text
    li_id = li.find("input").get("id")
    print(li_label, li_id)
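Since the question asks for a dictionary or DataFrame, here is a rough sketch of how the loop above could collect its results instead of printing them. It assumes every `<li>` follows the markup shown in the question; the second list entry (id 222) is made up for illustration, and the variable names are not from the original code.

```python
from bs4 import BeautifulSoup

# Sample markup: the first entry is from the question, the second is hypothetical.
html = """
<ul class='checklist__list'>
  <li class=' checklist__item' id='checklist__item--country-111'>
    <label class='checklist__label ripple-animation'>
      <input class="checklist__input" type="checkbox" id="111" name="country" value="111">
        Germany
      </input>
    </label>
  </li>
  <li class=' checklist__item' id='checklist__item--country-222'>
    <label class='checklist__label ripple-animation'>
      <input class="checklist__input" type="checkbox" id="222" name="country" value="222">
        United Kingdom
      </input>
    </label>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

countries = {}  # maps country id (as a string) to the cleaned country name
for li in soup.find_all("li"):
    checkbox = li.find("input")
    label = li.find("label")
    if checkbox is None or label is None:
        continue  # skip list items that do not match the expected markup
    # Collapse the whitespace around and inside the label text.
    name = " ".join(label.get_text().split())
    countries[checkbox.get("value")] = name

print(countries)
```

If pandas is installed, the resulting dictionary can be turned into a two-column table with something like `pd.DataFrame(list(countries.items()), columns=["id", "country"])`.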

1 Comment

Thanks, that worked nicely! I removed the spaces and line breaks with the following line: li_country = ' '.join(li.find("label").text.split())
