I'm trying to create a list of filter facets. I've loaded all the <span> elements into a list with bs4 and now need to grab a specific substring out of the larger string that is each <span>. I want to load each filter facet name into a list, ending up with something like this: [size, width, colour, etc].

List generated with bs4:

[<span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Size" data-v-05f803b1="">Size</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Width" data-v-05f803b1="">Width</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Colour" data-v-05f803b1="">Colour</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Heel Height" data-v-05f803b1="">Heel Height</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Product Type" data-v-05f803b1="">Product Type</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Function" data-v-05f803b1="">Function</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Age" data-v-05f803b1="">Age</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Technology" data-v-05f803b1="">Technology</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Material" data-v-05f803b1="">Material</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Price" data-v-05f803b1="">Price</span>]

What I've tried, which doesn't seem to get me anywhere:

facetcode = [str(i) for i in spans]

facets = []

for i in facetcode:
    facetcode1 = i.split(' ')
    for y in facetcode1:
        if 'data-facet-name' == True:
            print(y)

When I print(y) it gives me a blank list, but I'm expecting something like: data-facet-name="Size"

The result I want:

[size, width, colour, etc]

Am I overcomplicating this? The idea is to iterate over each list element and load only the text I want into a new list.

2
  • Converting to string and then parsing those strings seems like totally the wrong approach. You have a structured markup language there; use a tool which understands its structure instead of writing your own ad hoc parser. Commented Sep 6, 2019 at 18:15
  • 1
    This is one of the best first posts I've seen in a while! Well done Commented Sep 6, 2019 at 18:15

4 Answers

2

You want to extract the data-facet-name attribute from the spans that have that attribute. The set comprehension below removes duplicates; if you really want a list, you can convert the set to a list afterwards.

from bs4 import BeautifulSoup as bs

html = '''
<html>
 <head></head>
 <body>
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Size" data-v-05f803b1="">Size</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Width" data-v-05f803b1="">Width</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Colour" data-v-05f803b1="">Colour</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Heel Height" data-v-05f803b1="">Heel Height</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Product Type" data-v-05f803b1="">Product Type</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Function" data-v-05f803b1="">Function</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Age" data-v-05f803b1="">Age</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Technology" data-v-05f803b1="">Technology</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Material" data-v-05f803b1="">Material</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Price" data-v-05f803b1="">Price</span>
 </body>
</html>
  '''
soup = bs(html, 'lxml') #or 'html.parser'
print({i['data-facet-name'] for i in soup.select('span[data-facet-name]')})


2 Comments

Thanks, this works! What is causing the printed result to be in a different order than the spans in the html?
Sets don't have order. I suspect you could just use a list comprehension and then remove the duplicates from it.
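One way to act on that suggestion is sketched below. It assumes Python 3.7+ (where dict preserves insertion order) and uses a shortened, made-up stand-in for the page markup with one facet repeated on purpose; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real markup; "Size" is repeated deliberately.
html = '''
<span data-facet-name="Size">Size</span>
<span data-facet-name="Width">Width</span>
<span data-facet-name="Size">Size</span>
<span data-facet-name="Heel Height">Heel Height</span>
<span class="no-facet">ignored</span>
'''

soup = BeautifulSoup(html, 'html.parser')

# A list comprehension keeps the names in document order.
names = [span['data-facet-name'] for span in soup.select('span[data-facet-name]')]

# dict.fromkeys drops duplicates while keeping first-seen order.
facets = list(dict.fromkeys(names))

print(facets)  # ['Size', 'Width', 'Heel Height']
```

If duplicates cannot occur on the page, the list comprehension alone is enough.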
1

I think you might be missing some of the power of BS4!

import bs4

soup = bs4.BeautifulSoup('''<span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Size" data-v-05f803b1="">Size</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Width" data-v-05f803b1="">Width</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Colour" data-v-05f803b1="">Colour</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Heel Height" data-v-05f803b1="">Heel Height</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Product Type" data-v-05f803b1="">Product Type</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Function" data-v-05f803b1="">Function</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Age" data-v-05f803b1="">Age</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Technology" data-v-05f803b1="">Technology</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Material" data-v-05f803b1="">Material</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Price" data-v-05f803b1="">Price</span>''', 'html.parser')

for span in soup.find_all('span', **{'data-facet-name': True}):
    print(span['data-facet-name'])

1 Comment

I think you're right! Thanks for taking a moment to teach.

facets = []
for span in soup.findAll('span', **{'data-facet-name': True}):
    facets.append(span['data-facet-name'])
print(facets)
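The loop-and-append pattern in that comment can probably be condensed into a list comprehension. The sketch below also passes the filter through find_all's attrs argument, which should be equivalent to the **{...} spelling; the two-facet HTML string is a made-up stand-in for the real page.

```python
import bs4

# Made-up, shortened markup; the last span has no facet attribute.
html = ('<span data-facet-name="Size">Size</span>, '
        '<span data-facet-name="Product Type">Product Type</span>, '
        '<span>other</span>')
soup = bs4.BeautifulSoup(html, 'html.parser')

# attrs={'data-facet-name': True} matches any span that has the attribute,
# whatever its value.
facets = [span['data-facet-name']
          for span in soup.find_all('span', attrs={'data-facet-name': True})]

print(facets)  # ['Size', 'Product Type']
```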
1

Here's a greedy list comprehension, assuming your data is in a list named bs4_arr (it is converted to a single string with str() first, because a list has no split() method):

attributes = ['='.join(word.split('=')[1:]).strip('"') for word in str(bs4_arr).split() if word.split('=')[0] == 'data-facet-name']

Here's what it's doing:

  • iterate through every word in your HTML list
  • split the word on =
  • if the attribute name is data-facet-name, then we append the attribute value to our result

This is "greedy" in the sense that it does redundant work: it calls word.split('=') twice for every word.

You can do it without a list comprehension, as well (less greedy):

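attributes = []
# str() turns the bs4 list into one string so it can be split into words
for word in str(bs4_arr).split():
    tokens = word.split('=')  # split only once per word
    name = tokens[0]
    value = '='.join(tokens[1:]).strip('"')
    if name == 'data-facet-name':
        attributes.append(value)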

A better approach, however, would be to continue using BeautifulSoup to parse your HTML, since splitting on whitespace breaks any attribute value that itself contains a space.
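To make the weakness of whitespace splitting concrete, here is a small stdlib-only sketch on a single stringified span. The markup variable is a hypothetical stand-in for one element of the asker's list, and str.partition is used in place of the split/join pair.

```python
# Hypothetical stand-in for one stringified <span> from the question.
markup = ('<span class="col-sm-8 col-xs-9" data-facet-name="Heel Height" '
          'data-v-05f803b1="">Heel Height</span>')

attributes = []
for word in markup.split():
    # partition splits on the first '=' only
    name, _, value = word.partition('=')
    if name == 'data-facet-name':
        attributes.append(value.strip('"'))

# The value is cut at the space inside "Heel Height".
print(attributes)  # ['Heel']
```

A parser that understands attribute quoting returns the whole value, which is why staying with BeautifulSoup is the safer route here.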

0

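You're printing y only when the string "data-facet-name" is equal to True, which it never is. I think you want that line to be if 'data-facet-name' in y, because each y is a whole token such as data-facet-name="Size" and would not equal the bare attribute name.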
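The candidate conditions can be compared on a single token of the shape the question expects. This is only a sketch; token and markup are made-up examples, not the asker's data.

```python
token = 'data-facet-name="Size"'

# A non-empty string is truthy, but it is never *equal* to True.
compares_to_true = ('data-facet-name' == True)
# The token carries the value too, so an exact match fails as well.
exact_match = (token == 'data-facet-name')
# A substring test is what actually finds the token.
substring_match = ('data-facet-name' in token)
print(compares_to_true, exact_match, substring_match)  # False False True

# The asker's loop with the substring test, on one shortened span string.
markup = '<span class="x" data-facet-name="Size" data-v-05f803b1="">Size</span>'
matches = [y for y in markup.split(' ') if 'data-facet-name' in y]
print(matches)  # ['data-facet-name="Size"']
```

This still splits multi-word values such as "Heel Height", so it fixes the condition but not the underlying approach.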
