I'm trying to scrape some data from an e-commerce website for a personal project. I want to build a nested list of strings from the HTML, but I'm having an issue with one part of it. Each list item looks like this:

<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>

What I have now is a regex that extracts the contents of the data-impressions attribute and splits them at each comma:

list_return = [re.findall('\{([^{]+[^}\'></div>])', i) for i in bathshower_impressions]
list_return = [re.split(',', list_return[i][0]) for i in range(0, len(list_return))]

This gives me a list of lists of lists, where each innermost list will become a key:value pair in a dictionary. For the example above, the second-level item is:

[['"id"', '"01920"'],
  ['"name"', '"Sleepy"'],
  ['"price"', '12.95'],
  ['"brand"', '"Lush"'],
  ['"category"', '"Bubble Bar"'],
  ['"variant"', '"7 oz."'],
  ['"quantity"', '1'],
  ['"list"', '"/bath/bubble-bars/sleepy/9999901920.html"'],
  ['"dimension11"', '""'],
  ['"dimension12"', '"Naked'],
  ['Self Preserving'],
  ['Vegan"'],
  ['"dimension13"', '1'],
  ['"dimension14"', '1'],
  ['"dimension15"', 'true']]

My problem is with dimension12: its value contains commas, and I can't figure out how to keep it from being split there, so that its list would appear as:

['"dimension12"', '"Naked,Self Preserving,Vegan"']

Any help is appreciated, thanks.

1 Answer

I'd like to suggest a slightly different approach. That attribute value looks like JSON, so why not use the json module? That way you get a ready-made data structure that you can modify further.

import json
from bs4 import BeautifulSoup


html_list = [
"""<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>""",
]

data_structures = []
for html_item in html_list:
    soup = BeautifulSoup(html_item, "html.parser").find("div", {"class": "impressions"})
    data_structures.append(json.loads(soup["data-impressions"]))

print(data_structures)

This outputs a list of dictionaries:

[{'id': '01920', 'name': 'Sleepy', 'price': 12.95, 'brand': 'Lush', 'category': 'Bubble Bar', 'variant': '7 oz.', 'quantity': 1, 'list': '/bath/bubble-bars/sleepy/9999901920.html', 'dimension11': '', 'dimension12': 'Naked,Self Preserving,Vegan', 'dimension13': 1, 'dimension14': 1, 'dimension15': True}]

To access the desired key, just do this:

for data_item in data_structures:
    print(data_item["dimension12"])

Prints: Naked,Self Preserving,Vegan
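
If you still want the nested list of [key, value] pairs from the question, one possible sketch is to build it from the parsed dictionary instead of splitting the raw string. The names raw, record, pairs and tags are made up for this illustration, and splitting dimension12 into separate tags is an optional extra step that the question did not ask for.

```python
import json

# Attribute value copied from the question's sample <div>.
raw = '{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}'

record = json.loads(raw)

# One [key, value] pair per field; values keep their parsed types
# (str, int, float, bool) rather than being left as quoted strings.
pairs = [[key, value] for key, value in record.items()]

# Optional (assumption): break the comma-separated value into a list.
tags = record["dimension12"].split(",")

print(pairs[9])  # ['dimension12', 'Naked,Self Preserving,Vegan']
print(tags)      # ['Naked', 'Self Preserving', 'Vegan']
```

Because the JSON parser knows which commas sit inside a quoted string, dimension12 stays in one piece without any special handling.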

5 Comments

This worked great, thanks! Can I ask what tipped you off to it being JSON?
Years of experience. ;)
JSON is often embedded into HTML, so it felt like a natural way to go.
Makes sense. This is my first outing with web scraping and data analysis without a tutorial, so I'm still working on intuiting which processes to use.
It'll come with time and practice. Happy coding!
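
Following up on the point in the comments that JSON is often embedded into HTML, here is a rough sketch of one way to check for it using only the standard library: try json.loads on every attribute value that starts with { or [ and keep whatever parses. The class name JsonAttrCollector and the shortened sample markup are made up for this illustration; this is not what the answer above does, just a possible way to explore an unfamiliar page.

```python
import json
from html.parser import HTMLParser


class JsonAttrCollector(HTMLParser):
    """Collects attribute values that parse as JSON objects or arrays."""

    def __init__(self):
        super().__init__()
        self.found = []  # tuples of (tag, attribute name, parsed value)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Cheap pre-check: skip valueless attributes and anything
            # that does not start like a JSON object or array.
            if not value or value.lstrip()[:1] not in ("{", "["):
                continue
            try:
                self.found.append((tag, name, json.loads(value)))
            except json.JSONDecodeError:
                pass  # looked like JSON but was not valid


collector = JsonAttrCollector()
# Shortened version of the question's markup (assumption: two fields only).
collector.feed('<div class="impressions" data-impressions=\'{"id":"01920","price":12.95}\'></div>')
print(collector.found)
```

The class attribute is skipped by the pre-check, so only data-impressions ends up in collector.found.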
