Exclude part of a string with regex in Python when web scraping

Question

I'm scraping data from an e-commerce website for a personal project. I want to build a nested list of strings from the HTML, but one part of it is causing trouble. Each list item looks like this:

<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>

What I have now is a regex that pulls out everything inside the data-impressions attribute, after which I split the result at each comma:

list_return = [re.findall('\{([^{]+[^}\'></div>])', i) for i in bathshower_impressions]
list_return = [re.split(',', list_return[i][0]) for i in range(0, len(list_return))]

This gives me a list of lists of lists, where each innermost list will become a key:value pair in a dictionary. For the example above, the second-level item is:

[['"id"', '"01920"'],
  ['"name"', '"Sleepy"'],
  ['"price"', '12.95'],
  ['"brand"', '"Lush"'],
  ['"category"', '"Bubble Bar"'],
  ['"variant"', '"7 oz."'],
  ['"quantity"', '1'],
  ['"list"', '"/bath/bubble-bars/sleepy/9999901920.html"'],
  ['"dimension11"', '""'],
  ['"dimension12"', '"Naked'],
  ['Self Preserving'],
  ['Vegan"'],
  ['"dimension13"', '1'],
  ['"dimension14"', '1'],
  ['"dimension15"', 'true']]

My problem is with dimension12: I can't figure out how to keep its value from being split at the commas, so that the entry appears as:

['"dimension12"', '"Naked,Self Preserving,Vegan"']

Any help is appreciated, thanks.

baduker · Accepted Answer · 2020-10-29 18:02:11Z

I'd like to suggest a slightly different approach. That attribute value looks like JSON, so why not use the json module? That way you get a ready-made data structure that you can modify further.

import json
from bs4 import BeautifulSoup


html_list = [
"""<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>""",
]

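data_structures = []
for html_item in html_list:
    # Locate the <div> that carries the JSON payload in its data-impressions attribute.
    div = BeautifulSoup(html_item, "html.parser").find("div", {"class": "impressions"})
    # The attribute value is a JSON object, so json.loads turns it into a dict.
    data_structures.append(json.loads(div["data-impressions"]))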

print(data_structures)

This outputs a list of dictionaries:

[{'id': '01920', 'name': 'Sleepy', 'price': 12.95, 'brand': 'Lush', 'category': 'Bubble Bar', 'variant': '7 oz.', 'quantity': 1, 'list': '/bath/bubble-bars/sleepy/9999901920.html', 'dimension11': '', 'dimension12': 'Naked,Self Preserving,Vegan', 'dimension13': 1, 'dimension14': 1, 'dimension15': True}]

To access the desired key, just do this:

for data_item in data_structures:
    print(data_item["dimension12"])

Prints: Naked,Self Preserving,Vegan

edited Oct 29, 2020 at 18:02

answered Oct 29, 2020 at 17:56

baduker

5 Comments

ambrrrgris Over a year ago

This worked great, thanks! Can I ask what tipped you off to it being JSON?

baduker Over a year ago

Years of experience. ;)

baduker Over a year ago

JSON is often embedded into HTML, so it felt like a natural way to go.

ambrrrgris Over a year ago

Makes sense. This is my first outing with web scraping and data analysis without a tutorial, so I'm still working on intuiting which processes to use.

baduker Over a year ago

It'll come with time and practice. Happy coding!
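
To illustrate the point made in the comments about JSON embedded in HTML attributes, here is a rough sketch that collects every data-impressions payload from a page and skips values that do not parse. It swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs. The class name ImpressionCollector, the second product record, and the "not json" div are made up for the example.

```python
import json
from html.parser import HTMLParser


class ImpressionCollector(HTMLParser):
    """Collect JSON payloads stored in data-impressions attributes."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "data-impressions" or value is None:
                continue
            try:
                self.records.append(json.loads(value))
            except json.JSONDecodeError:
                # Not valid JSON: skip it rather than guess at the format.
                pass


# Hypothetical page fragment: only the first record comes from the question.
page = (
    '<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy"}\'></div>'
    '<div class="impressions" data-impressions=\'{"id":"00001","name":"Example"}\'></div>'
    '<div class="impressions" data-impressions="not json"></div>'
)

collector = ImpressionCollector()
collector.feed(page)
collector.close()

for record in collector.records:
    print(record["id"], record["name"])
```

A quick check for whether an attribute holds JSON is simply to try json.loads on it, as the try/except above does.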
