I'm trying to scrape some data from an e-commerce website for a personal project. I'm building a nested list of strings from the HTML, but I'm having trouble with one part of it. Each list item appears as follows:
<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>
What I have now is a regex that pulls out everything inside the braces of the data-impressions attribute, and then I split the result at each comma:
list_return = [re.findall('\{([^{]+[^}\'></div>])', i) for i in bathshower_impressions]
list_return = [re.split(',', matches[0]) for matches in list_return]
This gives me a list of lists of lists, where each innermost list will become a key:value pair in a dictionary. For the example above, here is what the second-level item would be:
[['"id"', '"01920"'],
['"name"', '"Sleepy"'],
['"price"', '12.95'],
['"brand"', '"Lush"'],
['"category"', '"Bubble Bar"'],
['"variant"', '"7 oz."'],
['"quantity"', '1'],
['"list"', '"/bath/bubble-bars/sleepy/9999901920.html"'],
['"dimension11"', '""'],
['"dimension12"', '"Naked'],
['Self Preserving'],
['Vegan"'],
['"dimension13"', '1'],
['"dimension14"', '1'],
['"dimension15"', 'true']]
My problem is with dimension12: I can't figure out how to keep its value from being split at the commas, so that the entry would appear as:
['"dimension12"', '"Naked,Self Preserving,Vegan"']
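For what it's worth, the attribute value looks like a JSON object, so one possible direction is to parse it instead of splitting on commas. The following is only a minimal sketch under some assumptions: the attribute value has already been extracted as a plain string (here a shortened copy of the example, stored in a made-up variable raw), and to_pairs is a hypothetical helper name, not something from my current code. It uses json.loads to parse and json.dumps to get back the quoted string form shown above:

```python
import json

# Assumption: raw holds the data-impressions attribute value as a plain string
# (shortened from the example above; the surrounding quotes are already removed).
raw = '{"id":"01920","name":"Sleepy","price":12.95,"dimension12":"Naked,Self Preserving,Vegan","dimension15":true}'

def to_pairs(attr_value):
    """Hypothetical helper: parse the JSON text and return [key, value] string pairs."""
    data = json.loads(attr_value)
    # json.dumps re-quotes strings and keeps numbers and booleans in JSON form,
    # so commas inside a string value never act as separators.
    return [[json.dumps(key), json.dumps(value)] for key, value in data.items()]

pairs = to_pairs(raw)
for pair in pairs:
    print(pair)
```

If the end goal is a dictionary anyway, json.loads(raw) on its own already returns one, which may make the intermediate list of pairs unnecessary.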
Any help is appreciated, thanks.