I'm a bit of a noob to coding so sorry if this is a dumb question, but I'm trying to write a general purpose scraper for getting some product data using the "schema.org/Product" HTML microdata.

However, I ran into an issue when testing. On one page in particular, the name was being set to "Electronics" from the Breadcrumbs schema, because some of the itemprop elements had ancestor elements with different itemtypes/schemas.

I first have this variable declared to check if the page has an element using the Product schema microdata.

var productMicrodata = document.querySelector('[itemscope][itemtype="https://schema.org/Product"], [itemscope][itemtype="http://schema.org/Product"]');

I then wanted to select for all elements with the itemprop attribute. e.g.

productMicrodata.querySelectorAll('[itemprop]');

The issue however is that I want to ignore any elements that have other ancestors with different itemtypes/schema attributes, as in this instance the Breadcrumbs and ListItem schema data is still being included.

I figured I would then just be able to do something like this:

productMicrodata.querySelectorAll(':not([itemscope]) [itemprop]');

However this is still returning matches for the child elements having ancestor elements with different itemscope attributes (e.g. breadcrumbs).

I'm sure I'm just missing something super obvious, but any help on how I can achieve only selecting elements that have only the one ancestor with itemtype="http://schema.org/Product" attribute would be much appreciated.

EDIT: To clarify which element(s) I'm trying to avoid matching, here's a screenshot of what the DOM looks like on the example page. I'm trying to ignore the elements that have any ancestors with other itemtype attributes.

EDIT 2: changed incorrect use of parent to ancestor. Apologies, I am still new to this :|

EDIT 4/SOLUTION: I've found a non-CSS solution for what I'm trying to achieve using the JavaScript Element.closest() method, e.g.

let productMicrodata = document.querySelectorAll('[itemprop]');
let itemProp = {};

for (let i = 0; i < productMicrodata.length; i++) {
    // closest() returns null if neither the element nor any ancestor has an itemtype
    let scope = productMicrodata[i].closest('[itemtype]');
    let itemType = scope ? scope.getAttribute('itemtype') : null;
    if (itemType === "http://schema.org/Product" || itemType === "https://schema.org/Product") {
        itemProp[productMicrodata[i].getAttribute('itemprop')] = productMicrodata[i].textContent;
    }
}

console.log(itemProp);
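
A possible variation on the loop above, offered only as a rough sketch: instead of comparing itemtype strings, compare the nearest itemscope ancestor with the Product element itself. Starting the closest() lookup from the parent means a nested item such as an offer (an element carrying both itemprop and itemscope) is still recorded as a property of the Product, while its own inner properties are skipped. The names PRODUCT_TYPES, belongsToScope and collectProductProps are made up for this example, and itemref is not handled.

```javascript
// Both URL forms seen in the question.
const PRODUCT_TYPES = ['http://schema.org/Product', 'https://schema.org/Product'];

// Hypothetical helper: true when the nearest itemscope ancestor of `el`
// (not counting `el` itself) is exactly `scopeRoot`.
function belongsToScope(el, scopeRoot) {
  const parent = el.parentElement;
  return parent !== null && parent.closest('[itemscope]') === scopeRoot;
}

// Hypothetical helper: map itemprop names to trimmed text for the
// properties that belong directly to `productRoot`.
function collectProductProps(productRoot) {
  const props = {};
  for (const el of productRoot.querySelectorAll('[itemprop]')) {
    if (belongsToScope(el, productRoot)) {
      props[el.getAttribute('itemprop')] = el.textContent.trim();
    }
  }
  return props;
}

// Only runs where a DOM is available (e.g. the browser console).
if (typeof document !== 'undefined') {
  const selector = PRODUCT_TYPES
    .map((type) => `[itemscope][itemtype="${type}"]`)
    .join(', ');
  const productRoot = document.querySelector(selector);
  if (productRoot) {
    console.log(collectProductProps(productRoot));
  }
}
```

Like the original loop, this reads textContent only, so values stored in attributes (for example the content of a meta element or the href of a link) would need extra handling.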

[Screenshot: itemprop elements with itemtype parent attributes]

2 Answers

:not([itemscope]) [itemprop] means:

An element with an itemprop attribute that has any ancestor without an itemscope attribute.

So:

<div>
    <div itemscope>
        <div itemprop> <!-- this one -->
        </div>
    </div>
</div>

… would match because while the parent element has the attribute, the grandparent does not.

You need to use the child combinator to eliminate elements whose parent element has the attribute:

:not([itemscope]) > [itemprop]
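
To make the difference between the two combinators concrete, here is a rough sketch that re-implements each selector as a plain predicate over parentElement and hasAttribute, next to the "no itemscope ancestor inside the Product" check the question is asking for. All function names are invented for this illustration. Note that the selector is matched against the whole document even when querySelectorAll is called on an element, so html and body count as ancestors too.

```javascript
// Collect an element's ancestors, nearest first.
function ancestorsOf(el) {
  const list = [];
  for (let node = el.parentElement; node; node = node.parentElement) {
    list.push(node);
  }
  return list;
}

// ":not([itemscope]) [itemprop]": at least ONE ancestor lacks itemscope.
function matchesDescendantForm(el) {
  return el.hasAttribute('itemprop') &&
    ancestorsOf(el).some((a) => !a.hasAttribute('itemscope'));
}

// ":not([itemscope]) > [itemprop]": the direct parent lacks itemscope.
function matchesChildForm(el) {
  const parent = el.parentElement;
  return el.hasAttribute('itemprop') &&
    parent !== null && !parent.hasAttribute('itemscope');
}

// What the question wants: no itemscope on any ancestor between the
// element and the Product root. Assumes `el` is inside `productRoot`.
function hasNoForeignScope(el, productRoot) {
  if (!el.hasAttribute('itemprop')) return false;
  for (let node = el.parentElement; node && node !== productRoot; node = node.parentElement) {
    if (node.hasAttribute('itemscope')) return false;
  }
  return true;
}
```

Under these definitions the descendant form accepts nearly every itemprop element, because html itself has no itemscope. The child form looks at one level only, so it rejects properties sitting directly under the Product element and accepts breadcrumb properties that are wrapped in a plain element.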

4 Comments

Thanks so much for the quick response, although the selector you've provided doesn't really address my problems. Apologies as I don't think I've explained well, I've added a screenshot for more clarity. But basically I'm just trying to match any element with an [itemprop] attribute, as long as it doesn't have any parents (direct or non-direct) that have an [itemscope] attribute if that makes sense.
@WabiSabi — There's no such thing as a "non-direct parent". An element has one parent. If you keep going up then those are ancestors. (The parent is also an ancestor).
Ah yes thanks for the clarification, have updated the post to replace parent with ancestor as that is exactly what I meant. Apologies, still quite new to all this.
Just found one of your posts here from a while back that I think addresses what I'm trying to do. Is it still the case that this isn't possible with CSS selectors?

[...] help on how I can achieve only selecting elements that have only the itemtype="http://schema.org/Product" attribute would be much appreciated.

Attribute selectors can take explicit values:

[myAttribute="myValue"]

So the syntax for this would be:

var productMicrodata = document.querySelectorAll('[itemtype="http://schema.org/Product"]');

1 Comment

Thanks so much for your answer. I think you've missed the issue I'm having however. Please see the edited post and comment on @quentin 's response for as to what I'm trying to achieve. To copy and paste from that comment, " I'm just trying to match any element with an [itemprop] attribute, as long as it doesn't have any parents (direct or non-direct) that have an [itemscope] attribute"
