Parsing an XML file with Nokogiri to determine the path (Ruby)

Question

My code is supposed to "guess" the path(s) leading to the relevant text nodes in my XML file. Relevant in this case means text nodes nested within the recurring product/person/something tag, but not text nodes that appear outside of it.

This code:

    require "nokogiri"

    @doc, items = Nokogiri.XML(@file), []

    path = []
    # traverse yields children before their parent, so names arrive deepest-first
    @doc.traverse do |node|
      next unless node.element?
      # an element is part of the path only if it has at least one child element
      is_path_element = node.element_children.any?
      path.push(node.name) if is_path_element && !path.include?(node.name)
    end
    final_path = "/" + path.reverse.join("/")

works for simple XML files, for example:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <description>Some XML file description</description>
    <item>
      <title>Some product title</title>
      <brand>Some product brand</brand>
    </item>
    <item>
      <title>Some product title</title>
      <brand>Some product brand</brand>
    </item>
  </channel>
</rss>

puts final_path # => "/rss/channel/item"

But how should I approach the problem when the file gets more complicated? For example, with this one:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <description>Some XML file description</description>
    <item>
      <titles>
        <title>Some product title</title>
      </titles>
      <brands>
        <brand>Some product brand</brand>
      </brands>
    </item>
    <item>
      <titles>
        <title>Some product title</title>
      </titles>
      <brands>
        <brand>Some product brand</brand>
      </brands>
    </item>
  </channel>
</rss>

Not sure I understand: what should analysis of the second XML example produce, and why? Your code appears to construct a path (that may not exist) consisting of anything that has child elements . . .
– Neil Slater, Commented Mar 28, 2013 at 19:20
Hi Neil. My assumption is that the text nodes that lie at the deepest level must be relevant. So the code is supposed to determine the path that leads to the most deeply nested text nodes. Why should the determined path not exist?
– Cjoerg, Commented Mar 28, 2013 at 19:32
Sorry, I forgot to answer the first question: I am not certain what the analysis of the second XML should produce. Either multiple paths, e.g. "/rss/channel/item/titles" + "/rss/channel/item/brands", or maybe some regex, e.g. /\/rss\/channel\/item\/.*/
– Cjoerg, Commented Mar 28, 2013 at 19:37
If you have two equally deep structures (e.g. if your first file had /rss/channel/owner with children as well as /rss/channel/item), both will get added to your array, and you would see something like "/rss/channel/item/owner".
– Neil Slater, Commented Mar 28, 2013 at 19:37
If I take "My assumption is that the text nodes that lie at the deepest level must be relevant" at face value, it would be straightforward to give you something that lists the containers for the deepest structures; your code would only need slight changes. Are you sure that's what you want at this stage?
– Neil Slater, Commented Mar 28, 2013 at 19:41

Neil Slater · Accepted Answer · 2013-03-28 21:52:35Z

4

If you are looking for a list of the deepest "parent" paths in the XML, there is more than one way to approach it.

Although I think your own code could be adjusted to produce the same output, I was convinced the same thing could be achieved with XPath, and I wanted to brush up my XML skills (I have not used Nokogiri yet, but I will need to professionally soon). So here is how to get all parent paths that have exactly one level of child elements beneath them, using XPath:

xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }

The output of this for your second example file is:

/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands

If you take this list, gsub out the indexes, and then make the array unique, the result looks a lot like the output of your loop:

paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
# use uniq rather than uniq!, which returns nil when there are no duplicates
paths = paths.map { |path| path.gsub(/\[[0-9]+\]/, '') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
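
The normalization step can be tried on its own, without parsing any XML. The sketch below uses hard-coded strings copied from the output listed earlier. It also shows why the non-bang `uniq` is the safer choice when the return value is used directly: `uniq!` returns `nil` when it finds nothing to remove.

```ruby
# Paths as printed by node.path for the second example file.
indexed = [
  "/rss/channel/item[1]/titles",
  "/rss/channel/item[1]/brands",
  "/rss/channel/item[2]/titles",
  "/rss/channel/item[2]/brands",
]

# Strip positional predicates such as [1] or [2], then drop duplicates.
normalized = indexed.map { |path| path.gsub(/\[\d+\]/, "") }.uniq
p normalized
# => ["/rss/channel/item/titles", "/rss/channel/item/brands"]

# The array is already unique, so uniq! has nothing to do and returns nil.
p normalized.uniq! # => nil
p normalized.uniq  # => ["/rss/channel/item/titles", "/rss/channel/item/brands"]
```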

Or in one line:

paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]

edited Mar 28, 2013 at 21:52

answered Mar 28, 2013 at 21:42

Neil Slater


Cjoerg Over a year ago

That is just the most beautiful I have ever seen. Thanks very much.

Ivan Ivanchuk · Answer · 2022-09-23 02:46:14Z

0

I created a library for building XPath expressions.

xpath = Jini.new
        .add_path('parent')
        .add_path('child')
        .add_all('toys')
        .add_attr('name', 'plane')
        .to_s
puts xpath # => /parent/child//toys[@name="plane"]

edited Sep 23, 2022 at 2:46

answered Sep 15, 2022 at 17:58

Ivan Ivanchuk


Adrian Mole Over a year ago

When linking a site/repo/blog that is your own, you must explicitly disclose that fact. Otherwise, your post is likely to be flagged as spam.
