My code is supposed to "guess" the path(s) leading to the relevant text nodes in my XML file. "Relevant" here means text nodes nested within the recurring product/person/something tag, not text nodes that appear outside of it.

This code:

    @doc, items = Nokogiri.XML(@file), []

    path = []
    @doc.traverse do |node|
      if node.class.to_s == "Nokogiri::XML::Element"
        is_path_element = false
        node.children.each do |child|
          is_path_element = true if child.class.to_s == "Nokogiri::XML::Element"
        end
        path.push(node.name) if is_path_element == true && !path.include?(node.name)
      end
    end
    final_path = "/"+path.reverse.join("/")

works for simple XML files, for example:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <description>Some XML file description</description>
    <item>
      <title>Some product title</title>
      <brand>Some product brand</brand>
    </item>
    <item>
      <title>Some product title</title>
      <brand>Some product brand</brand>
    </item>
  </channel>
</rss>

puts final_path # => "/rss/channel/item"

But when it gets more complicated, how should I then approach the challenge? For example with this one:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <description>Some XML file description</description>
    <item>
      <titles>
        <title>Some product title</title>
      </titles>
      <brands>
        <brand>Some product brand</brand>
      </brands>
    </item>
    <item>
      <titles>
        <title>Some product title</title>
      </titles>
      <brands>
        <brand>Some product brand</brand>
      </brands>
    </item>
  </channel>
</rss>
  • Not sure I understand - what should analysis of the second XML example produce, and why? Your code appears to construct a path (that may not exist) consisting of anything that has child elements . . . Commented Mar 28, 2013 at 19:20
  • Hi Niel. My assumption is that the text nodes that lie at the deepest level must be relevant. So the code is supposed to determine the path that leads to the most deeply nested text nodes. Why would the determined path not exist? Commented Mar 28, 2013 at 19:32
  • Sorry, I forgot to answer the first question: I am not certain what the analysis of the second XML should produce. Either multiple paths, e.g. "/rss/channel/item/titles" + "/rss/channel/item/brands", or maybe some regex, e.g. /\/rss\/channel\/item\/.*/ Commented Mar 28, 2013 at 19:37
  • If you have two equally deep structures (e.g. if, as well as the /rss/channel/item path with children, there was /rss/channel/owner in your first file), both will get added to your array and you would see something like "/rss/channel/item/owner". Commented Mar 28, 2013 at 19:37
  • If I take "My assumption is that the text nodes that lies at the deepest level must be relevant." at face value, it would be straightforward to give you something that lists the containers for the deepest structures; your code would only need slight changes. Are you sure that's what you want at this stage? Commented Mar 28, 2013 at 19:41

2 Answers

If you are looking for a list of deepest "parent" paths in the XML, there is more than one way to view that.

Although I think your own code could be adjusted to achieve the same output, I was convinced the same thing could be achieved with XPath. Part of my motivation is to shake the rust off my XML skills (I have not used Nokogiri yet, but I will need to do so professionally soon). So here is how to get all parent paths that have exactly one level of child elements beneath them, using XPath:

xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }

The output of this for your second example file is:

/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands

If you take this list, gsub out the indexes, and make the array unique, the result looks a lot like the output of your loop:

paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
paths = paths.map { |path| path.gsub(/\[[0-9]+\]/, '') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
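The index-stripping step can be tried on its own, without parsing any XML. The following is a minimal sketch in plain Ruby; `normalize_paths` is a hypothetical helper name, and the input array is simply the indexed output listed above, hard-coded for illustration.

```ruby
# Hypothetical helper: strips positional predicates such as "[2]" from
# XPath strings and removes the duplicates that result.
def normalize_paths(paths)
  paths.map { |path| path.gsub(/\[\d+\]/, '') }.uniq
end

# The indexed paths from the second example file, hard-coded here.
indexed = [
  '/rss/channel/item[1]/titles',
  '/rss/channel/item[1]/brands',
  '/rss/channel/item[2]/titles',
  '/rss/channel/item[2]/brands'
]

p normalize_paths(indexed)
# => ["/rss/channel/item/titles", "/rss/channel/item/brands"]
```

One design note: the non-bang `uniq` always returns an array, whereas `uniq!` returns `nil` when there was nothing to remove, so the non-bang form is the safer choice whenever the return value is used.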

Or in one line:

paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
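To make the predicate `* and not(*/*)` more concrete, here is a rough sketch of the same selection written as a plain recursive traversal: keep every element that has element children but no element grandchildren. It is an illustration only, not the Nokogiri approach above. It uses REXML (which ships with Ruby) rather than Nokogiri so that it needs no extra gem, and `deepest_parent_paths` and `SAMPLE_XML` are names made up for this example.

```ruby
require 'rexml/document'

# A shortened copy of the second example file (assumption: trimmed for brevity).
SAMPLE_XML = <<~DOC
  <?xml version="1.0" encoding="utf-8"?>
  <rss version="2.0">
    <channel>
      <title>Some XML file title</title>
      <item>
        <titles><title>Some product title</title></titles>
        <brands><brand>Some product brand</brand></brands>
      </item>
      <item>
        <titles><title>Some product title</title></titles>
        <brands><brand>Some product brand</brand></brands>
      </item>
    </channel>
  </rss>
DOC

# Collects index-free paths of elements that have element children
# but whose children have no element children of their own.
def deepest_parent_paths(element, prefix = '', found = [])
  path = "#{prefix}/#{element.name}"
  children = element.elements.to_a
  if children.any? && children.none? { |child| child.elements.to_a.any? }
    found << path unless found.include?(path)
  end
  children.each { |child| deepest_parent_paths(child, path, found) }
  found
end

doc = REXML::Document.new(SAMPLE_XML)
puts deepest_parent_paths(doc.root)
# /rss/channel/item/titles
# /rss/channel/item/brands
```

Because the path is built from element names alone, no positional indexes appear and no gsub step is needed; duplicates are skipped as they are found.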

1 Comment

That is just the most beautiful thing I have ever seen. Thanks very much.
I've created a library to build XPath expressions.

xpath = Jini.new
        .add_path('parent')
        .add_path('child')
        .add_all('toys')
        .add_attr('name', 'plane')
        .to_s
puts xpath # => /parent/child//toys[@name="plane"]

1 Comment

When linking a site/repo/blog that is your own, you must explicitly disclose that fact. Otherwise, your post is likely to be flagged as spam.
