1

I have an XML file like http://www.heureka.cz/direct/xml-export/shops/heureka-sekce.xml. I can't change it because it isn't mine; I'm just parsing it from another website.

Here's the XML (with structure):

<HEUREKA>
  <CATEGORY>
    <CATEGORY_ID>971</CATEGORY_ID>
    <CATEGORY_NAME>Auto-moto</CATEGORY_NAME>
    <CATEGORY>
      <CATEGORY_ID>881</CATEGORY_ID>
      <CATEGORY_NAME>Alkohol testery</CATEGORY_NAME>
      <CATEGORY_FULLNAME>Heureka.cz | Auto-moto | Alkohol testery</CATEGORY_FULLNAME>
    </CATEGORY>
  </CATEGORY>
</HEUREKA>

Thanks to everyone who commented, here is the final code:

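def heureka
  require 'open-uri'
  require 'nokogiri'
  # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
  doc = Nokogiri::XML(URI.open("http://www.heureka.cz/direct/xml-export/shops/heureka-sekce.xml"))

  # Select only CATEGORY elements that have a CATEGORY_FULLNAME child
  doc.xpath("//CATEGORY[CATEGORY_FULLNAME]").each do |node|
    # Relative XPath reads only this node's own CATEGORY_NAME, not nested ones
    record = Heureka.where("name" => node.xpath('CATEGORY_NAME').inner_text).first_or_initialize
    record.fullname = node.xpath('CATEGORY_FULLNAME').inner_text
    record.name = node.xpath('CATEGORY_NAME').inner_text
    record.save unless record.fullname.blank?
  end
end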
2
  • Please show what you want to get and how it works now. The document has several levels of categories, so you should check for that in each loop. Commented May 21, 2014 at 10:24
  • @zishe I expect my code to traverse every level and, if CATEGORY_FULLNAME is present, store it in the DB. Commented May 28, 2014 at 12:49

3 Answers

6

Using Nokogiri here seems a little oversized. In a Rails app you can do this with Net::HTTP and ActiveSupport's Hash.from_xml:

require 'net/http'
xml_content = Net::HTTP.get(URI.parse('http://www.heureka.cz/direct/xml-export/shops/heureka-sekce.xml'))
data = Hash.from_xml(xml_content)

Then you're able to access the data as a hash object.
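To keep only the categories with a filled CATEGORY_FULLNAME, the nested hash can be walked recursively. The following is a minimal sketch, not part of the original answer: the `data` literal is a hand-written stand-in for what Hash.from_xml would be expected to return for the sample XML (assuming a single child element becomes a Hash and repeated elements become an Array), and `categories_with_fullname` is a hypothetical helper name.

```ruby
# Hand-written stand-in for Hash.from_xml(xml_content) on the sample XML.
# Assumption: a single child element becomes a Hash, repeated ones an Array.
data = {
  "HEUREKA" => {
    "CATEGORY" => {
      "CATEGORY_ID" => "971",
      "CATEGORY_NAME" => "Auto-moto",
      "CATEGORY" => {
        "CATEGORY_ID" => "881",
        "CATEGORY_NAME" => "Alkohol testery",
        "CATEGORY_FULLNAME" => "Heureka.cz | Auto-moto | Alkohol testery"
      }
    }
  }
}

# Hypothetical helper: collects every category, at any depth, whose
# CATEGORY_FULLNAME is present and non-blank.
def categories_with_fullname(node, found = [])
  case node
  when Array
    # Repeated <CATEGORY> siblings arrive as an Array of hashes
    node.each { |child| categories_with_fullname(child, found) }
  when Hash
    fullname = node["CATEGORY_FULLNAME"].to_s.strip
    found << { name: node["CATEGORY_NAME"], fullname: fullname } unless fullname.empty?
    # Descend into nested categories; a missing key yields nil and is ignored
    categories_with_fullname(node["CATEGORY"], found)
  end
  found
end

filled = categories_with_fullname(data["HEUREKA"])
filled.each { |c| puts "#{c[:name]} => #{c[:fullname]}" }
```

Each entry in `filled` could then be passed to the model in place of the Nokogiri nodes.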


1 Comment

Hi, I'm sorry, but I'm a newbie. Can you tell me how to save only the categories that have <CATEGORY_FULLNAME> filled in?
1

If we indent your XML you will see the problem:

<HEUREKA>
  <CATEGORY>
    <CATEGORY_ID>971</CATEGORY_ID>
    <CATEGORY_NAME>Auto-moto</CATEGORY_NAME>
    <CATEGORY>
      <CATEGORY_ID>881</CATEGORY_ID>
      <CATEGORY_NAME>Alkohol testery</CATEGORY_NAME>
      <CATEGORY_FULLNAME>Heureka.cz | Auto-moto | Alkohol testery</CATEGORY_FULLNAME>
    </CATEGORY>
  </CATEGORY>
</HEUREKA>

The second category node is inside the first category node, so it is also its child. Because of this, node.css('CATEGORY_NAME').inner_text will return both names concatenated (Auto-motoAlkohol testery) for the first node, and only the last one will have the expected data (Alkohol testery).

Fix your XML:

<HEUREKA>
  <CATEGORY>
    <CATEGORY_ID>971</CATEGORY_ID>
    <CATEGORY_NAME>Auto-moto</CATEGORY_NAME>
  </CATEGORY>
  <CATEGORY>
    <CATEGORY_ID>881</CATEGORY_ID>
    <CATEGORY_NAME>Alkohol testery</CATEGORY_NAME>
    <CATEGORY_FULLNAME>Heureka.cz | Auto-moto | Alkohol testery</CATEGORY_FULLNAME>
  </CATEGORY>
</HEUREKA>

And try again...


Update

If you can't change the XML, you can use XPath instead of CSS. A relative XPath expression such as CATEGORY_NAME matches only the immediate children of the node, whereas a CSS selector matches all descendants (deep children):

def heurekacat
  require 'open-uri'
  require 'nokogiri'
  # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
  doc = Nokogiri::XML(URI.open("http://www.heureka.cz/direct/xml-export/shops/heureka-sekce.xml"))
  doc.css("CATEGORY").each do |node|
    # node.xpath (not the undefined `children`) reads only this node's direct children
    record = HeurekaCat.where("name" => node.xpath('CATEGORY_NAME').inner_text).first_or_initialize
    record.category = node.xpath('CATEGORY_FULLNAME').inner_text
    record.name = node.xpath('CATEGORY_NAME').inner_text
    record.save
  end
end

12 Comments

Hi, I'm unable to correct the XML. I have to find a solution for nested categories that saves only the categories with <CATEGORY_FULLNAME> filled in.
@TomasKrmela - added a solution for a case where you can't fix the XML
How do I save to the DB only if CATEGORY_FULLNAME is present on the node?
record.save unless record.category.blank?
Thanks, Uri Agassi, for solving my issue.
0

Simply change one line:

doc.css("CATEGORY").each do |node|

to the following:

doc.css("CATEGORY:has(CATEGORY_FULLNAME)").each do |node|

This selects only CATEGORY elements containing a CATEGORY_FULLNAME subelement.

As an alternative, the equivalent XPath:

doc.xpath("//CATEGORY[CATEGORY_FULLNAME]").each do |node|
