I have an XML-like document that is pre-processed by a system outside my control. The format of the document looks like this:

 <template>
Hello, there <RECALL>first_name</RECALL>.  Thanks for giving me your email.  
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>.  I have just sent you something.
</template>

However, I only receive the content between the <template> tags, as a plain text string.

I would like to extract the tags without specifying them ahead of time when parsing. I can do this with the Crack gem, but only if there is a single tag and it sits at the end of the string.

With Crack, I can pass in a string like

string = "<SETPROFILE><NAME>email</NAME><VALUE>[email protected]</VALUE></SETPROFILE>"

and my output from Crack is:

{"SETPROFILE"=>{"NAME"=>"email", "VALUE"=>"[email protected]"}}

Then I can use a case statement for the possible values I care about.
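
For illustration, that dispatch step might look roughly like the sketch below. The method names (`set_profile`, `recall`, `dispatch`) are hypothetical stand-ins, and the email address is a placeholder; the hash only mirrors the shape of the Crack output shown above.

```ruby
# Hypothetical handlers; stand-ins for whatever each tag should trigger.
def set_profile(name, value)
  "profile[#{name}] = #{value}"
end

def recall(field)
  "recall #{field}"
end

# Dispatch on the top-level key of a Crack-style hash.
def dispatch(parsed)
  parsed.map do |tag, body|
    case tag
    when "SETPROFILE" then set_profile(body["NAME"], body["VALUE"])
    when "RECALL"     then recall(body)
    else "unhandled tag: #{tag}"
    end
  end
end

# Same shape as the Crack output above; the address is a placeholder.
parsed = { "SETPROFILE" => { "NAME" => "email", "VALUE" => "user@example.com" } }
results = dispatch(parsed)
```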

Given that the string may contain multiple <tags>, and that they will not necessarily be at the end of the string, how can I easily parse out the node names and their values, similar to what I do with Crack?

These tags also need to be removed. I would like to continue to use the excellent suggestion from @TinMan.

It works perfectly once I know the name of the tag. The number of tags will be finite. I send the tag to the appropriate method once I know it, but it needs to get parsed out easily first.

  • This is a bit confusing. What do you want as output? Commented Dec 28, 2014 at 20:25
  • Output similar to Crack's would be great: parsed keys and their associated values. Commented Dec 28, 2014 at 22:32

1 Answer

Using Nokogiri, you can treat the string as a DocumentFragment, then find the embedded nodes:

require 'nokogiri'

doc = Nokogiri::XML::DocumentFragment.parse(<<EOT)
Hello, there <RECALL>first_name</RECALL>.  Thanks for giving me your email.  
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>.  I have just sent you something.
EOT

nodes = doc.search('*').each_with_object({}){ |n, h|
  h[n] = n.text
}

nodes # => {#<Nokogiri::XML::Element:0x3ff96083b744 name="RECALL" children=[#<Nokogiri::XML::Text:0x3ff96083a09c "first_name">]>=>"first_name", #<Nokogiri::XML::Element:0x3ff96083b5c8 name="SETPROFILE" children=[#<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>, #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>=>"", #<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">=>""}

Or, more legibly:

nodes = doc.search('*').each_with_object({}){ |n, h|
  h[n.name] = n.text
}

nodes # => {"RECALL"=>"first_name", "SETPROFILE"=>"email", "NAME"=>"email", "VALUE"=>"", "star"=>""}

Getting the content of a particular tag is easy then:

nodes['RECALL'] # => "first_name"

Iterating over all the tags is also easy:

nodes.keys.each do |k| 
  ... 
end

You can even replace a tag and its content with text:

doc.at('RECALL').replace('Fred')
doc.to_xml # => "Hello, there Fred.  Thanks for giving me your email.  \n<SETPROFILE>\n  <NAME>email</NAME>\n  <VALUE>\n    <star/>\n  </VALUE>\n</SETPROFILE>.  I have just sent you something.\n"

How to replace the nested tags is left to you as an exercise.


1 Comment

Thank you! Will try this out shortly
