I have an input XML file (not HTML) and I want to change some of its tags. Wherever a "p" node is a child of a "step" node, I need to remove the "p" tag but keep its content, so that the content belongs directly to "step". The output should also be an XML file, and I am using R.

<h2>
<h4>
<stepgrp type="ordered-legal">
<figgrp-inlist>
<step>
<graphic version="1" object-id="4188" />
<p>Install the clutch spring compressor.</p>
</step>
<stepgrp2 type="unordered-bullet">
<step>
<p>One piece case  use J414202
Disc.</p>
</step>
<step>
Two piece case  use J42628 Disc.
</step>
</stepgrp2>
</figgrp-inlist>
<figgrp-inlist>
<step>
<graphic version="1" object-id="59269" />
<p>Tighten the clutch spring compressor.</p>
</step>
<step>
Remove the low/reverse clutch retainer ring.
</step>
<step>
Remove the low/reverse the clutch spring assembly.
</step>
</figgrp-inlist>
<figgrp-inlist>
<step>
<graphic version="1" object-id="4190" />
<p>Blow compressed air into the case passage to remove the
low/reverse clutch piston.</p>
</step>
</figgrp-inlist>
</stepgrp>
</h4>
</h2>

I have written a for loop that finds the line positions of the "p" and "step" nodes, but I want something dynamic: it should find every "p" node that is a child of a "step" node and remove the tag while keeping its content.
Thank you!

2 Answers

1

Assuming the variable xml contains your example:

# xml <- '<h2>...'
library(XML)
doc <- xmlParse(xml, asText = TRUE)
invisible(removeNodes(doc['//step/p']))
saveXML(doc, file = tf <- tempfile(fileext = ".xml"))
# <?xml version="1.0"?>
# <h2>
#   <h4>
#     <stepgrp type="ordered-legal">
#       <figgrp-inlist>
#         <step>
#           <graphic version="1" object-id="4188"/>
#         </step>
#         <stepgrp2 type="unordered-bullet">
#           <step/>
#           <step>
# Two piece case  use J42628 Disc.
# </step>
#         </stepgrp2>
#       </figgrp-inlist>
#       <figgrp-inlist>
#         <step>
#           <graphic version="1" object-id="59269"/>
#         </step>
#         <step>
# Remove the low/reverse clutch retainer ring.
# </step>
#         <step>
# Remove the low/reverse the clutch spring assembly.
# </step>
#       </figgrp-inlist>
#       <figgrp-inlist>
#         <step>
#           <graphic version="1" object-id="4190"/>
#         </step>
#       </figgrp-inlist>
#     </stepgrp>
#   </h4>
# </h2>

The output is written to a temporary file whose path is stored in tf. Note that, as the output above shows, removeNodes also drops the text inside each p node.


Addendum

Regarding your comment, try:

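library(XML)
doc <- xmlParse(xml, asText = TRUE)
nodes <- doc['//step']
# vapply() always returns a logical vector, so which() also works when no step is found
idx <- which(vapply(nodes, function(x) 'p' %in% names(xmlChildren(x)), logical(1)))
# keep the text of every step that has a p child
vals <- sapply(nodes[idx], xmlValue)
# remove the p nodes, then add the saved text back to the same steps
invisible(removeNodes(doc['//step/p']))
for (x in seq_along(vals))
  newXMLTextNode(text = vals[x], parent = doc['//step'][[idx[x]]])
# write the result to a temporary file, as before
saveXML(doc, file = tf <- tempfile(fileext = ".xml"))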

There may be a more elegant version, though.
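If the same steps have to run over many files, it may help to wrap them in a function that does nothing when a file has no matching nodes. The following is only a sketch under that assumption: unwrap_step_p, infile and outfile are hypothetical names, and the two tiny demo files are made up for illustration. It uses vapply so that which() receives a logical vector even when a file contains no step nodes.

```r
library(XML)

# Hypothetical helper: unwrap every <p> that is a direct child of <step>
# in one file, write the result to `outfile`, and return the number of
# steps that were changed.
unwrap_step_p <- function(infile, outfile) {
  doc <- xmlParse(infile)
  nodes <- doc["//step"]
  # logical(0) when the file has no step nodes, so which() does not fail
  has_p <- vapply(nodes, function(x) "p" %in% names(xmlChildren(x)), logical(1))
  idx <- which(has_p)
  if (length(idx) > 0) {
    # save the text, drop the p nodes, then add the text back to each step
    vals <- sapply(nodes[idx], xmlValue)
    invisible(removeNodes(doc["//step/p"]))
    for (i in seq_along(idx))
      newXMLTextNode(text = vals[i], parent = nodes[[idx[i]]])
  }
  saveXML(doc, file = outfile)
  invisible(length(idx))
}

# Demo on two made-up files: one with a p inside a step, one without steps.
in1 <- tempfile(fileext = ".xml")
in2 <- tempfile(fileext = ".xml")
writeLines("<stepgrp><step><p>Install the clutch spring compressor.</p></step></stepgrp>", in1)
writeLines("<root><note>No steps here.</note></root>", in2)
out1 <- tempfile(fileext = ".xml")
out2 <- tempfile(fileext = ".xml")
n1 <- unwrap_step_p(in1, out1)
n2 <- unwrap_step_p(in2, out2)
```

To process a whole folder, the function could be called in a loop over the paths returned by list.files(); the folder name and the naming of the output files are left to you.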

9 Comments

Hey! I want to remove only the "p" node, not its content. Right now the content is removed as well. Is there a way to do that? It should be as if the "p" tag simply disappeared.
nodes gets all step nodes, idx holds the indices of the nodes that have a p child, and vals gets their texts. Then the p nodes are removed and new text nodes are added at the appropriate positions.
It works well when I apply it to a single file, but when I put it in a function and apply it to many files, it throws an error: Error in which(sapply(nodes, function(z) "step" %in% names(xmlChildren(z)))) : argument to 'which' is not logical
Maybe you got which(NULL). I can't debug it without the code and the files.
I wrapped your code in tryCatch and it worked. The error came from the files in which those nodes were not present.
0

Here is the answer my friend came up with, and it works:

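# Read the file line by line ('xml' is the path to the input file).
t1 <- readLines('xml')

# Collapse to one string, grab everything from the first <step> to the
# last </step>, and split that span back into lines.
t2 <- paste(t1, collapse = "\n")
t3 <- regmatches(t2, regexpr('<step>.+</step>', t2))
t4 <- unlist(strsplit(t3, "\n"))

# TRUE for each line of the file whose text also occurs inside that span.
torf <- t1 %in% t4

# Remove the <p> and </p> tags from a character vector.
removep <- function(x){
  x1 <- gsub("<p>","",x)
  x2 <- gsub("</p>","",x1)
  return (x2)
}

# Strip the tags on the flagged lines only; keep the original line if
# stripping would leave it empty, and leave all other lines untouched.
t5 <- ifelse(torf & removep(t1) != "", removep(t1), t1)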
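The same line-based idea can also be sketched as a small function that edits each step element separately, so that a p outside of any step is left alone. This is only an illustration under strong assumptions: the tags are written exactly as <step>, <p> and </p> (no attributes), step elements are not nested inside each other, and the name strip_p_in_steps and the sample lines are made up. Unlike the XML-based answer, it also removes p tags nested deeper inside a step, not only direct children.

```r
# Hypothetical helper: remove <p> and </p> inside every <step>...</step>
# element of a character vector of lines, using base R only.
strip_p_in_steps <- function(lines) {
  txt <- paste(lines, collapse = "\n")
  # (?s) lets "." match newlines; ".*?" stops at the first closing </step>
  m <- gregexpr("(?s)<step>.*?</step>", txt, perl = TRUE)
  # replace each matched step element with a copy that has no p tags
  regmatches(txt, m) <- lapply(regmatches(txt, m),
                               function(s) gsub("</?p>", "", s))
  strsplit(txt, "\n", fixed = TRUE)[[1]]
}

# Made-up sample: one p inside a step, one p outside of any step.
sample_lines <- c("<step>",
                  "<p>Install the clutch spring compressor.</p>",
                  "</step>",
                  "<note>",
                  "<p>Keep me.</p>",
                  "</note>")
result <- strip_p_in_steps(sample_lines)
```

The result can then be written back to disk with writeLines(). A regular expression does not understand XML structure, so for anything beyond simple, regular files the parser-based approach is the safer choice.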
