I'm using an XML/HTML parser to parse HTML files that contain hidden, commented-out text for translation, namely X and Y below.

<!-- Title: “ X ” Tags: “ Y ” -->

Which XPath expression would best match X and Y? The //comment() expression matches the whole node, but I need to match the two occurrences of text between the quotes.

I guess one would need a combination of XPath and regular expressions to do that, but I'm not sure how to tackle it.

  • Are you using JavaScript? Then please mention that or add a tag or both. Commented Oct 12, 2012 at 12:38
  • What language are you implementing this in? Commented Oct 12, 2012 at 12:38

1 Answer

I assume that the quotes in the comment are all the same regular quote character ", not the typographically distinct opening and closing quotes that appear when this question is displayed.

In case this assumption is wrong, simply replace the standard quote in the below expressions with the respective quote.
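
If the quote style is not known in advance, one alternative outside pure XPath 1.0 is to pull out the comment text and apply a regular expression that accepts either quote style. The following is only a minimal Python sketch under stated assumptions: the input is well-formed XML (real-world HTML would need an HTML parser instead), Python 3.8+ is available, and the names DOC, QUOTED and values are illustrative, not part of the question.

```python
import re
import xml.etree.ElementTree as ET

# Sample document (assumed); the comment deliberately mixes typographic
# and straight quotes to show that both styles are handled.
DOC = """<html>
  <body>
    Hello.
<!-- Title: “ X ” Tags: " Y " -->
  </body>
</html>"""

# Keep comment nodes in the tree (insert_comments requires Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(DOC, parser=parser)

# Comment nodes are elements whose tag is the ET.Comment factory function.
# This takes the first comment in document order, like //comment() above.
comment_text = next(el.text for el in root.iter() if el.tag is ET.Comment)

# Match text enclosed in either straight or typographic quotes.
QUOTED = re.compile(r'["“]([^"”]*)["”]')
values = [m.strip() for m in QUOTED.findall(comment_text)]
print(values)
```

Unlike the nested substring expressions, this collects every quoted value in the comment in one pass, at the cost of leaving XPath for the host language.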


Use (if the comment in question is the first one in the document):

substring-before(substring-after(//comment(), '"'), '"')

This produces the string (without the quotes):

" X "

And for the second string in quotes use:

substring-before(
   substring-after(
        substring-after(
               substring-after(//comment(), '"'), 
               '"'), 
        '"'), 
   '"')
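
To see why three nested substring-after calls reach the second quoted string, the two XPath 1.0 functions can be mimicked in Python and traced step by step. This is only an illustrative sketch: the helper names substring_after and substring_before are stand-ins written for this example, and they assume a non-empty separator (XPath itself also defines the empty-separator case).

```python
def substring_after(s: str, sep: str) -> str:
    """Text following the first occurrence of sep, or '' if sep is absent."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


def substring_before(s: str, sep: str) -> str:
    """Text preceding the first occurrence of sep, or '' if sep is absent."""
    head, found, _ = s.partition(sep)
    return head if found else ""


# String value of the comment node from the question, with straight quotes.
comment = ' Title: " X " Tags: " Y " '

# First quoted string: skip past quote 1, then cut at quote 2.
first = substring_before(substring_after(comment, '"'), '"')

# Second quoted string: skip past quotes 1, 2 and 3, then cut at quote 4.
step1 = substring_after(comment, '"')   # after quote 1
step2 = substring_after(step1, '"')     # after quote 2
step3 = substring_after(step2, '"')     # after quote 3
second = substring_before(step3, '"')

print(repr(first), repr(second))
```

Each additional quoted value would need two more substring-after calls, which is why this approach suits a small, fixed number of fields.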

XSLT-based verification (because an XSLT stylesheet must be a well-formed XML document, the quotes in the expressions are replaced with the entity &quot; to avoid errors due to nested quotes):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     "<xsl:copy-of select="substring-before(substring-after(//comment(), '&quot;'), '&quot;')"/>"
=============
   "<xsl:copy-of select=
   "substring-before(substring-after(substring-after(substring-after(//comment(), '&quot;'), '&quot;'), '&quot;'), '&quot;')"/>"
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied against this XML document:

<html>
  <body>
    Hello.
<!-- Title: " X " Tags: " Y " -->
  </body>
</html>

the two XPath expressions are evaluated and the results of these two evaluations are copied to the output (surrounded by quotes to show the exact strings copied):

     " X "
=============
   " Y "