I have a program that parses various file formats to find localizable strings (pretty much GetText). I'm looking for a regex that would get "TEXT TO TRANSLATE" from within a specific opening and closing tag. I had a working regex, but the following example broke it, thanks to the IsVisible call.

<mw:Translate runat="server" Visible='<%# IsVisible() %>'>
TEXT TO TRANSLATE
</mw:Translate>

This is what I have so far, but I got stuck with it. Any help? I have described my (wrongly regexed) intentions in the // comments:

(?s)                   //single-line flag: . also matches newlines

\<mw\:Translate        //opening <mw:Translate> tag

(?:(?![^"']+\s*\>)+)   //match anything but > preceded by " or ' 
                       //with any whitespace after it
(?:["']+\s*)\>         //match > preceded by " or ' with any 
                       //whitespace after it

\s*                    //match any whitespace 
                       //(for trimming any whitespace around the text)
(?<text>.*?)           //capturing group for the localizable text
\s*                    //match any whitespace 

\</mw\:Translate\>     //match closing tag

The problem is probably in the opening tag. I'm trying to match the closing bracket > only when it is preceded by " or ' (with optional whitespace in between), because otherwise it's either part of something like %> or the markup is not valid ASP.NET.
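
To make the failure concrete, here is a minimal C# sketch. The naive pattern in it is an assumption for illustration (the previously working regex is not shown above): it treats the first > as the end of the opening tag, so the %> inside the Visible attribute cuts the tag short and the leftover '> leaks into the captured text.

```csharp
using System;
using System.Text.RegularExpressions;

// The sample from the question, written with explicit \n line breaks.
string sample =
    "<mw:Translate runat=\"server\" Visible='<%# IsVisible() %>'>\n" +
    "TEXT TO TRANSLATE\n" +
    "</mw:Translate>";

// Hypothetical naive pattern (not the original regex): [^>]* stops at the first '>'.
string naivePattern = @"(?s)<mw:Translate[^>]*>\s*(?<text>.*?)\s*</mw:Translate>";

string naiveText = Regex.Match(sample, naivePattern).Groups["text"].Value;

// Show whether the capture begins with the stray "'>" left over from the attribute.
Console.WriteLine(naiveText.StartsWith("'>", StringComparison.Ordinal));
Console.WriteLine(naiveText);
```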

EDIT 1: Please read the question before jumping to conclusions. This is not HTML but ASP.NET, which cannot possibly be parsed well by an HTML parser. I'm also targeting something very specific. Correction: people seem to agree it can be parsed with HtmlAgilityPack, but I don't really want to use it, because I'd rather not rely on an external lib for one simple use case.

EDIT 2: mw:Translate cannot be nested. It simply won't compile because of how the mw:Translate is programmed.

EDIT 3: Clarification of edits.

EDIT 4: A self-closing mw:Translate is not permitted.

EDIT 5: HTML inside mw:Translate is as valid as any other text on an ASP.NET page.

EDIT 6: Answered it myself; the regex I need may be a bit more complicated (but not because of any relation to HTML), see below.

  • thanks to regex you have broken your code...use an html parser Commented Jul 8, 2013 at 10:54
  • Don't use regex to parse HTML; use HtmlAgilityPack Commented Jul 8, 2013 at 10:54
  • there must be a reason why Regex is strongly discouraged to process HTML tags stackoverflow.com/questions/1732348/… Commented Jul 8, 2013 at 10:54
  • Guys, ASP.NET is not HTML. It's not even XML. Commented Jul 8, 2013 at 10:55
  • I have no idea why I should use anything called "HtmlAgilityPack" to parse ASP.NET. What are the pitfalls here? There are very specific and concrete rules in this use case. I'm very well aware of the problems with using regex on HTML, but this is simply not the case here. If you think it is, you are free to make a fool of me in a valid answer. Commented Jul 8, 2013 at 11:02

3 Answers

Even though this is ASP.NET and not HTML, you can use HtmlAgilityPack to parse it.

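var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is the aspx document text
// SelectNodes returns null (not an empty collection) when nothing matches
var translatableTextNodes = 
    doc.DocumentNode.SelectNodes("//text()[contains(., 'TEXT TO TRANSLATE')]");
if (translatableTextNodes != null)
    foreach (var textNode in translatableTextNodes)
        // a text node's own Name is "#text", so report the enclosing element instead
        Console.WriteLine("Node:[{0}] Text:{1}", textNode.ParentNode.Name, textNode.InnerText);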

Output for a sample page containing one of your server controls with the text TEXT TO TRANSLATE:

Node:[mw:translate] Text:
TEXT TO TRANSLATE

Even if you modify your regex, some problems remain:

  • It won't work if there are other tags inside (next to impossible to solve with a regex)
  • ASP.NET can have self-closing tags like <a href=''/>

Use HtmlAgilityPack.

You can use this code to retrieve the text using HtmlAgilityPack:

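// requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// HtmlAgilityPack lower-cases element names and keeps the prefix ("mw:translate"),
// so filter descendants by that full name; the XPath "//Translate" would match nothing
var itemList = doc.DocumentNode.Descendants("mw:translate")
                  .Select(p => p.InnerText)
                  .ToList();

// itemList now contains the content of every Translate tag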

4 Comments

These tags cannot be nested. It won't compile. I agree I should have written that.
@Motig By nested tags I mean other tags like <b>, <i>, not Translate
In that case, they are part of the text and are parsed as text. <b> is no different from "asdf". Anything inside those tags is a string. If there's a closing tag anywhere, even as part of any crazy invalid HTML, as long as it's a closing tag, it's the valid end of the text.
@Motig OK, if you want it, use this regex: (?s)<mw:Translate(.*?["']|\s*)\s*>(?<text>.*?)</mw:Translate>

I'd try matching the list of attributes, assuming each attribute value is wrapped in double or single quotes.
This assumption doesn't hold for all HTML, but it may work for you:

<mw:Translate       #opening <mw:Translate> tag
# Match attributes
(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'))?)*
\s*
>                   #match >
\s*
(?<text>.*?)        #capturing group for the localizable text
\s*                 #match any whitespace 
</mw:Translate>     #match closing tag

Working example: http://regexhero.net/tester/?id=5834b4f1-095b-4af6-a0da-d1fe119778bc
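
A sketch of how the pattern above might be run from C#. It assumes the regex is compiled with RegexOptions.IgnorePatternWhitespace (so the # comments and layout whitespace are ignored) and RegexOptions.Singleline (so . also matches newlines, which the multi-line text needs); neither option is stated in the answer itself. The variable names and the sample string are for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question, written with explicit \n line breaks.
string aspx =
    "<mw:Translate runat=\"server\" Visible='<%# IsVisible() %>'>\n" +
    "TEXT TO TRANSLATE\n" +
    "</mw:Translate>";

// The answer's pattern; inside a verbatim string a double quote is written as "".
string pattern = @"
    <mw:Translate                                  # opening tag name
    (?:\s+\w+(?:\s*=\s*(?:""[^""]*""|'[^']*'))?)*  # attributes with quoted values
    \s*>                                           # end of the opening tag
    \s*(?<text>.*?)\s*                             # localizable text, trimmed
    </mw:Translate>                                # closing tag";

var translate = new Regex(
    pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

// Print the captured text of every mw:Translate block found in the input.
foreach (Match m in translate.Matches(aspx))
    Console.WriteLine(m.Groups["text"].Value);
```

Because the quoted attribute values are consumed as whole units, the > inside %> can no longer end the opening tag. Note that \w+ would not match attribute names containing - or :, so the attribute part may need widening for other markup.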

14 Comments

Won't work with nested Translate tags; sometimes there's no closing tag
@Anirudh - Is that a requirement?
OP confirmed there are no nested tags
@Jerry It's not about nested tags but ending tags; ASP.NET can have self-closing tags
ASPX may also have inline C# code, server-side tags that break client-side tags, etc. (think about a <ul>, <li><li> and </ul> in repeater templates, for example). Trying to parse these as HTML is more dangerous than a regex, in my opinion. You need an ASPX parser.