
I am getting a whole HTML page from an Ajax request as text (xmlhttp.responseText).

I then filter that text to extract an HTML form and everything inside it.

I wrote a regex:

text.match(/(<form[\W\w]*<\/form>)/gim)

I am not an expert in regex, so I can't be sure whether it will work in every scenario and capture everything inside the form tag.

Is there a better way to say "everything" in a regex, so that the regex looks like this?

 text.match(/(<form[__everything_syntax_here__]*<\/form>)/gim)
  • Are you looking for the internal form tag stuff, or from <form..> to </form>, or both? Commented Jan 29, 2015 at 8:31
  • @sln Everything inside the <form> ... </form> tag, plus the opening and closing tags themselves. Commented Jan 29, 2015 at 8:33
  • See this: stackoverflow.com/questions/4288102/… Commented Jan 29, 2015 at 8:36
  • I would discourage you from using regexes for this at all. You can use responseXML or make a documentFragment or hidden <div> and approach the response as what it is, an HTML page with a DOM tree. So then you simply get parsedDom.getElementsByTagName('form')[0] and do what you want with it. Commented Jan 29, 2015 at 8:41
  • @funkwurm Thanks for your concern. I tried that and it failed: the HTML comes with so many complex tags, meta tags, and internal scripts that the default parser of the old browser (currently fighting with IE5 :O) could not parse them. That is why I am trying to help the old person here. Commented Jan 29, 2015 at 8:47

2 Answers


Having to deal with IE 5, you poor soul.

A quick answer to your question, "Is [\W\w] really the best way to match absolutely everything?"

Yes. JavaScript engines from before ES2018, which certainly includes IE 5, do not support the s modifier that makes . match newlines. [\W\w] basically tells the regex: "Match anything that is a word character, or anything that isn't a word character", and every character falls into one of those two categories.
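As a minimal sketch of the difference (the sample string is made up for illustration), compare . with [\W\w] on a form whose markup spans several lines:

```javascript
// Sample input is an assumption: a form whose markup spans several lines.
var text = 'before\n<form action="">\n  <input type="text" />\n</form>\nafter';

// "." stops at line terminators, so this pattern can never reach </form>.
var withDot = text.match(/<form.*<\/form>/i);

// [\W\w] is the union of "non-word" and "word" characters, so it also
// consumes the "\n" characters between the tags.
var withClass = text.match(/<form[\W\w]*<\/form>/i);

console.log(withDot);      // null
console.log(withClass[0]); // the whole form, opening and closing tags included
```

The m flag in the question's pattern only changes how ^ and $ behave, so it has no effect on either result.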

But if you want a more reliable solution that copes with <!-- HTML comments --> and multiple forms on a page, the best approach is something like the one explained in this SO answer, adapted for HTML.
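To make the multiple-forms and comment problems concrete, here is a small sketch. The sample page is invented for illustration, and the lazy variant is shown only for comparison; it is not taken from the question.

```javascript
// Sample page is an assumption: two live forms plus one commented-out form.
var page =
  '<form id="a"><input /></form>\n' +
  '<p>text between the forms</p>\n' +
  '<!-- <form id="old"></form> -->\n' +
  '<form id="b"><input /></form>';

// The pattern from the question is greedy, so it runs from the first <form
// to the last </form> on the page and returns one big match.
var greedy = page.match(/(<form[\W\w]*<\/form>)/gim);

// Adding "?" makes the quantifier lazy, so each match stops at the nearest
// </form>. The commented-out form is still picked up, though.
var lazy = page.match(/<form[\W\w]*?<\/form>/gi);

console.log(greedy.length); // 1
console.log(lazy.length);   // 3
console.log(lazy[1]);       // <form id="old"></form>
```

So a lazy quantifier separates the forms but cannot tell a live form from one inside a comment, which is the gap the longer pattern tries to close.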

This is what I would use:

<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)


Look at the Debuggex Demo to see what matches you actually get. In JavaScript you can then inspect the first capture group. If it's empty, the match was only there to consume a commented-out form, as explained here.
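A rough sketch of how that could look in code. The helper name extractForms and the sample page are assumptions made for illustration; the pattern is the one above, with the last / escaped so that it is valid inside a regex literal.

```javascript
// Hypothetical helper: collects every form that is not inside an HTML comment.
function extractForms(html) {
  var pattern = /<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)/gi;
  var forms = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    // Group 1 is only filled when the <form ...> alternative matched.
    // For a plain HTML comment it is empty, so that match is skipped.
    if (match[1]) {
      forms.push(match[1]);
    }
  }
  return forms;
}

// Sample page is an assumption: two live forms plus one commented-out form.
var page =
  '<form id="a"><input /></form>\n' +
  '<p>text between the forms</p>\n' +
  '<!-- <form id="old"></form> -->\n' +
  '<form id="b"><input /></form>';

var forms = extractForms(page);
console.log(forms.length); // 2
console.log(forms[0]);     // <form id="a"><input /></form>
console.log(forms[1]);     // <form id="b"><input /></form>
```

The comment alternative comes first in the pattern, so a commented-out form is consumed as a comment before the form alternative gets a chance to match it.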


3 Comments

It's even more than I need. Thanks.
Matches <formica><form></form> and <form a="</form>"></form> and <script>var a="<form>"</script><form></form>
@sln which is indeed why you shouldn't parse HTML with regexes. But if the use-case is 1 user that is stuck on a slow IE 5 and so you can't use DOM manipulations or a server-side solution, this probably does a good job. This extended version addresses your specific matches but you'll always run into something. For instance nested forms, improper <form action="forgottenquote.php> and what if the <form> is actually entirely inside a JavaScript variable so you would want the content of <script>var a="<form>"; </script>?

Try this:

// Returns the markup with every <form> element removed.
function stripForm(s) {
  var div = document.createElement('div');
  div.innerHTML = s;
  var forms = div.getElementsByTagName('form');
  // The collection is live, so remove from the end backwards.
  var i = forms.length;
  while (i--) {
    forms[i].parentNode.removeChild(forms[i]);
  }
  return div.innerHTML;
}

// Returns the inner markup of every <form> element, in document order.
function getForm(s) {
  var div = document.createElement('div');
  div.innerHTML = s;
  var forms = div.getElementsByTagName('form');
  var ret = '';
  for (var i = 0; i < forms.length; i++) {
    // innerHTML excludes the <form> tags themselves; outerHTML includes them.
    ret += forms[i].innerHTML;
  }
  return ret;
}

var a = 'before Form <form action="" method="post"> <input type="text" /> <input type="text" /> <input type="text" /> </form><br/> after form';
alert(getForm(a));
alert(stripForm(a));
console.log(stripForm(a));

Demo

1 Comment

Yes, that makes good sense. But as I said, the whole HTML page comes back as the response, so it may include tags like <html>, <meta>, <head>, and <body>, and even internal scripts and styles. So I don't think it is a good idea to set the whole text as the innerHTML of a div and then parse it.
