Filtering <form> from HTML text using a regular expression

Question

I am getting a whole HTML page from an Ajax request as text (xmlhttp.responseText).

I then filter that text to extract an HTML form together with everything inside it.

I wrote this regex:

text.match(/(<form[\W\w]*<\/form>)/gim)

As I am not a regex expert, I can't be sure whether it will work in every scenario and capture everything inside the form tag.

Is there a better way to say "everything" in a regex, so that the regex looks like this?

 text.match(/(<form[__everything_syntax_here__]*<\/form>)/gim)

Are you looking for the internal form tag stuff, or everything from <form..> to </form>, or both?
– user557597, Commented Jan 29, 2015 at 8:31
@sln Everything inside the <form> ......</form> tag, and the opening and closing tags too.
– Saif, Commented Jan 29, 2015 at 8:33
I would discourage you from using regexes for this at all. You can use responseXML, or make a documentFragment or hidden <div>, and approach the response as what it is: an HTML page with a DOM tree. Then you simply get parsedDom.getElementsByTagName('form')[0] and do what you want with it.
– asontu, Commented Jan 29, 2015 at 8:41
@funkwurm Thanks for your concern. I tried that and it failed: the HTML comes with so many complex tags, meta tags and internal scripts that the default parser of the old browser (I am currently fighting with IE5 :O) could not parse them. That's why I am trying to help the old person here.
– Saif, Commented Jan 29, 2015 at 8:47

Community · Accepted Answer · 2017-05-23 12:06:31Z

1

Having to deal with IE 5, you poor soul.

A quick answer to your question "Is [\W\w] really the best way to match absolutely everything?":

Yes. JavaScript has traditionally had no s modifier to make . match newlines (one was only added in ES2018, long after IE 5). Writing [\W\w] tells the regex: "Match anything that is a word character, or anything that isn't a word character", and every character falls into one of those two categories.
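To make the difference concrete, here is a minimal sketch comparing `.` with `[\w\W]` on a two-form input. The sample string and the variable names are made up for illustration. It also shows that the greedy `*` in the question's pattern runs to the last `</form>` on the page, which a lazy `*?` avoids.

```javascript
// Hypothetical sample text: one form spans several lines, one sits on a single line.
var text = 'a\n<form id="f1">\n<input>\n</form>\nb\n<form id="f2"></form>\nc';

// "." stops at line breaks, so only the single-line form can match.
var dotMatch = text.match(/<form.*<\/form>/i);

// [\w\W] matches every character, line breaks included, but the greedy "*"
// runs from the first <form to the LAST </form> in the text.
var greedy = text.match(/<form[\w\W]*<\/form>/i);

// The lazy "*?" stops at the first </form> it reaches.
var lazy = text.match(/<form[\w\W]*?<\/form>/i);
```

Here `dotMatch[0]` holds only the second form, `greedy[0]` spans both forms plus the text between them, and `lazy[0]` holds only the first form.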

But if you want a more reliable solution that copes with commented-out forms and multiple forms on a page, the best approach is something like the one explained in this SO answer, adapted for HTML.

This is what I would use:

<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)

Look at the Debuggex Demo to see what matches you actually get. In JavaScript you can then inspect the first capture group. If it is empty, that match was only there to consume a commented-out form, as explained here.
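As a rough sketch of how that capture-group check could look in JavaScript: the helper name `extractForms` and the sample `page` string are hypothetical, and the pattern is the one above with the final `</form>` slash escaped so it is valid in a regex literal.

```javascript
// Pattern from the answer, with every "/" escaped for use in a regex literal.
var formPattern = /<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)/gi;

// extractForms is a hypothetical helper name, not part of the answer.
function extractForms(html) {
  var forms = [];
  var match;
  formPattern.lastIndex = 0; // a /g regex keeps its position between exec() calls
  while ((match = formPattern.exec(html)) !== null) {
    // Group 1 is unset when the match was only an HTML comment.
    if (match[1]) {
      forms.push(match[1]);
    }
  }
  return forms;
}

// Made-up input: a commented-out form, a form containing a comment, and a plain form.
var page = '<!-- <form id="old"></form> -->' +
           '<form id="a"><!-- </form> --><input></form>' +
           '<p>text</p><form id="b"></form>';
var found = extractForms(page);
```

With this input, `found` holds the two live forms; the commented-out one is consumed by the first alternative and skipped because its capture group is empty.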

edited May 23, 2017 at 12:06

answered Jan 29, 2015 at 10:23

asontu

3 Comments

Saif Over a year ago

It's even more than I need. Thanks.

user557597 Over a year ago

Matches <formica><form></form> and <form a="</form>"></form> and <script>var a="<form>"</script><form></form>

asontu Over a year ago

@sln which is indeed why you shouldn't parse HTML with regexes. But if the use-case is 1 user that is stuck on a slow IE 5 and so you can't use DOM manipulations or a server-side solution, this probably does a good job. This extended version addresses your specific matches but you'll always run into something. For instance nested forms, improper <form action="forgottenquote.php> and what if the <form> is actually entirely inside a JavaScript variable so you would want the content of <script>var a="<form>"; </script>?
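The false matches listed in the comments above can be reproduced with a short script. This is only a sketch: the variable names are invented, and the `stricter` pattern is a hypothetical tweak for the `<formica>` case, not the extended version linked in the comment.

```javascript
// The answer's pattern (slashes escaped), without /g so each exec() call
// scans its input from the start.
var pattern = /<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)/i;

// Inputs taken from the comment above.
var tagPrefix = pattern.exec('<formica><form></form>')[1];
// -> '<formica><form></form>' (the match starts at <formica>)
var quotedEnd = pattern.exec('<form a="</form>"></form>')[1];
// -> '<form a="</form>' (cut off inside the attribute value)
var inScript = pattern.exec('<script>var a="<form>"</script><form></form>')[1];
// -> '<form>"</script><form></form>' (the match starts inside the script string)

// Hypothetical tweak: require whitespace or ">" right after "<form" so that
// other tag names beginning with "form" are skipped.
var stricter = /<!--(?:(?!-->)[\w\W])*-->|(<form(?=[\s>])(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)/i;
var fixedPrefix = stricter.exec('<formica><form></form>')[1];
// -> '<form></form>'
```

The lookahead only addresses the first case; the quoted `</form>` and the script string still need attribute-aware and script-aware handling, which is where a real parser becomes the more practical choice.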

Ahosan Karim Asik · Accepted Answer · 2015-01-29 10:18:08Z

1

Try this:

// Returns the markup with every <form> element removed.
function stripForm(s) {
  var div = document.createElement('div');
  div.innerHTML = s;
  var forms = div.getElementsByTagName('form');
  // Walk backwards: the collection is live and shrinks as nodes are removed.
  var i = forms.length;
  while (i--) {
    forms[i].parentNode.removeChild(forms[i]);
  }
  return div.innerHTML;
}

// Returns the inner markup of every <form>, concatenated in document order.
// Note: innerHTML does not include the <form> and </form> tags themselves.
function getForm(s) {
  var div = document.createElement('div');
  div.innerHTML = s;
  var forms = div.getElementsByTagName('form');
  var ret = '';
  for (var i = 0; i < forms.length; i++) {
    ret += forms[i].innerHTML;
  }
  return ret;
}

var a = 'before Form <form action="" method="post"> <input type="text" /> <input type="text" /> <input type="text" /> </form><br/> after form';
alert(getForm(a));
alert(stripForm(a));
console.log(stripForm(a));

answered Jan 29, 2015 at 10:18

Ahosan Karim Asik

1 Comment

Saif Over a year ago

Yes, that makes good sense. But as I said, the whole HTML page comes back as the response, so it may include tags like <html>, <meta>, <head> and <body>, and even internal scripts and styles. So I don't think it is a good idea to set the whole text as the innerHTML of a div and then parse it.
