
I need a regex to find <Field ...name="document"> or <FieldArray ...name="document"> and replace it with an empty string. The tags can be defined across multiple lines.

This is not HTML or XHTML; it's just a text string containing <Field> and <FieldArray>.

Example with Field:

      <Field
        component={FormField}
        name="document"
        typeInput="selectAutocomplete"
      />

Example with FieldArray:

      <FieldArray
        component={FormField}
        typeInput="selectAutocomplete"
        name="document"
      />

They are inside a list of components. Example:

      <Field
        name="amount"
        component={FormField}
        label={t('form.amount')}
      />
      <Field
        name="datereception"
        component={FormField}
        label={t('form.datereception')}
      />
      <Field
        component={FormField}
        name="document"
        typeInput="selectAutocomplete"
      />
      <Field
        name="datedeferred"
        component={FormField}
        label={t('form.datedeferred')}
      />

I have read some solutions, such as the one for finding src in "Extract image src from a string", but that structure is different from what I'm looking for.

  • You should check stackoverflow.com/a/1732454/3195314 Commented Dec 18, 2017 at 8:34
  • This is not HTML or XHTML, it's just a string with 2 properties Commented Dec 19, 2017 at 12:16

3 Answers


It is not advisable to parse [X]HTML with regex. If you have a possibility to use a domparser, I would advise using that instead of regex.

If there is no other way, you could use this approach to find and replace your data:

<Field(?:Array)?\b(?=[^\/>]+name="document")[^>]+\/>

Explanation

  • Match <Field, optionally followed by "Array", ending at a word boundary <Field(?:Array)?\b
  • A positive lookahead (?=
  • Which asserts that what follows is one or more characters other than / or >, and then name="document" [^\/>]+name="document"
  • Match any character except > one or more times [^>]+
  • Match the closing \/>

var str = `<Field
    name="amount"
    component={FormField}
    label={t('form.amount')}
  />
  <Field
    name="datereception"
    component={FormField}
    label={t('form.datereception')}
  />
  <Field
    component={FormField}
    name="document"
    typeInput="selectAutocomplete"
  />
  <Field
    name="datedeferred"
    component={FormField}
    label={t('form.datedeferred')}
  />
<FieldArray
    component={FormField}
    typeInput="selectAutocomplete"
    name="document"
  /><FieldArray
    component={FormField}
    typeInput="selectAutocomplete"
    name="document"
  />` ;
str = str.replace(/<Field(?:Array)?\b(?=[^\/>]+name="document")[^>]+\/>/g, "");
console.log(str);
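
The word boundary and the lookahead each do a separate job, which is easiest to see on input that should not match. A minimal sketch, assuming a hypothetical helper name removeDocumentFields and a made-up <FieldDistractor> tag used only for illustration:

```javascript
// Same pattern as in the answer above, with the g flag so every match is removed.
const DOCUMENT_FIELD = /<Field(?:Array)?\b(?=[^\/>]+name="document")[^>]+\/>/g;

// Hypothetical helper: strips <Field>/<FieldArray> tags carrying name="document".
function removeDocumentFields(source) {
  return source.replace(DOCUMENT_FIELD, "");
}

const sample = [
  '<Field name="amount" component={FormField} />',       // kept: lookahead finds no name="document"
  '<Field component={FormField} name="document" />',      // removed
  '<FieldArray typeInput="selectAutocomplete" name="document" />', // removed
  '<FieldDistractor name="document" />',                  // kept: \b fails between "Field" and "Distractor"
].join("\n");

const cleaned = removeDocumentFields(sample);
console.log(cleaned);
```

Note that this pattern assumes no attribute value contains a / before the name attribute or a > anywhere in the tag; a value such as an arrow function would end the match early.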

7 Comments

i did not test your code in mine, but i think it's going to work, my code is not xhtml or html, just component tags <Tag />
Given the generous lookahead here, your optional (?:Array)? doesn't do anything. maybe you intended to have a \b after it to denote the end of that tag? Also, your [\s\S]+? (nongreedy expansion) is expensive. Why not use [^>]+ instead? <Field(?:Array)?\b(?=[^\/>]+name="document")[^>]+\/>. You might also be interested in using template literals for multi-line strings to clean up that example. I'm not sure why there's a -1 on this answer, it looks good to me.
@DDave – It looks like your code is XML, which has the same issue. You're still better off using an actual XML parser. DOM parsers can handle this.
@AdamKatz Thank you for your comment! I have updated my answer.
You may not believe this, but it's not good enough to use [^>]. Your regex matches <Field but = "name="document"/> which is valid html but does not contain the name="document" attrib/value.

Here's an answer with actual XML parsing and no regular expressions:

var xml = document.createElement("xml");
xml.innerHTML = `
      <Field
        name="amount"
        component={FormField}
        label={t('form.amount')}
      />
      <FieldDistractor
        component={FormField}
        name="document"
        typeInput="selectAutocomplete"
      />
      <Field
        name="datereception"
        component={FormField}
        label={t('form.datereception')}
      />
      <Field
        component={FormField}
        name="document"
        typeInput="selectAutocomplete"
      />
      <Field
        name="datedeferred"
        component={FormField}
        label={t('form.datedeferred')}
      />
      <FieldArray
        component={FormField}
        typeInput="selectAutocomplete"
        name="document"
      /><FieldArray
        component={FormField}
        typeInput="selectAutocomplete"
        name="document"
      />
`;

var match = xml.querySelectorAll(
  `field:not([name="document"]), fieldarray:not([name="document"]),
    :not(field):not(fieldarray)`
);
var answer = "";
for (var m=0, ml=match.length; m<ml; m++) {
  // cloning the node removes children, working around the DOM bug
  answer += match[m].cloneNode().outerHTML + "\n";
}
console.log(answer);

In writing this answer, I found a bug in the DOM parser for both Firefox (Mozilla Core bug 1426224) and Chrome (Chromium bug 796305) that didn't allow creating empty elements via innerHTML. My original answer used regular expressions to pre- and post-process the code to make it work, but using regexes on XML is so unsavory that I later changed it to merely strip off children by using cloneNode() (with its implicit deep=false).

So we dump the XML into a dummy DOM element (which we don't need to place anywhere), then we run querySelectorAll() to match some CSS that specifies your requirements:

  • field:not([name="document"]) "Field" elements lacking name="document" attributes, or
  • fieldarray:not([name="document"]) "FieldArray" elements lacking that attribute, or
  • :not(field):not(fieldarray) Any other element

4 Comments

This [^>] by itself isn't sufficient to parse html tags.
I removed the regex code and used a non-regex workaround rather than dealing with ridiculously arcane XML-parsing issues (which are the reason for avoiding regexes in the first place).
Yeah but nobody's talking about parsing XML/XHTML/HTML. The issue is parsing tags or markup. Note that the given specs by w3c are written using regex to begin with. A typical use is a SAX parser. In case you don't think regex can be used, you can take a look at this, which strips all html markup and invisible content from any html source: regex101.com/r/4jvwsH/1
This is not a bug in either Chrome's or Firefox's DOM Parser. There are a limited number of empty elements in HTML, HTML is not XML.

You can parse HTML tags with regex because parsing the tags themselves is nothing special; they are the first thing parsed, as an atomic operation.

But, you can't use regex to go beyond the atomic tag.
For example, you can't find the balanced tag closing to match the open as
this would put a tremendous strain on regex capability.

What a Dom parser does is use regex to parse the tags, then uses internal
algorithms to create a tree and carry out processing instructions to interpret
and recreate an image.
And of course regex doesn't do that.

Sticking to strictly parsing tags, including invisible content (like script),
is not that easy either.
Content can hide or embed tags that you shouldn't find when you look
for them.

So, in essence, you have to parse the entire html file to find the real
tag you're looking for.
There is a general regex that can do this that I will not include here.
But if you need it let me know.

So, if you want to jump straight into the fire without parsing all the
tags of the entire file, this is the regex to use.

It is essentially a cut up version of the one that parses all tags.
This flavor finds the tag and any attribute=value that you need,
and also finds them out-of-order.
It can also be used to find out-of-order, multiple attr/val's within the same tag.

This is for your usage:

/<Field(?:Array)?(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sname\s*=\s*(?:(['"])\s*document\s*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+\/>/

Explained/Formatted

 < Field                # Field or  FieldArray  tag
 (?: Array )?

 (?=                    # Assertion (a pseudo atomic group)
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s name \s* = \s* 
      (?:
           ( ['"] )               # (1), Quote
           \s* document \s*       # With name = "document"
           \1 
      )
 )
 \s+ 
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 />

Running demo: https://regex101.com/r/ieEBj8/1
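
For completeness, here is a sketch of applying this pattern from JavaScript. The g flag is an addition made here (the pattern above is shown without it) so that every occurrence is replaced, and the variable names and sample input are illustrative only. The second tag uses name = 'document' with spaces and single quotes to exercise the quote backreference.

```javascript
// The pattern from above, plus the g flag (added here) to replace every occurrence.
const documentTag = /<Field(?:Array)?(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sname\s*=\s*(?:(['"])\s*document\s*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+\/>/g;

const input = `
<Field
  name="amount"
  component={FormField}
  label={t('form.amount')}
/>
<Field
  component={FormField}
  name = 'document'
  typeInput="selectAutocomplete"
/>
<FieldArray
  component={FormField}
  typeInput="selectAutocomplete"
  name="document"
/>`;

// Remove every Field/FieldArray tag whose name attribute is "document".
const output = input.replace(documentTag, "");
console.log(output);
```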

3 Comments

Dave - This is grade A stuff. If I were you I'd write it down so you don't lose it ..
Thanks sln, I'm going to study your code. My code is not full HTML, it's just a string containing Field and FieldArray. I did not understand what you mean by 'write it down'.
@DDave - If it were just a string containing Field and FieldArray then you can't tell where they begin and end compared to something else without using delimiter parsing rules. Especially when you're looking for a specific attribute / value (or ah, sub-expression I mean). Don't think you're fooling anybody. What I mean by write it down is, this regex form is a gold standard I developed years ago and has been used for big scraping projects. I disseminate it freely, but I don't often fully explain it (by design). This is custom for you, different for someone else, etc..
