I need to write a regexp query in Elasticsearch to filter some data. The field I filter on is a person's name. The data is not always well formatted: sometimes there is no first name, and sometimes the family name is followed by a period or a comma, or by a comma or period plus the first name.

For example, using "bouchard" I get the following matches:

 "bouchard", "bouchard, m.", "bouchard, j.", "bouchard j.p.", "bouchard. j.p."

I also need to exclude names that begin with the same prefix, like "bouchardat".

I tried many regexps and finally found that an exclusion may yield better results:

   "query" :  { "regexp" : {
                    "RECORDEDBY" : "bouchard([^a-z].*)"
    }}

This doesn't work: it returns "bouchard, m.", "bouchard, j.", and "bouchard j.p.", but neither "bouchard. j.p." nor "bouchard".

I tried some regexps with + and .*, but they don't work either:

( "bouchard([^a-z].*.*)" "bouchard([^a-z]*+.*)")

To make it clear, I want to allow:

bouchard
bouchard, m.
bouchard, j.
bouchard j.p.
bouchard. j.p.

I want to exclude

bouchardat

Any advice is welcome.

  • Could you please be more specific? What entries do you allow, and which ones do you want to disallow? The documentation says that Elasticsearch regexes are always anchored, so "RECORDEDBY" : "bouchard" will only allow bouchard, and "RECORDEDBY" : "bouchard.+" should allow any values starting with bouchard. Commented Mar 30, 2015 at 10:14
  • Sorry. I want to exclude "bouchardat" and to allow "bouchard", "bouchard, m.", "bouchard, j.", "bouchard j.p.", "bouchard. j.p.", and all entries with the same name followed by a combination of space/period/comma and any word. Commented Mar 30, 2015 at 10:55
  • Then, try using bouchard[^a-zA-Z]* Commented Mar 30, 2015 at 10:56
  • bouchard[^a-zA-Z]* returns only "bouchard", but not "bouchard, m.", "bouchard, j.", "bouchard j.p.", or "bouchard. j.p." Commented Mar 30, 2015 at 10:58
  • "bouchard[^a-zA-Z]*.*" returns "bouchard", "bouchard, m.", "bouchard, j.", "bouchard j.p.", and "bouchardat". It misses "bouchard. j.p." and allows "bouchardat". Commented Mar 30, 2015 at 11:01
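The anchoring behaviour mentioned in the first comment can be approximated outside Elasticsearch. The following is only a rough sketch: it uses Python's `re.fullmatch` as a stand-in for anchored matching, which is an assumption, since Python's regex dialect is not the same as the Lucene syntax Elasticsearch uses, and results on a real index also depend on how the field is mapped and analyzed. The names `NAMES`, `PATTERNS`, and `anchored_matches` are made up for this illustration.

```python
import re

# Sample values taken from the question.
NAMES = [
    "bouchard",
    "bouchard, m.",
    "bouchard, j.",
    "bouchard j.p.",
    "bouchard. j.p.",
    "bouchardat",
]

# Patterns quoted in the question and in the comments above.
PATTERNS = [
    "bouchard([^a-z].*)",
    "bouchard[^a-zA-Z]*",
    "bouchard[^a-zA-Z]*.*",
]


def anchored_matches(pattern, names):
    """Return the names that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


# Map each pattern to the names it accepts under anchored matching.
RESULTS = {pattern: anchored_matches(pattern, NAMES) for pattern in PATTERNS}

if __name__ == "__main__":
    for pattern, matched in RESULTS.items():
        print(f"{pattern!r} -> {matched}")
```

Under this emulation the first pattern cannot accept the bare "bouchard" because the group is mandatory, the second stops at the first letter after the separators, and the third accepts everything, including "bouchardat". The emulation does accept "bouchard. j.p." for the first pattern, so the missing hit reported above may have a cause other than the pattern itself; that is a guess, since the field mapping is not shown.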

1 Answer

In this case, you could use alternation to reject any [a-z] suffix unless a separator character such as ' ', '.', or ',' follows the word you are looking for:

((bouchard)+?([ .,]+)[ ,.a-zA-Z]*)|(bouchard[^a-zA-Z]?)

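This regexp matches "bouchard" through the alternative after the pipe |, and the other four names through the first alternative, which requires at least one separator ([ .,]+) after the name:

bouchard
bouchard, m.
bouchard, j.
bouchard j.p.
bouchard. j.p.

It rejects the following, because no [ .,]+ follows the name and the alternative after the pipe | allows at most one non-letter character: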

bouchardat
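To double-check the lists above, here is a small sketch that runs the pattern against the sample names and wraps it in a query body shaped like the one in the question. It assumes Python's `re.fullmatch` is a fair approximation of Elasticsearch's anchored matching; that is not guaranteed, because the Lucene regexp syntax differs from Python's (for instance, Python reads `+?` as a lazy quantifier, and Lucene may read it differently), so the pattern should still be verified against the actual index. The helper names `accepts` and `build_regexp_query` are hypothetical.

```python
import json
import re

# Pattern copied from the answer above.
ANSWER_PATTERN = r"((bouchard)+?([ .,]+)[ ,.a-zA-Z]*)|(bouchard[^a-zA-Z]?)"

WANTED = [
    "bouchard",
    "bouchard, m.",
    "bouchard, j.",
    "bouchard j.p.",
    "bouchard. j.p.",
]
UNWANTED = ["bouchardat"]


def accepts(pattern, value):
    """True if the pattern matches the whole value (anchored at both ends)."""
    return re.fullmatch(pattern, value) is not None


def build_regexp_query(field, pattern):
    """Build a query body with the same shape as the one in the question."""
    return {"query": {"regexp": {field: pattern}}}


if __name__ == "__main__":
    for name in WANTED + UNWANTED:
        print(f"{name!r}: {accepts(ANSWER_PATTERN, name)}")
    print(json.dumps(build_regexp_query("RECORDEDBY", ANSWER_PATTERN), indent=2))
```

In this emulation all five wanted names are accepted and "bouchardat" is rejected, which agrees with the lists above.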

Regex101

8 Comments

Thanks for the explained solution. Unfortunately, it doesn't return "bouchard". Don't the comma, space, and period characters need to be escaped?
Of course, the ? was missing at the end. Updated and added a Regex101 link.
And of course, you want to capture the whole group, so I updated it once more.
Thank you very much. Thanks for the Regex101 site too.
(bouchard+[^a-zA-Z]?) will match bouchardddddd (with any number of d's). I don't see the reason for the + after the d at all.
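The point of the last comment can be checked with a short sketch: an unparenthesised `+` applies only to the preceding character, here the final `d`. As a simplifying assumption, Python's `re.fullmatch` stands in for Elasticsearch's anchored matching; `EARLIER_PATTERN` is the pattern quoted in the comment, `PLAIN_PATTERN` is the second alternative from the answer, and `accepts` is a hypothetical helper.

```python
import re

# Pattern quoted in the comment: the + binds to the final "d" only.
EARLIER_PATTERN = r"(bouchard+[^a-zA-Z]?)"

# Second alternative of the answer's pattern, without the stray +.
PLAIN_PATTERN = r"bouchard[^a-zA-Z]?"


def accepts(pattern, value):
    """True if the pattern matches the whole value (anchored at both ends)."""
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    for value in ["bouchard", "bouchardddddd", "bouchard."]:
        print(
            f"{value!r}: earlier={accepts(EARLIER_PATTERN, value)} "
            f"plain={accepts(PLAIN_PATTERN, value)}"
        )
```

Both patterns accept "bouchard" and "bouchard.", but only the earlier one accepts "bouchardddddd".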