My database is synced with Elasticsearch to make our search requests faster and improve the results.

I have an issue querying users. I want to look up users with a single query term, which can be part of a name, phone number, IP address, ...

My current query is:

query_string: { fields: ['id', 'email', 'firstName', 'lastName', 'phone', 'ip'], query: `*${escapeElastic(req.query.search.toString().toLowerCase())}*`}

Here req.query.search is the search term, and escapeElastic comes from the Node module elasticsearch-sanitize, which I added because I had issues with some symbols.

For example, if I search for an IPv6 address, the query becomes query: '*2001\\:0db8*', but it finds nothing in the database even though it should.

Another issue: if someone has the firstName john-doe, the query becomes query: '*john\\-doe*' and it returns no results.

It seems the escaping prevents query errors but creates other issues in my case.

I do not know if query_string is the best way to do this request; I am open to suggestions for optimizing this query.

Thanks

1 Answer

I suspect the analyzer on your fields is standard or similar. This means chars like : and - were stripped when the values were indexed:

GET _analyze
{
  "text": "John-Doe",
  "analyzer": "standard"
}

showing

{
  "tokens" : [
    {
      "token" : "john",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "doe",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
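
To see why the wildcard search comes back empty, it may help to picture the wildcard being compared against each indexed token separately rather than against the original string. The snippet below is only a rough illustration of that idea in plain JavaScript: the token list is copied from the _analyze response above, while wildcardToRegExp and matchingTokens are hypothetical helpers written for this example, not part of Elasticsearch or any client library.

```javascript
// Tokens copied from the _analyze response above for "John-Doe".
const indexedTokens = ["john", "doe"];

// Hypothetical helper (not an Elasticsearch API): convert a wildcard pattern
// using * and ? into a RegExp that has to match one whole token.
function wildcardToRegExp(pattern) {
  // Escape regex metacharacters, but leave * and ? alone for the next step.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const source = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp(`^${source}$`);
}

// Return the tokens that the wildcard pattern matches, one token at a time.
function matchingTokens(pattern, tokens) {
  const re = wildcardToRegExp(pattern.toLowerCase());
  return tokens.filter((token) => re.test(token));
}

console.log(matchingTokens("*john-doe*", indexedTokens)); // [] because no single token contains "john-doe"
console.log(matchingTokens("*john*", indexedTokens)); // [ 'john' ]
```

Under that simplified model, the hyphen only exists in the search term, never in the index, so the pattern has nothing to match.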

Let's create our own analyzer that keeps the special chars but lowercases all other chars at the same time:

PUT multisearch
{
  "settings": {
    "analysis": {
      "analyzer": {
        "with_special_chars": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "fields": {
          "with_special_chars": {
            "type": "text",
            "analyzer": "with_special_chars"
          }
        }
      },
      "ip": {
        "type": "ip",
        "fields": {
          "with_special_chars": {
            "type": "text",
            "analyzer": "with_special_chars"
          }
        }
      }
    }
  }
}
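
As a rough mental model of what this analyzer produces, here is a small JavaScript approximation: split on whitespace, then lowercase each token. It is only a sketch (JavaScript's \s may not cover exactly the same characters as the whitespace tokenizer), and analyzeWithSpecialChars is a made-up name for this example.

```javascript
// Approximation of the with_special_chars analyzer defined above:
// a whitespace tokenizer followed by a lowercase filter.
function analyzeWithSpecialChars(text) {
  return text
    .split(/\s+/) // split on runs of whitespace only
    .filter((token) => token.length > 0) // drop empty pieces from leading/trailing spaces
    .map((token) => token.toLowerCase()); // lowercase filter
}

console.log(analyzeWithSpecialChars("John-Doe"));
// [ 'john-doe' ]
console.log(analyzeWithSpecialChars("2001:0db8:85a3:0000:0000:8a2e:0370:7334"));
// [ '2001:0db8:85a3:0000:0000:8a2e:0370:7334' ]
```

Because the hyphen and the colons stay inside a single token, a pattern such as john-* or 2001:0db8* has something to match against.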

Ingesting 2 sample docs:

POST multisearch/_doc
{
  "ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
}

POST multisearch/_doc
{
   "firstName": "John-Doe"
}

and applying your query from above:

GET multisearch/_search
{
  "query": {
    "query_string": {
      "fields": [
        "id",
        "email",
        "firstName.with_special_chars",
        "lastName",
        "phone",
        "ip.with_special_chars"
      ],
      "query": "2001\\:0db8* OR john-*"
    }
  }
}

both hits are returned.


Two remarks: 1) note that we are searching the .with_special_chars subfields instead of the main fields and 2) I've removed the leading wildcards -- those are highly inefficient.
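
On the Node side, the query body from the question could be adapted along these lines. This is a minimal sketch, not a drop-in replacement: escapeQueryString is a hypothetical stand-in written for this example (it is not the elasticsearch-sanitize implementation), and buildUserSearchQuery is likewise an assumed name.

```javascript
// Hypothetical helper: backslash-escape the characters that query_string
// treats as operators. < and > cannot be escaped, so they are removed.
function escapeQueryString(input) {
  return input
    .replace(/[<>]/g, "")
    .replace(/[+\-=!(){}[\]^"~*?:\\\/&|]/g, "\\$&");
}

// Build the query_string clause for a single user-supplied search term.
function buildUserSearchQuery(search) {
  const term = escapeQueryString(String(search ?? "").trim().toLowerCase());
  if (term === "") {
    return { match_none: {} }; // avoid sending a bare "*" that matches everything
  }
  return {
    query_string: {
      fields: [
        "id",
        "email",
        "firstName.with_special_chars", // subfield from the mapping above
        "lastName",
        "phone",
        "ip.with_special_chars", // subfield from the mapping above
      ],
      query: `${term}*`, // trailing wildcard only, no leading *
    },
  };
}

console.log(JSON.stringify(buildUserSearchQuery("2001:0db8"), null, 2));
```

For the input 2001:0db8 this yields the same escaped term as in the request above, 2001\\:0db8* once serialized as JSON.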


Final tips since you asked for optimization suggestions: the query could be rewritten as

GET multisearch/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id": "tegO63EBG_KW3EFnvQF8"
          }
        },
        {
          "match": {
            "email": "[email protected]"
          }
        },
        {
          "match_phrase_prefix": {
            "firstName.with_special_chars": "john-d"
          }
        },
        {
          "match_phrase_prefix": {
            "lastName.with_special_chars": "john-d"
          }
        },
        {
          "match": {
            "phone.with_special_chars": "+151351"
          }
        },
        {
          "wildcard": {
            "ip.with_special_chars": {
              "value": "2001\\:0db8*"
            }
          }
        }
      ]
    }
  }
}

  1. Partial id matching is probably overkill -- either the term query catches it or it doesn't
  2. email can simply be matched
  3. first- & lastName: I suspect match_phrase_prefix is more performant than wildcard or regexp, so I'd go with that (as long as you don't need the leading *)
  4. phone can be matched, but do make sure special chars can be matched too (if you use the int'l format)
  5. use wildcard for the ip -- same syntax as in the query string

Try the above and see if you notice any speed improvements!
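
If your code only receives a single search string, the same value can be fed into every should clause. Here is a minimal sketch of how that might look in Node. buildShouldQuery and escapeWildcard are hypothetical names for this example, and the lastName.with_special_chars and phone.with_special_chars subfields are assumed to be mapped the same way as the firstName one (they are not part of the mapping shown earlier).

```javascript
// Hypothetical helper: escape the characters the wildcard query treats
// specially (\, * and ?) so user input is matched literally.
function escapeWildcard(input) {
  return input.replace(/[\\*?]/g, "\\$&");
}

// Build a bool/should query where every clause gets the same search term.
function buildShouldQuery(search) {
  const raw = String(search ?? "").trim();
  const lowered = raw.toLowerCase(); // the subfield stores lowercased tokens
  return {
    bool: {
      should: [
        { term: { id: raw } },
        { match: { email: raw } },
        { match_phrase_prefix: { "firstName.with_special_chars": raw } },
        // Assumed subfields, not defined in the mapping shown above:
        { match_phrase_prefix: { "lastName.with_special_chars": raw } },
        { match: { "phone.with_special_chars": raw } },
        {
          wildcard: {
            "ip.with_special_chars": { value: `${escapeWildcard(lowered)}*` },
          },
        },
      ],
    },
  };
}

console.log(JSON.stringify(buildShouldQuery("2001:0db8"), null, 2));
```

The sketch leaves the colon unescaped in the wildcard value, on the assumption that only \, * and ? are special there; escaping it as in the request above should be equivalent.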

2 Comments

Thanks a lot for the explanation. Only one issue with the optimization: in my code I receive only one variable to search across all the fields, so I cannot apply your optimization, as I would need a different variable for each field. In my case I do not know whether the request is an IP, a phone number or an email.
You're welcome. I suppose that's fine -- you don't need to know. That's why I used a should condition, so that at least one clause should match... Whichever it is, is for ES to decide.