Enable autocomplete querying in ElasticSearch

Question

I am trying to build an Elasticsearch index that holds documents with product names, for example laptops:

{ "name" : "Laptop Blue I7"}

Then I want to use it for autocomplete suggestion by querying the ES index. I have 2 main constraints:

There can be synonyms of the name:

I want to define synonyms for terms, like "Notebook" for "Laptop". The ingested documents can be of the following kind:

"Laptop Blue I7"
"Laptop Blue I7"
"Laptop Blue I7"
"Laptop Blue I7"
"Laptop Red I7"
"Laptop Red I7"
"Notebook Blue I7"

Now, I am adding the following settings and mapping file while creating the index -

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": ["Laptop,Notebook"]
          }
        },
        "analyzer": {
          "synonym": {
            "tokenizer": "keyword",
            "filter": ["synonym"]
          }
        }
      }
    }
  },
  "mappings": {
    "catalog": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "synonym"
        }
      }
    }
  }
}

Querying -

When I query the data, with "Notebook", the preferred response should be ordered in terms of frequency and synonym. However, when I query, the response is normally independent of the synonym and frequency. I use the following query -

/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Notebook"
    }
  }
}

The response I get is -

"Notebook Blue I7"

While I would hope the response to be either of the following -

"Laptop Blue I7"
"Laptop Red I7"

or

"Notebook Blue I7"
"Laptop Blue I7"
"Laptop Red I7"

Any insights in resolving this would be helpful. Thanks

======== Edit 1:

When I use _analyze on "Notebook", the response is:

{'tokens': [{'end_offset': 3,
             'position': 0,
             'start_offset': 0,
             'token': 'Notebook',
             'type': '<ALPHANUM>'},
            {'end_offset': 3,
             'position': 0,
             'start_offset': 0,
             'token': 'Laptop',
             'type': 'SYNONYM'}]}

Amit · Accepted Answer · 2019-05-27 10:06:17Z


The issue is with the keyword tokenizer you used in your synonym analyzer. To debug it, do the following:

Check the tokens generated for your matched and unmatched documents using the analyze API.
Use the explain API to see which tokens the query generates and how they match against your inverted index.

If the tokens generated for your documents in the inverted index match the tokens generated from your search term, Elasticsearch reports a match, and the explain output gives additional information such as how many documents in the shard matched the search term and how the score was computed.
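As a rough sketch of those two debugging calls, the snippet below only builds the request method, path, and JSON body, so it runs without a cluster. The index name `products` and document id `1` are hypothetical, and the typed `_explain` path assumes a 6.x-style cluster with the `catalog` mapping type from the question; newer versions use `/<index>/_explain/<id>` instead.

```python
import json


def analyze_request(index, analyzer, text):
    """Build a request for the _analyze API using a named analyzer of the index."""
    return "GET", f"/{index}/_analyze", {"analyzer": analyzer, "text": text}


def explain_request(index, doc_type, doc_id, field, query):
    """Build a request for the explain API against one document.

    The typed path form is an assumption based on the 'catalog' mapping type.
    """
    body = {"query": {"query_string": {"default_field": field, "query": query}}}
    return "GET", f"/{index}/{doc_type}/{doc_id}/_explain", body


# 'products' and the id '1' are placeholder names for illustration only.
requests_to_send = [
    analyze_request("products", "synonym", "Laptop Blue I7"),
    analyze_request("products", "synonym", "Notebook"),
    explain_request("products", "catalog", "1", "name", "Notebook"),
]

for method, path, body in requests_to_send:
    print(method, path)
    print(json.dumps(body, indent=2))
```

Comparing the tokens returned for the document text with the tokens returned for the search term shows directly whether they can ever match.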

These are only basic troubleshooting steps. You have not yet implemented a proper autocomplete search, which should also return results for partial inputs such as note and lapt in your case. To implement this you need an edge n-gram analyzer, and this ES official post can help you implement it.
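To illustrate what edge n-grams are, here is a small Python sketch. The `edge_ngrams` function only imitates the prefixes an `edge_ngram` token filter would emit for one token, and the analysis fragment below it is an assumed configuration: the names `autocomplete_filter` and `autocomplete` and the gram sizes 2 and 10 are illustrative choices, not taken from the question.

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the prefixes of `token` with lengths from min_gram to max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Assumed analysis fragment; filter and analyzer names are hypothetical.
AUTOCOMPLETE_ANALYSIS = {
    "filter": {
        "autocomplete_filter": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10}
    },
    "analyzer": {
        "autocomplete": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "autocomplete_filter"],
        }
    },
}

print(edge_ngrams("notebook"))
print(edge_ngrams("laptop"))
```

Because the prefixes note and lapt are stored in the index as tokens, a partial query can match them. Such setups usually apply the n-gram analyzer only at index time and a plain analyzer at search time, so that the query text itself is not split into prefixes.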

Let me know if you face any other issue or require any clarification.



7 Comments

User54211 Over a year ago

The reason I chose keyword over standard was that for multi-word synonyms like "USB3 Cable, Type C", standard would split at whitespace and cause a mismatch. I had hoped keyword would preserve the spacing. Is there any other way to deal with multi-word synonyms?

Amit Over a year ago

@User54211 multi-word synonyms require special handling, and medium.com/@purbon/… is an interesting read on this.

Amit Over a year ago

@User54211, meanwhile let me know if you understand why the single-word synonym didn't work for you.

User54211 Over a year ago

I think I did, I will go through the edge n gram link you shared and try implementing based on that. Once I change the mapping to standard, I am curious if there is a way to get the current code to work for the full query of "Notebook"

Amit Over a year ago

@User54211, can you provide the tokens generated for your documents using the _analyze API link I shared earlier in my answer? This will help to identify whether we can do something.


Community · Accepted Answer · 2020-06-20 09:12:55Z

As Amit mentioned, edge n-grams are what you should consider for implementing autocomplete. I would like to explain why the settings you used didn't yield the expected result when you queried for the complete word Notebook. For this, let's look at how the analyzer above works.

The synonym analyzer defined in the settings has two components: a tokenizer and a token filter. For an input string, the tokenizer is applied first and produces one or more tokens. These tokens are then the input of the token filter, which produces the final set of tokens.

You can read more on how analyzers work here.

Now let's consider the first example, `Laptop Blue I7`.

For this input string the keyword tokenizer is applied first. As you may know, the keyword tokenizer takes the input string and emits it unchanged as a single token, so the tokenizer output is the single token Laptop Blue I7. This token then becomes the input of the synonym token filter. According to the definition, Laptop and Notebook are synonyms, but neither of them equals the token Laptop Blue I7, so the filter does nothing and passes the token through as it is. The final token generated is therefore Laptop Blue I7.

So when you search for Notebook, it will not match a document whose name has the value above.

Note that if the input string is just Laptop or Notebook, you will get the expected tokens, because the keyword tokenizer generates a single-word token for that input. This is why _analyze on "Notebook" gives you the expected result.
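The walk-through above can be mimicked with a toy model in Python. This is a simplification and not Elasticsearch's actual implementation: `keyword_tokenizer`, `word_tokenizer` (a rough stand-in for a tokenizer that splits on word boundaries), `synonym_filter`, and `matches` are hypothetical helpers, and positions and scoring are ignored.

```python
import re

# Mirrors the single rule "Laptop,Notebook" from the question's settings.
SYNONYMS = [{"Laptop", "Notebook"}]


def keyword_tokenizer(text):
    """Emit the whole input as one token, unchanged."""
    return [text]


def word_tokenizer(text):
    """Rough stand-in for a word-splitting tokenizer; case is preserved."""
    return re.findall(r"\w+", text)


def synonym_filter(tokens):
    """Add the other members of a synonym group for every exactly matching token."""
    out = []
    for tok in tokens:
        out.append(tok)
        for group in SYNONYMS:
            if tok in group:
                out.extend(sorted(group - {tok}))
    return out


def analyze(text, tokenizer):
    return synonym_filter(tokenizer(text))


def matches(query, doc, tokenizer):
    """A document matches if it shares at least one token with the query."""
    return bool(set(analyze(query, tokenizer)) & set(analyze(doc, tokenizer)))


print(analyze("Laptop Blue I7", keyword_tokenizer))   # one token, no synonym added
print(analyze("Notebook", keyword_tokenizer))         # synonym added for a single word
print(matches("Notebook", "Laptop Blue I7", keyword_tokenizer))
print(matches("Notebook", "Laptop Blue I7", word_tokenizer))
```

With the keyword-style tokenizer the document token is the whole string, so the synonym rule never fires and the query cannot match. Once the text is split into words, both the document and the query carry Laptop and Notebook.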

So the conclusion is that the keyword tokenizer is the culprit here. To solve this we need a tokenizer that generates separate tokens such as Laptop, Blue, and I7. The easiest way to do this is to use the standard tokenizer instead of keyword.
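A minimal sketch of that change, written as a Python dict so it can be dumped to JSON and sent as the index-creation body. It is the question's own settings and mapping with only the tokenizer swapped; everything else, including the `catalog` mapping type, is assumed to stay as posted.

```python
import json

# Same body as in the question, except "tokenizer" is "standard" instead of "keyword".
index_body = {
    "settings": {
        "index": {
            "analysis": {
                "filter": {
                    "synonym": {"type": "synonym", "synonyms": ["Laptop,Notebook"]}
                },
                "analyzer": {
                    "synonym": {"tokenizer": "standard", "filter": ["synonym"]}
                },
            }
        }
    },
    "mappings": {
        "catalog": {
            "properties": {"name": {"type": "text", "analyzer": "synonym"}}
        }
    },
}

print(json.dumps(index_body, indent=2))
```

An analyzer is applied at index time, so existing documents would need to be reindexed into an index created with the new settings before the query for Notebook can match them.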

Dealing with multi-word synonyms

This answer might help you.
