Elasticsearch "simple_query_string" vs. "query_string" field analysis bug?

Question

Recently we discovered that, because we weren't sanitizing search terms as they came into our system, we would occasionally get parsing exceptions in Elasticsearch when special characters such as / (forward slash) were used with "query_string". So we decided to switch to "simple_query_string". However, we found that the two queries do not appear to use the same analyzers. I reviewed "When Analyzers Are Used" to see whether it indicated a difference between the simple and regular query string queries, but it did not, so I'm wondering if this is a bug. For example:

"query_string": { "query": "sales", "fields": [ "title" ] }

will use the analyzer for the "title" field which is our "en_analyzer" (see definition below) and properly stem "sales" to "sale" and find the matching documents. Simply changing "query_string" to "simple_query_string" will not. We have to search for "sale" or add an analyzer to the query, like so:

"simple_query_string": { "query": "sales", "fields": [ "title" ], "analyzer": "en_analyzer" }

Of course, not all of our fields are analyzed the same way, so the default behavior described in the documentation I referenced above makes perfect sense and is what we want. Is this a bug, or does "simple_query_string" just not behave the same way with respect to field analysis during a query? We are using ES 1.7.2.
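To make the comparison concrete, here is a minimal sketch that builds the request bodies discussed above. The helper name build_query and the surrounding "query" wrapper are illustrative assumptions, not part of any Elasticsearch client; only the query types, the "title" field, and "en_analyzer" come from the question.

```python
import json


def build_query(kind, text, fields, analyzer=None):
    """Build a search body for "query_string" or "simple_query_string".

    Hypothetical helper for illustration; it only assembles a dict.
    """
    if kind not in ("query_string", "simple_query_string"):
        raise ValueError("unsupported query type: %s" % kind)
    clause = {"query": text, "fields": list(fields)}
    # Only set "analyzer" when the caller asks for one explicitly.
    if analyzer is not None:
        clause["analyzer"] = analyzer
    return {"query": {kind: clause}}


# Matches documents: the question reports "sales" is stemmed to "sale".
regular = build_query("query_string", "sales", ["title"])

# Reported not to match on ES 1.7.2 without an explicit analyzer.
simple = build_query("simple_query_string", "sales", ["title"])

# Reported to match once the analyzer is named in the query.
simple_fixed = build_query(
    "simple_query_string", "sales", ["title"], analyzer="en_analyzer"
)

print(json.dumps(simple_fixed, indent=2))
```

The only differences between the three bodies are the query type and the optional "analyzer" key, which is what makes the change in behavior surprising.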

The relevant parts of our definition for "en_analyzer" are:

"en_analyzer": { "type": "custom", "tokenizer": "icu_tokenizer", "filter": [ "icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding", "shingle_filter" ], "char_filter": [ "html_strip" ] }

with:

"en_stop_filter": { "type": "stop", "stopwords": [ "_english_" ] }, "en_stem_filter": { "type": "stemmer", "name": "minimal_english" }

I also posted this question on GitHub, though I edited this version more carefully after asking there first. So far there has been no response there.

tdoman, you could also add your GitHub link too: github.com/elastic/elasticsearch/issues/15550
– Sazzad Hissain Khan, Commented Dec 28, 2015 at 19:21
It is working correctly on my dataset, so I do not think this is a bug. Could you try deleting and recreating the index?
– ChintanShah25, Commented Dec 28, 2015 at 19:35
@ChintanShah25 - yeah, I went ahead and tried creating a new index as you suggested just to be sure and, sadly, it behaves exactly as I described above. For a moment I thought it might be because I am using the bm25_similarity algorithm on my "title" field, so I also tried it without bm25 ... no joy. :(
– Thomas Doman, Commented Dec 28, 2015 at 23:15
@Val is right. Sorry for my previous comment; my query was wrong. It uses the standard analyzer by default.
– ChintanShah25, Commented Dec 29, 2015 at 14:22

Val · Accepted Answer · 2015-12-29 04:10:12Z


In 1.7.2, simple_query_string uses the default standard analyzer when none is specified and does not use any search analyzer defined on the field being searched. When the documentation doesn't say, one must turn to the ultimate source of knowledge, i.e. the source code. In SimpleQueryStringParser.java, the class comment states:

analyzer: analyzer to be used for analyzing tokens to determine which kind of query they should be converted into, defaults to "standard"

And a bit further down in the same class, we can read:

Use standard analyzer by default

And that behavior hasn't changed in the ES 2.x releases. As can be seen in the source code for SimpleQueryStringBuilder.java, if no analyzer is specified in the query, then the standard analyzer is used.

Quoting a comment from the source linked above:

Use standard analyzer by default if none specified

So to answer your question, that's not a bug, but the intended behavior.
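Given that behavior, and since the question mentions that not all fields are analyzed the same way, one possible workaround (not something taken from the Elasticsearch source or documentation quoted here) is to send one simple_query_string clause per analyzer and combine them in a bool query. The sketch below assumes a hypothetical helper name, hypothetical "summary" and "sku" fields, and that the caller already knows which analyzer each field uses.

```python
from collections import OrderedDict


def simple_query_per_analyzer(text, field_analyzers):
    """Group fields by analyzer and emit one clause per group.

    field_analyzers is a list of (field, analyzer) pairs supplied by the
    caller; nothing here reads the index mapping.
    """
    groups = OrderedDict()
    for field, analyzer in field_analyzers:
        groups.setdefault(analyzer, []).append(field)

    clauses = [
        {
            "simple_query_string": {
                "query": text,
                "fields": fields,
                # Name the analyzer explicitly so "standard" is not used.
                "analyzer": analyzer,
            }
        }
        for analyzer, fields in groups.items()
    ]
    # A document matches if any of the per-analyzer clauses matches.
    return {"query": {"bool": {"should": clauses}}}


body = simple_query_per_analyzer(
    "sales",
    [
        ("title", "en_analyzer"),
        ("summary", "en_analyzer"),  # hypothetical field
        ("sku", "keyword"),          # hypothetical field
    ],
)
```

Scores from the separate clauses are combined by the bool query, so ranking may differ from a single multi-field query; treat this as a starting point to test rather than a drop-in replacement.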



Thomas Doman Over a year ago

Of course! One of the benefits of using open source, I should have checked there. Anyway, thanks for the answer, I appreciate it! Do you have any idea if there's a reason it must behave that way? I suppose it could be in order to make it "simple" but it seems the simplest and most appropriate way would be to behave like query_string in this regard.
