
I have documents that I want to index in Elasticsearch that contain a text field called name. I currently index the name using the snowball analyzer. However, I would like to match names both with and without their internal spaces. For example, a document with the name "The Home Depot" should match "homedepot", "home", and "home depot". Conversely, a document with a single-word name like "ExxonMobil" should match "exxon mobil" and "exxonmobil".

I can't seem to find the right combination of analyzer/filters to accomplish this.

2 Answers


I think the most direct approach to this problem is the shingle token filter, which, instead of creating ngrams of characters, creates combinations of adjacent incoming tokens. You can add it to your analyzer like this:

filter:
    ........
    my_shingle_filter:
        type: shingle
        min_shingle_size: 2
        max_shingle_size: 3
        output_unigrams: true
        token_separator: ""

You should be mindful of where this filter is placed in your filter chain. It should probably come late in the chain, after all token separation/removal/replacement has already occurred (i.e. after any stop filters, synonym filters, stemmers, etc.).
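To see what the shingle filter buys you here, the following is a minimal Python sketch of the token set it emits (a simplified model: the real filter also tracks token positions, but the output set is the same). Assuming the earlier stages of the chain have already lowercased "The Home Depot" and removed the stop word, the filter turns ["home", "depot"] into tokens that include the concatenated "homedepot":

```python
def shingle_filter(tokens, min_shingle_size=2, max_shingle_size=3,
                   output_unigrams=True, token_separator=""):
    """Rough model of Elasticsearch's shingle token filter: joins runs
    of adjacent tokens with token_separator, optionally keeping the
    original single tokens (output_unigrams)."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_shingle_size, max_shingle_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(token_separator.join(tokens[i:i + size]))
    return out

# "The Home Depot" after lowercasing and stop-word removal:
print(shingle_filter(["home", "depot"]))
# ['home', 'depot', 'homedepot']
```

Because output_unigrams is true, the single words survive alongside the joined shingle, which is what lets "home", "home depot", and "homedepot" all hit the same document.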


1 Comment

This sounds really promising. Let me check it out and I'll get back to you.

In this case, you might need to look at an ngram-type solution.

Ngram does something like this:

Given the text abcd, analyzing it with ngram might produce the tokens:

a
ab
abc
abcd
b
bc
bcd
c
cd
d

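The token list above can be reproduced with a short sketch (a simplified model of character ngram generation, assuming min_gram=1 and max_gram=4; names here are illustrative, not an Elasticsearch API):

```python
def char_ngrams(text, min_gram=1, max_gram=4):
    """Emit every substring of length min_gram..max_gram, grouped by
    start position, matching the token list above."""
    out = []
    for i in range(len(text)):
        for j in range(i + min_gram, min(i + max_gram, len(text)) + 1):
            out.append(text[i:j])
    return out

print(char_ngrams("abcd"))
# ['a', 'ab', 'abc', 'abcd', 'b', 'bc', 'bcd', 'c', 'cd', 'd']
```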
Below is a setting that might work for you.

You might need to tinker with the filter portion. This particular filter creates grams between 2 and 12 characters long.

Now, if you need the further analysis that snowball gives you (like water, waters, and watering all matching the token water), you will need to tinker further.

        "filter": {
            "ngram_filter": {
                "type": "nGram",
                "min_gram": 2,
                "max_gram": 12
            }
        },
        "analyzer": {
            "ngram_index": {
                "filter": [
                    "lowercase",
                    "ngram_filter"
                ],
                "tokenizer": "keyword"
            },
            "ngram_search": {
                "filter": [
                    "lowercase"
                ],
                "tokenizer": "keyword"
            }
        }
    },

The idea here is that at index time you want to create the right tokens to be available at search time. All you then need to do at search time is match those tokens; you don't need to reapply the ngram filter to the query.
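To see why the search analyzer needs no ngram filter: with the settings above, index time produces every 2-12 character gram of the lowercased name, and the keyword-tokenized, lowercased query just has to land on one of those grams. A sketch of that token-set logic (plain Python, not an Elasticsearch API):

```python
def index_grams(name, min_gram=2, max_gram=12):
    """Token set produced by ngram_index: keyword tokenizer,
    lowercase filter, then 2-12 character grams."""
    text = name.lower()
    return {text[i:j] for i in range(len(text))
            for j in range(i + min_gram, min(i + max_gram, len(text)) + 1)}

def search_gram(query):
    """Single token produced by ngram_search: keyword tokenizer + lowercase."""
    return query.lower()

print(search_gram("exxonmobil") in index_grams("ExxonMobil"))  # True
print(search_gram("exxon") in index_grams("ExxonMobil"))       # True
```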

EDIT:

One last thing I just noticed, this requirement: "ExxonMobil" should match "exxon mobil".

That probably means you will need to do something like this:

            "ngram_search": {
                "filter": [
                    "lowercase"
                ],
                "tokenizer": "whitespace"

            }

Note the use of the "whitespace" tokenizer instead of keyword. This allows the search to split the query on whitespace.
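The difference is easy to see in a sketch (assuming the same 2-12 character grams of "exxonmobil" on the index side; names here are illustrative): the keyword tokenizer keeps "exxon mobil" as one token containing a space, which no gram of "exxonmobil" will ever contain, while the whitespace tokenizer splits it into two tokens that both appear among the grams.

```python
# Index-side tokens: 2-12 character grams of the lowercased "exxonmobil"
grams = {"exxonmobil"[i:j]
         for i in range(10)
         for j in range(i + 2, min(i + 12, 10) + 1)}

def keyword_search(query):      # keyword tokenizer: whole query is one token
    return [query.lower()]

def whitespace_search(query):   # whitespace tokenizer: split on spaces
    return query.lower().split()

print(keyword_search("Exxon Mobil"))     # ['exxon mobil'] -- not in grams
print(whitespace_search("Exxon Mobil"))  # ['exxon', 'mobil'] -- both in grams
print(all(t in grams for t in whitespace_search("Exxon Mobil")))  # True
```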

6 Comments

@DavidPfeffer min_gram and max_gram set the min and max gram size. You will need to tinker with the size to find the right combination for your use case. 2/12 should be a reasonable start though
Right, but those are min and max character counts, correct? That doesn't help me because I need to combine words, not arbitrary characters inside those words.
Those counts are sizes of grams. As I said, you may need to tweak the min/max. Your requirement seems to be that exxonmobil, exxon, mobil, and exxon mobil should all match the same document, right? If so, my example above should work.
I understand that you said that the gram size must be tweaked. However, the grams are being constructed of arbitrary pieces of text within the words, not the words and only the words. I don't want "onmobi" to match.
@DavidPfeffer I see; well, that's asking Elasticsearch to know what is a "valid" word and what isn't, which would need to be defined on your end. For example, The Home Depot contains: The, Home, Depot, Pot, Me, He, and ExonMobil contains Mob, Exo, On, and possibly Bil. Remember, snowball works by using common grammatical structures (which is why waterly matches documents containing water). Maybe the word delimiter token filter can get you in the right direction: elasticsearch.org/guide/en/elasticsearch/reference/current/….
