
I have documents that I want to index in Elasticsearch that contain a text field called name. I currently index the name using the snowball analyzer. However, I would like to match names both with and without their internal spaces. For example, a document with the name "The Home Depot" should match "homedepot", "home", and "home depot". Conversely, a document with a single-word name like "ExxonMobil" should match "exxon mobil" and "exxonmobil".

I can't seem to find the right combination of analyzer/filters to accomplish this.

2 Answers


I think the most direct approach to this problem is the shingle token filter, which, instead of creating ngrams of characters, creates combinations of adjacent incoming tokens. You can add it to your analyzer like this:

filter:
    ........
    my_shingle_filter:
        type: shingle
        min_shingle_size: 2
        max_shingle_size: 3
        output_unigrams: true
        token_separator: ""

You should be mindful of where this filter is placed in your filter chain. It should probably come late in the chain, after all token separation/removal/replacement has already occurred (i.e. after any stop filters, synonym filters, stemmers, etc.).
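To see what the shingle filter buys you here, the following is a minimal Python sketch of the token set it emits (a simplified model: the real filter also tracks token positions, but the output set is the same). Assuming the earlier stages of the chain have already lowercased "The Home Depot" and removed the stop word, the filter turns ["home", "depot"] into tokens that include the concatenated "homedepot":

```python
def shingle_filter(tokens, min_shingle_size=2, max_shingle_size=3,
                   output_unigrams=True, token_separator=""):
    """Rough model of Elasticsearch's shingle token filter: joins runs
    of adjacent tokens with token_separator, optionally keeping the
    original single tokens (output_unigrams)."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_shingle_size, max_shingle_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(token_separator.join(tokens[i:i + size]))
    return out

# "The Home Depot" after lowercasing and stop-word removal:
print(shingle_filter(["home", "depot"]))
# ['home', 'depot', 'homedepot']
```

Because output_unigrams is true, the single words survive alongside the joined shingle, which is what lets "home", "home depot", and "homedepot" all hit the same document.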


1 Comment

This sounds really promising. Let me check it out and I'll get back to you.

In this case, you might need to look at an ngram-type solution.

Ngram does something like this:

Given the text abcd, analyzing it with ngram might produce the tokens:

a
ab
abc
abcd
b
bc
bcd
c
cd
d

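The token list above can be reproduced with a short sketch (a simplified model of character ngram generation, assuming min_gram=1 and max_gram=4; names here are illustrative, not an Elasticsearch API):

```python
def char_ngrams(text, min_gram=1, max_gram=4):
    """Emit every substring of length min_gram..max_gram, grouped by
    start position, matching the token list above."""
    out = []
    for i in range(len(text)):
        for j in range(i + min_gram, min(i + max_gram, len(text)) + 1):
            out.append(text[i:j])
    return out

print(char_ngrams("abcd"))
# ['a', 'ab', 'abc', 'abcd', 'b', 'bc', 'bcd', 'c', 'cd', 'd']
```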
Below is a setting that might work for you.

You might need to tinker with the filter portion. This particular filter creates grams between 2 and 12 characters long.

Now, if you need the further analysis that snowball gives you (like water, waters, and watering all matching the token water), you will need to tinker further.

        "filter": {
            "ngram_filter": {
                "type": "nGram",
                "min_gram": 2,
                "max_gram": 12
            }
        },
        "analyzer": {
            "ngram_index": {
                "filter": [
                    "lowercase",
                    "ngram_filter"
                ],
                "tokenizer": "keyword"
            },
            "ngram_search": {
                "filter": [
                    "lowercase"
                ],
                "tokenizer": "keyword"
            }
        }
    },

The idea here is that at index time you want to create the right tokens to be available at search time. All you then need to do at search time is match those tokens; you don't need to reapply the ngram filter to the query.
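To see why the search analyzer needs no ngram filter: with the settings above, index time produces every 2-12 character gram of the lowercased name, and the keyword-tokenized, lowercased query just has to land on one of those grams. A sketch of that token-set logic (plain Python, not an Elasticsearch API):

```python
def index_grams(name, min_gram=2, max_gram=12):
    """Token set produced by ngram_index: keyword tokenizer,
    lowercase filter, then 2-12 character grams."""
    text = name.lower()
    return {text[i:j] for i in range(len(text))
            for j in range(i + min_gram, min(i + max_gram, len(text)) + 1)}

def search_gram(query):
    """Single token produced by ngram_search: keyword tokenizer + lowercase."""
    return query.lower()

print(search_gram("exxonmobil") in index_grams("ExxonMobil"))  # True
print(search_gram("exxon") in index_grams("ExxonMobil"))       # True
```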

EDIT:

One last thing I just noticed, this requirement: "ExxonMobil" should match "exxon mobil".

That probably means you will need to do something like this:

            "ngram_search": {
                "filter": [
                    "lowercase"
                ],
                "tokenizer": "whitespace"

            }

Note the use of the "whitespace" tokenizer instead of keyword. This allows the search to split the query on whitespace.
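The difference is easy to see in a sketch (assuming the same 2-12 character grams of "exxonmobil" on the index side; names here are illustrative): the keyword tokenizer keeps "exxon mobil" as one token containing a space, which no gram of "exxonmobil" will ever contain, while the whitespace tokenizer splits it into two tokens that both appear among the grams.

```python
# Index-side tokens: 2-12 character grams of the lowercased "exxonmobil"
grams = {"exxonmobil"[i:j]
         for i in range(10)
         for j in range(i + 2, min(i + 12, 10) + 1)}

def keyword_search(query):      # keyword tokenizer: whole query is one token
    return [query.lower()]

def whitespace_search(query):   # whitespace tokenizer: split on spaces
    return query.lower().split()

print(keyword_search("Exxon Mobil"))     # ['exxon mobil'] -- not in grams
print(whitespace_search("Exxon Mobil"))  # ['exxon', 'mobil'] -- both in grams
print(all(t in grams for t in whitespace_search("Exxon Mobil")))  # True
```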

6 Comments

@DavidPfeffer min_gram and max_gram set the min and max gram size. You will need to tinker with the size to find the right combination for your use case. 2/12 should be a reasonable start though
Right, but those are min and max character counts, correct? That doesn't help me because I need to combine words, not arbitrary characters inside those words.
Those counts are sizes of grams. As I said, you may need to tweak the min/max. Your requirement seems to be that exxonmobil, exxon, mobil, and exxon mobil should all match the same document, right? If so, my example above should work.
I understand that you said that the gram size must be tweaked. However, the grams are being constructed of arbitrary pieces of text within the words, not the words and only the words. I don't want "onmobi" to match.
@DavidPfeffer I see; well, that's asking Elasticsearch to know what is a "valid" word and what isn't, which would need to be defined on your end. For example, The Home Depot contains: The, Home, Depot, Pot, Me, He, and ExonMobil contains Mob, Exo, On, and possibly Bil. Remember, snowball works by using common grammatical structures (which is why waterly matches documents containing water). Maybe the word delimiter token filter can get you in the right direction: elasticsearch.org/guide/en/elasticsearch/reference/current/….
