I have a keyword field that I would like to tokenize (split on commas), but it may also contain values with "+" characters. For example:

query_string.keywords = Living,Music,+concerts+and+live+bands,News,Portland

When creating the index, the following does a nice job of splitting the keywords on commas:

{
    "settings": {
        "number_of_shards": 5,
        "analysis": {
            "analyzer": {
                "happy_tokens": {
                    "type":      "pattern",
                    "pattern":   "([,]+)"
                }
            }
        }
    },
    "mappings": {
        "post" : {
            "properties" : {
                "query_string.keywords" : {
                    "type": "string",
                    "analyzer" : "happy_tokens"
                }
            }
        }
    }
}
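
The output can be checked with the _analyze API. A minimal sketch, assuming a hypothetical index named posts and the 1.x-era API that the string mapping implies:

    # "posts" is a hypothetical index name used for illustration
    curl -XGET 'localhost:9200/posts/_analyze?analyzer=happy_tokens&pretty' \
         -d 'Living,Music,+concerts+and+live+bands,News,Portland'

This should return one token per comma-separated value, with the +'s still attached (note that the pattern analyzer also lowercases tokens by default).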

How can I add a char_filter (see below) to this to change the +'s to spaces or empty strings?

        "char_filter": {
            "kill_pluses": {
                "type": "pattern_replace",
                "pattern": "+",
                "replace": ""
            }
        }

2 Answers

I discovered the "mapping" char_filter, which can map my plus characters to spaces. After tokenizing, I used the trim token filter to strip the leftover whitespace.

The custom analyzers page in the Elasticsearch guide was a big help.

My working example is below:

{
    "settings": {
        "number_of_shards": 5,
        "index": {
            "analysis": {
                "char_filter": {
                    "plus_to_space": {
                        "type": "mapping",
                        "mappings": ["+=>\\u0020"]
                    }
                },
                "tokenizer": {
                    "split_on_comma": {
                        "type": "pattern",
                        "pattern": "([,]+)"
                    }
                },
                "analyzer": {
                    "happy_tokens": {
                        "type": "custom",
                        "char_filter": ["plus_to_space"],
                        "tokenizer": "split_on_comma",
                        "filter": ["trim"]
                    }
                }
            }
        }
    },
    "mappings": {
        "post" : {
            "properties" : {
                "query_string.keywords" : {
                    "type": "string",
                    "analyzer" : "happy_tokens"
                }
            }
        }
    }
}
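
The whole chain can be verified with the _analyze API (again assuming a hypothetical index named posts):

    # "posts" is a hypothetical index name used for illustration
    curl -XGET 'localhost:9200/posts/_analyze?analyzer=happy_tokens&pretty' \
         -d 'Living,Music,+concerts+and+live+bands,News,Portland'

which should come back with the tokens Living, Music, concerts and live bands, News, and Portland: the char_filter turns each + into a space, the tokenizer splits on commas, and the trim filter removes the leading spaces.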

1 Comment

So that was way more involved than anything I had expected. Kudos for a job well done!

You need to escape your "+", as "+" has a special meaning in regular expressions (one or more of the preceding token). The backslash itself must also be escaped inside a JSON string, and the pattern_replace char filter's parameter is named replacement, not replace:

    "char_filter": {
        "kill_pluses": {
            "type": "pattern_replace",
            "pattern": "\\+",
            "replacement": ""
        }
    }
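
Char filters always run before the tokenizer, so to apply this ahead of the comma split you only need to reference it from a custom analyzer. A minimal sketch, reusing the split_on_comma tokenizer defined in the other answer:

    "analyzer": {
        "happy_tokens": {
            "type": "custom",
            "char_filter": ["kill_pluses"],
            "tokenizer": "split_on_comma"
        }
    }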

1 Comment

The docs say "Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes", so I'm not sure if that's an issue, but I will check. Any ideas on how to add the character filter prior to the tokenizer?
