I have a keyword field that I would like to tokenize (split on commas), but it may also contain values with "+" characters. For example:

query_string.keywords = Living,Music,+concerts+and+live+bands,News,Portland

When creating the index, the following does a nice job of splitting the keywords on commas:

{
    "settings": {
        "number_of_shards": 5,
        "analysis": {
            "analyzer": {
                "happy_tokens": {
                    "type":      "pattern",
                    "pattern":   "([,]+)"
                }
            }
        }
    },
    "mappings": {
        "post" : {
            "properties" : {
                "query_string.keywords" : {
                    "type": "string",
                    "analyzer" : "happy_tokens"
                }
            }
        }
    }
}
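
The output can be checked with the _analyze API. A minimal sketch, assuming a hypothetical index named posts and the 1.x-era API that the string mapping implies:

    # "posts" is a hypothetical index name used for illustration
    curl -XGET 'localhost:9200/posts/_analyze?analyzer=happy_tokens&pretty' \
         -d 'Living,Music,+concerts+and+live+bands,News,Portland'

This should return one token per comma-separated value, with the +'s still attached (note that the pattern analyzer also lowercases tokens by default).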

How can I add a char_filter (see below) to this to change the +'s to spaces or empty strings?

        "char_filter": {
            "kill_pluses": {
                "type": "pattern_replace",
                "pattern": "+",
                "replace": ""
            }
        }

2 Answers

I discovered the "mapping" char_filter, which can map my plus characters to spaces. After tokenizing, I used the trim token filter to strip the leftover whitespace.

The custom analyzers page in the Elasticsearch guide was a big help.

My working example is below:

{
    "settings": {
        "number_of_shards": 5,
        "index": {
            "analysis": {
                "char_filter": {
                    "plus_to_space": {
                        "type": "mapping",
                        "mappings": ["+=>\\u0020"]
                    }
                },
                "tokenizer": {
                    "split_on_comma": {
                        "type": "pattern",
                        "pattern": "([,]+)"
                    }
                },
                "analyzer": {
                    "happy_tokens": {
                        "type": "custom",
                        "char_filter": ["plus_to_space"],
                        "tokenizer": "split_on_comma",
                        "filter": ["trim"]
                    }
                }
            }
        }
    },
    "mappings": {
        "post" : {
            "properties" : {
                "query_string.keywords" : {
                    "type": "string",
                    "analyzer" : "happy_tokens"
                }
            }
        }
    }
}
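
The whole chain can be verified with the _analyze API (again assuming a hypothetical index named posts):

    # "posts" is a hypothetical index name used for illustration
    curl -XGET 'localhost:9200/posts/_analyze?analyzer=happy_tokens&pretty' \
         -d 'Living,Music,+concerts+and+live+bands,News,Portland'

which should come back with the tokens Living, Music, concerts and live bands, News, and Portland: the char_filter turns each + into a space, the tokenizer splits on commas, and the trim filter removes the leading spaces.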

1 Comment

So that was way more involved than anything I had expected. Kudos for a job well done!

You need to escape your "+", as "+" has a special meaning in regular expressions (one or more of the preceding token). The backslash itself must also be escaped inside a JSON string, and the pattern_replace char filter's parameter is named replacement, not replace:

    "char_filter": {
        "kill_pluses": {
            "type": "pattern_replace",
            "pattern": "\\+",
            "replacement": ""
        }
    }
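
Char filters always run before the tokenizer, so to apply this ahead of the comma split you only need to reference it from a custom analyzer. A minimal sketch, reusing the split_on_comma tokenizer defined in the other answer:

    "analyzer": {
        "happy_tokens": {
            "type": "custom",
            "char_filter": ["kill_pluses"],
            "tokenizer": "split_on_comma"
        }
    }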

1 Comment

The docs say "Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes", so I'm not sure if that's an issue, but I will check. Any ideas on how to add the character filter prior to the tokenizer?
