4

I looking for some elegant way to sort my results first by alphabet and then by numbers.

My current solution is inserting an "~" before numbers using the next sort script, "~" is lexicographically after "z":

"sort": {
  "_script":{
      "script" : "s = doc['name.raw'].value; n = org.elasticsearch.common.primitives.Ints.tryParse(s.split(' ')[0][0]); if (n != null) { '~' + s } else { s }",
      "type" : "string"
  }
 }

but I wonder if there is a more elegant and perhaps more performant solution.

Input:

ZBA ABC ...
ABC SDK ...
123 RIU ...
12B BTE ...
11J TRE ...
BCA 642 ...

Desired output:

ABC SDK ...
BCA 642 ...
ZBA ABC ...
11J TRE ...
12B BTE ...
123 RIU ...

1 Answer 1

3

You can do the same thing at indexing time using a custom analyzer which leverages a pattern_replace character filter. It's more performant to do it at indexing than running a script sort at search time for each query.

It works in the same vein as your solution, i.e. if we detect a number, we prepend the value with a tilde ~, otherwise we don't do anything, yet we do it at indexing time and index the resulting value in the name.sort field.

PUT /tests
{
  "settings": {
    "analysis": {
      "char_filter": {
        "pre_num": {
          "type": "pattern_replace",
          "pattern": "(\\d)",
          "replacement": "~$1"
        }
      },
      "analyzer": {
        "number_tagger": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [],
          "char_filter": [
            "pre_num"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "sort": {
              "type": "string",
              "analyzer": "number_tagger",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

Then you can index your data

POST /tests/test/_bulk
{"index": {}}
{"name": "ZBA ABC"}
{"index": {}}
{"name": "ABC SDK"}
{"index": {}}
{"name": "123 RIU"}
{"index": {}}
{"name": "12B BTE"}
{"index": {}}
{"name": "11J TRE"}
{"index": {}}
{"name": "BCA 642"}

Then your query can simply look like this:

POST /tests/_search
{
  "sort": {
    "name.sort": "asc"
  }
}

And the response you'll get is:

{
  "hits": {
    "hits": [
      {
        "_source": {
          "name": "ABC SDK"
        }
      },
      {
        "_source": {
          "name": "BCA 642"
        }
      },
      {
        "_source": {
          "name": "ZBA ABC"
        }
      },
      {
        "_source": {
          "name": "11J TRE"
        }
      },
      {
        "_source": {
          "name": "12B BTE"
        }
      },
      {
        "_source": {
          "name": "123 RIU"
        }
      }
    ]
  }
}
Sign up to request clarification or add additional context in comments.

14 Comments

I like to do the change at indexing time but there is no smarter solution than add a tilde before the number? that fails to convince me.
The tilde is only added to a field you only use for sorting. Lexicographical sorting is what it is. The original input is not changed at all. The solution is the same as yours just carried out at indexing time instead of leveraging costly scripting at search time
Yes, I understand that, but I would like to find another way different than save another field with the tilde (~) before the number
May I ask what bothers you about this solution?
I can provide another solution which figures out a sort number, but that still needs to add another field just for sorting, since you want to change the way how lexicographic sorting works.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.