We have a specific use-case for our ElasticSearch instance: we store documents which contain proper names, dates of birth, addresses, ID numbers, and other related info.

We use a name-matching plugin which overrides the default scoring of ES and assigns a relevancy score between 0 and 1 based on how closely the name matches.

What we need to do is boost that score by a certain amount if other fields match. I have started to read up on ES scripting to achieve this. I need assistance on the script part of the query. Right now, our query looks like this:

{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {"match": {"Name": "John Smith"}}
      ]
    }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "doc_score": {
            "fields": {
              "Name": {"query_value": "John Smith"},
              "DOB": {
                "function": {
                  "function_score": {
                    "script_score": {
                      "script": {
                        "lang": "painless",
                        "params": {
                          "query_value": "01-01-1999"
                        },
                        "inline": "if <HERE'S WHERE I NEED ASSISTANCE>"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      },
      "query_weight": 0.0,
      "rescore_query_weight": 1.0
    }
  }
}

The Name field will always be required in a query and is the basis for the score, which is returned in the default _score field. For ease of demonstration, we'll add just one additional field, DOB, which, if matched, should boost the score by 0.1. I believe I'm looking for something along the lines of: if (query_value == doc['DOB'].value), add 0.1 to _score.

So, what would be the correct syntax to be entered into the inline row to achieve this? Or, if the query requires other syntax revision, please advise.

EDIT #1 - it's important to highlight that our DOB field is a text field, not a date field.

  • Few thoughts off the bat: (1) Rescoring only applies to the top window_size results - are you sure this is acceptable for your use case? It SOUNDS like you're trying to modify relevance based on presence of other fields, so I'd think you'd want to do that across the entire search space instead of just the top results from your original scoring. (2) I don't think you need a script here, as you should just be able to use a list of filter functions instead of script_score functions that apply a static boost if documents match some criteria. Commented Jul 10, 2019 at 14:37
  • Hi @rusnyder - yes we are intentionally only rescoring the top 100 results. And yes, we are trying to modify (boost) the relevance score based on presence of other fields. However, we place the MOST amount of weight on the name field: we want to bring back the most relevant name matches via the base query, then use the rescore query to check those results for additional fields. FYI, we first tried to solve this using function_score and doc_score only and using the weight parameter. The problem with that is that if the DOB did NOT match, it REDUCED the score. We don't want this. Commented Jul 10, 2019 at 15:17
  • Thanks for clarifying about rescoring, and interesting note regarding your previous attempts. While I'm not sure what you mean by using doc_score (unable to find that documented), I do think I have a solution that doesn't require scripting and gets your desired behavior. Effectively, you can use a bool query as your function_score query that combines all your secondary criteria in a should clause, then use an individual weight function for each criterion to set how much to add to the score for matches. I'll share a complete answer. Commented Jul 10, 2019 at 16:08
  • Ah, I believe the doc_score is proprietary to the name-matching plugin we are using. It's not a well-documented plugin hence your inability to find anything about it. It is probably irrelevant to our discussion in any case. I look forward to your solution. If the weight functions do not also REDUCE the score if the additional field doesn't match, then it will work for me. Any tinkering I did with weight also reduced the score when the field did not match, which we don't want - we want to boost only. Thanks again. Commented Jul 10, 2019 at 16:20

2 Answers

Splitting this into a separate answer, as it solves the problem differently: by using script_score as the OP proposed, instead of trying to rewrite away from scripts.

Assuming the same mapping and data as in my other answer, a scripted version of the query might look like the following:

POST /employee/_search
{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Name": "John"
          }
        },
        {
          "match": {
            "Name": "Will"
          }
        }
      ]
    }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "query": {
            "bool": {
              "should": [
                {
                  "match": {
                    "Name": "John"
                  }
                },
                {
                  "match": {
                    "Name": "Will"
                  }
                }
              ]
            }
          },
          "functions": [
            {
              "script_score": {
                "script": {
                  "source": "double boost = 0.0; if (params['_source']['State'] == 'FL') { boost += 0.1; } if (params['_source']['DOB'] == '1965-05-24') { boost += 0.3; } return boost;",
                  "lang": "painless"
                }
              }
            }
          ],
          "score_mode": "sum",
          "boost_mode": "sum"
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}

Two notes about the script:

  1. The script uses params['_source'][field_name] to access the document, which is the only way to read the original value of a text field. This is significantly slower than doc values because the stored _source has to be loaded and parsed for each document, though this penalty might not be too bad in the context of a rescore. You could instead use doc[field_name].value if the field were an aggregatable type, such as keyword, date, or something numeric.
  2. DOB here is compared directly to a string. This is possible because we're using the _source field, and the JSON for the documents has the dates specified as strings. This is somewhat brittle, since it only matches dates written in exactly this format, but it will likely do the trick.
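To make the additive behaviour of the script above concrete, here is a minimal Python sketch of the arithmetic the rescore performs per document. It assumes the rescorer's default score_mode of total (the query does not set one) and a name-match score between 0 and 1 as described in the question. The function names and the 0.8 name score are hypothetical; only the field values and the 0.1 and 0.3 boosts come from the script.

```python
def script_boost(source):
    """Mirror of the Painless script: add a fixed amount per matching field."""
    boost = 0.0
    if source.get("State") == "FL":
        boost += 0.1
    if source.get("DOB") == "1965-05-24":
        boost += 0.3
    return boost


def final_score(original_score, name_score, source,
                query_weight=0.0, rescore_query_weight=1.0):
    """Combine scores the way a rescore with score_mode 'total' would (assumed default)."""
    # boost_mode "sum": the function result is added to the function_score query's score.
    rescore_score = name_score + script_boost(source)
    return query_weight * original_score + rescore_query_weight * rescore_score


docs = [
    {"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"},
    {"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"},
    {"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"},
]

NAME_SCORE = 0.8  # hypothetical name-match score, identical for every document

for doc in docs:
    print(doc["Name"], round(final_score(NAME_SCORE, NAME_SCORE, doc), 2))
```

A document that matches neither field gets a boost of 0.0, so under these assumptions a non-matching DOB leaves the name score unchanged rather than reducing it. Because query_weight is 0, the original query score does not contribute at all.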

10 Comments

Thanks again for your efforts! I have attempted to run your query and I'm getting a script exception error. It looks like something is up with the syntax. You can see the full error here: ibb.co/frMMhKF
What version of ES are you running? (Can’t believe I didn’t ask this sooner!)
Version 6.4.2 .
Ruh roh. So apparently this problem is exclusive to ES 6.4.x: discuss.elastic.co/t/…. They refactored the script context and inadvertently removed the ability to access _source from scripts in 6.4.0. I tested in ES 6.4.2 vs. ES 6.5.2, and while it's broken in 6.4.2, it's been fixed in ES 6.5.2. This means that your options are (1) upgrade ES, or (2) use only doc['State'].value-type access in your script (which may require reindexing as keyword, unless fields like State.keyword exist already).
Aww crap. Well I'm glad it was that easy to identify why it's not working. Our plan is to upgrade to 7.2 as soon as possible, but we are waiting for a version of the plugin that is compatible with 7.2 which will be another month or so. I think we are going to have to reindex (I received an error about field type when trying to use doc.value so definitely need to reindex). Thank you so much for your help, let's leave this open and I will be back as soon as we have a chance to reindex and I can test it again.

Assuming static weights per additional field, you can accomplish this without using scripting (though you may need to use script_score for any more complex weighting). To solve your issue of directly adding to a document's original score, your rescoring query will need to be a function score query that:

  1. Composes queries for additional fields in a should clause for the function score's main query (i.e. - will only produce scores for documents matching at least one additional field)
  2. Uses one function per additional field, with the filter set to select documents with some value for that field, and a weight to specify how much the score should increase (or some other scoring function if desired)

Mapping (as template)

Adding a State and DOB field for sake of example (making sure multiple additional fields contribute to the score correctly)

PUT _template/employee_template
{
  "index_patterns": ["employee"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "Name": {
          "type": "text"
        },
        "State": {
          "type": "keyword"
        },
        "DOB": {
          "type": "date"
        }
      }
    }
  }
}

Sample data

POST /employee/_doc/_bulk
{"index":{}}
{"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"}
{"index":{}}
{"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"}
{"index":{}}
{"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"}

Query

EDIT: Updated the query to include the original query in the new function score in an attempt to compensate for custom scoring plugins.

A few notes about the query below:

  • Setting the rescorer's score_mode: max is effectively a replace here, since the newly computed function score should always be greater than or equal to the original score
  • query_weight and rescore_query_weight are both set to 1 so that the two scores are compared on the same scale during the score_mode: max comparison
  • In the function_score query:
    • score_mode: sum will add together all the scores from functions
    • boost_mode: sum will add the sum of the functions to the score of the query

POST /employee/_search
{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Name": "John"
          }
        },
        {
          "match": {
            "Name": "Will"
          }
        }
      ]
    }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "query": {
            "bool": {
              "should": [
                {
                  "match": {
                    "Name": "John"
                  }
                },
                {
                  "match": {
                    "Name": "Will"
                  }
                }
              ],
              "filter": {
                "bool": {
                  "should": [
                    {
                      "term": {
                        "State": "CA"
                      }
                    },
                    {
                      "range": {
                        "DOB": {
                          "lte": "1968-01-01"
                        }
                      }
                    }
                  ]
                }
              }
            }
          },
          "functions": [
            {
              "filter": {
                "term": {
                  "State": "CA"
                }
              },
              "weight": 0.1
            },
            {
              "filter": {
                "range": {
                  "DOB": {
                    "lte": "1968-01-01"
                  }
                }
              },
              "weight": 0.3
            }
          ],
          "score_mode": "sum",
          "boost_mode": "sum"
        }
      },
      "score_mode": "max",
      "query_weight": 1,
      "rescore_query_weight": 1
    }
  }
}
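As a worked example of how the query above combines scores, the following Python sketch replays the arithmetic on the three sample documents. It is a simplified model, not Elasticsearch itself: it assumes the Name query produces the same score in the main query and inside the function_score (which may not hold with a custom scoring plugin), and the 0.5 name score and the function names are hypothetical.

```python
from datetime import date

# (filter check, weight) pairs mirroring the two filter functions in the query.
FUNCTIONS = [
    (lambda d: d["State"] == "CA", 0.1),
    (lambda d: date.fromisoformat(d["DOB"]) <= date(1968, 1, 1), 0.3),
]


def rescored(name_score, doc, query_weight=1.0, rescore_query_weight=1.0):
    """Final score under the rescorer's score_mode 'max'."""
    weights = [weight for matches, weight in FUNCTIONS if matches(doc)]
    original = query_weight * name_score
    if not weights:
        # The bool filter excludes this document from the rescore query,
        # so it keeps its (weighted) original score.
        return original
    # score_mode "sum" adds the weights; boost_mode "sum" adds them to the query score.
    boosted = rescore_query_weight * (name_score + sum(weights))
    return max(original, boosted)


docs = [
    {"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"},
    {"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"},
    {"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"},
]

for doc in docs:
    print(doc["Name"], round(rescored(0.5, doc), 2))
```

In this model John Smith matches neither filter and keeps his original score, Will Ferrell gains 0.3 for the DOB range, and John C. Reilly gains 0.1 + 0.3 for matching both, so additional matches only ever raise the score.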

12 Comments

I forgot to mention one very important point. The DOB field is currently a text field. We have so many different variations, some not even good date formats, that we had to make it a text field for now. How would this change your answer, if at all?
Wow, this is an incredibly well-written and detailed answer. I can't wait to see if this works!
Hi @rusnyder while I wait to hear if you think we need to revise any part of your query based on the fact that the DOB field is of type text, I will tell you that I have run your query as-is against my index. The document returned on top is the document I expect (both name and DOB match exactly) but the score is being returned as 107.014 as opposed to what I would expect it to be - 1.03. I do know that when query_weight is anything other than zero while using this plugin, it is allowing the base query score (TF/IDF) to be part of the final score calculation, which we don't want. We...
I see, and sorry for submitting an answer that didn't work! I've updated the query to now include the original Name query as part of the function score, which I'm hoping plays more nicely with the custom scoring plugin. If that doesn't work, I'll craft up a new answer to assist with a scoring script.
Regarding the DOB field being indexed as text, I'd first counter with: Do yourself a favor (if possible!) and index it as a date instead! If that's not possible, then changes to my query would depend on what I was trying to accomplish. If I really needed to do date math on a text field, the only option is running scripts on the document _source, which is generally a really bad idea, but probably not terrible in a rescorer that is only running on hundreds of docs. In a script, I'd parse the date as a LocalDateTime and go from there.
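If indexing DOB as a date is the goal, the free-text values would first need to be normalized. The sketch below is one possible way to do that in Python before indexing. It is not the Painless LocalDateTime approach mentioned in the comment, and the list of formats is purely an assumption (only 01-01-1999 and 1965-05-24 style values appear in this thread), so it would need to be extended to whatever variations the real data contains.

```python
from datetime import datetime

# Hypothetical list of accepted formats. Order matters for ambiguous inputs:
# "01-02-1999" is read as month-day-year here, which is an assumption.
DOB_FORMATS = ("%Y-%m-%d", "%m-%d-%Y", "%m/%d/%Y")


def normalize_dob(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in DOB_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


for value in ("01-01-1999", "1965-05-24", "05/24/1965", "unknown"):
    print(repr(value), "->", normalize_dob(value))
```

Values that cannot be parsed come back as None, so they could be kept in the existing text field while the cleaned values go into a separate date field.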