0

I'm using Elasticsearch DSL, I'm trying to use a query result as a parameter for another query like below:

{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "request_id": {
            "query": {
              "match": {
                "processing.message": "OUT Followup Synthesis"
              }
            },
            "fields": [
              "request_id"
            ],
            "_source": false
          }
        }
      }
    }
  }
}

As you can see above I'm trying to search for sources that their request_id is not one of the request_idswith processing.message equals to OUT Followup Synthesis.

I'm getting an error with this query:

Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]

How can I achieve my goal using Elasticsearch DSL?

5
  • I don't follow. What's the relation between request_id and processing.message such that you cannot do a normal must_not query and retrieve just the sources' request_ids? Commented Feb 18, 2021 at 17:28
  • @JoeSorocin, the request_id, and processing.message are two fields in the same object. Commented Feb 18, 2021 at 17:40
  • I understand that but no request_id will be equal to OUT Followup Synthesis, right? In which case, why mention request_id in the query at all when you can directly query processing.message and simply return the matching sources' request_ids? Or am I missing something here? Commented Feb 18, 2021 at 18:18
  • 1
    As I said, I'm trying to fetch data with processing.message equals to 'IN Followup Sythesis' with their request_id doesn't appear in data with processing.message equals to 'OUT Followup Sythesis'. In SQL language: SELECT d FROM Data d WHERE d.processing.message = 'IN Followup Sythesis' AND d.request_id NOT IN (SELECT request_id FROM Data WHERE processing.message = 'OUT Followup Sythesis'); Commented Feb 18, 2021 at 20:59
  • I gotcha now! My answer will follow. Commented Feb 18, 2021 at 22:36

1 Answer 1

2

Original question extracted from the comments

I'm trying to fetch data with processing.message equals to 'IN Followup Sythesis' with their request_id doesn't appear in data with processing.message equals to 'OUT Followup Sythesis'. In SQL language:

SELECT d FROM   data d
WHERE  d.processing.message = 'IN Followup Sythesis'
       AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis'); 

Answer: generally speaking, neither application-side joins nor subqueries are supported in Elasticsearch.

So you'll have to run your first query, take the retrieved IDs and put them into a second query — ideally a terms query.


Of course, this limitation can be overcome by "hijacking" a scripted metric aggregation.

Taking these 3 documents as examples:

POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}

POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}

POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}

you could run

POST reqs/_search
{
  "size": 0,
  "query": {
    "match": {
      "processing.message": "IN Followup Sythesis"
    }
  },
  "aggs": {
    "subquery_mock": {
      "scripted_metric": {
        "params": {
          "disallowed_msg": "OUT Followup Synthesis"
        }, 
        "init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
        "map_script": """
          def req_id = params._source.request_id;
          def msg = params._source.processing.message;
          
          if (msg.contains(params.disallowed_msg)) {
            state.disallowed_request_ids.add(req_id);
            // won't need this particular doc so continue looping
            return;
          }
          
          if (state.by_request_ids.containsKey(req_id)) {
            // there may be multiple docs under the same ID
            // so concatenate them
            state.by_request_ids[req_id].add(params._source);
          } else {
            // initialize an appendable arraylist
            state.by_request_ids[req_id] = [params._source];
          }
        """,
        "combine_script": """
          state.by_request_ids.entrySet()
            .removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
          return state.by_request_ids
        """,
        "reduce_script": "return states"
      }
    }
  }
}

which'd return only the correct request:

"aggregations" : {
  "subquery_mock" : {
    "value" : [
      {
        "xyz" : [
          {
            "processing" : { "message" : "IN Followup Sythesis" },
            "request_id" : "xyz"
          }
        ]
      }
    ]
  }
}

⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".

💡 I'd recommend to test this script on a smaller set of documents before letting it target your whole index — maybe restrict it through a date range query or similar.


FYI Elasticsearch exposes an SQL API, though it's only offered through X-Pack, a paid offering.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you very much, it works, that exactly what I want.
Nice, you're welcome! Hey BTW I'm launching my elasticsearch handbook very soon and I think you'd find it useful. Let me know what you'd like to learn about and I'll let you know as soon as it's out!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.