
I have an index that collects web redirect data for various sites. I use a nested field to store the data, as shown in the mapping below:

"chain": {
    "type": "nested",
    "properties": {
      "url.position": {
        "type": "long"
      },
      "url.full": {
        "type": "text"
      },
      "url.domain": {
        "type": "keyword"
      },
      "url.path": {
        "type": "keyword"
      },
      "url.query": {
        "type": "text"
      }
    }
  }

As you can imagine, each document contains an array of URL chain entries, with one entry per web redirect. I want to get aggregations based on wildcard/regexp matches against the url.query field. Here is a sample query:

GET push_url_chain/_search
{
  "query": {
    "nested": {
      "path": "chain",
      "query": {
        "regexp": {
          "chain.url.query": "aff_c.*"
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "dataFields": {
      "nested": {
        "path": "chain"
      },
      "aggs": {
        "offers": {
          "terms": {
            "field": "chain.url.domain",
            "size": 30
          }
        }
      }
    }
  }
}

The above query does produce aggregated results, but not the ones I want. I want to see chain.url.domain aggregations only for the URLs whose query matches aff_c.*. Right now it looks at every URL in the chain of each matching document and builds the buckets by doc_count, regardless of whether that particular URL/domain matches the pattern. I hope I have explained this clearly. How do I get bucket aggregations that contain only the domains of URLs whose url.query field matches aff_c.*?

I would also like to know how I can use = or / in my wildcard or regexp queries. I get no results when I use those symbols in my queries.



1 Answer


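A nested query returns every top-level document in which at least one nested document matches the condition; the matching nested documents themselves are only available through inner_hits. The aggregation then runs over all nested documents of those top-level documents, so every domain in each matching chain ends up in the terms buckets.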

You need to put a filter aggregation inside the nested aggregation to get only the matching terms.

{
  "size": 0,
  "aggs": {
    "Name": {
      "nested": {
        "path": "chain"
      },
      "aggs": {
        "matched_doc": {
          "filter": {
            "match_phrase_prefix": {
              "chain.url.query": "abc"
            }
          },
          "aggs": {
            "domain": {
              "terms": {
                "field": "chain.url.domain",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}

Here matched_doc is the filter on the URL, and domain is the terms aggregation, which now sees only the nested documents that passed the filter.
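To tie this back to the query in the question, here is a rough sketch of the request body for GET push_url_chain/_search that keeps the original nested regexp query (so fewer documents are scanned) and repeats the same condition as a filter inside the nested aggregation. The aggregation names matched_urls and parent_docs are made up for illustration. The reverse_nested sub-aggregation is optional; it is included on the assumption that you may also want the number of top-level documents per domain rather than only the number of matching chain entries.

```json
{
  "size": 0,
  "query": {
    "nested": {
      "path": "chain",
      "query": {
        "regexp": { "chain.url.query": "aff_c.*" }
      }
    }
  },
  "aggs": {
    "dataFields": {
      "nested": { "path": "chain" },
      "aggs": {
        "matched_urls": {
          "filter": {
            "regexp": { "chain.url.query": "aff_c.*" }
          },
          "aggs": {
            "offers": {
              "terms": { "field": "chain.url.domain", "size": 30 },
              "aggs": {
                "parent_docs": { "reverse_nested": {} }
              }
            }
          }
        }
      }
    }
  }
}
```

In the response, the doc_count of each offers bucket counts matching chain entries, while parent_docs.doc_count counts the top-level documents those entries belong to.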

You can use match_phrase_prefix instead of regexp. It generally performs better, and it matches on the analyzed tokens with the last term treated as a prefix.

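The standard analyzer drops characters such as "/" and "=" when it generates tokens. regexp and wildcard are term-level queries that run against the individual indexed terms, so on a text field those characters are simply not there to match. If you want to use regexp or wildcard and look for these characters, you need to query a keyword field, not a text field.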
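One possible way to get a keyword version of the query string without giving up the text field is a multi-field. The fragment below is only a sketch: the sub-field name raw is hypothetical, and it assumes the body is sent with PUT push_url_chain/_mapping. As far as I know, documents indexed before the change do not get the new sub-field until they are reindexed or updated.

```json
{
  "properties": {
    "chain": {
      "type": "nested",
      "properties": {
        "url.query": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```

With that sub-field in place, a wildcard filter can include = or / literally. Note that a wildcard pattern on a keyword field has to match the whole stored value, hence the leading and trailing *; leading wildcards can be slow on large indices.

```json
{
  "size": 0,
  "aggs": {
    "dataFields": {
      "nested": { "path": "chain" },
      "aggs": {
        "matched_urls": {
          "filter": {
            "wildcard": {
              "chain.url.query.raw": { "value": "*aff_c=*" }
            }
          },
          "aggs": {
            "offers": {
              "terms": { "field": "chain.url.domain", "size": 30 }
            }
          }
        }
      }
    }
  }
}
```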
