1

I think it's best if I describe my intent and then break it down into code.

  1. I want users to be able to write the complex queries that query_string offers, should they choose to: for example 'AND', 'OR', '~', etc.
  2. I want fuzziness in effect, which has led me to do things I feel dirty about, like sending "#{query}~" to ES. In other words, I am specifying a fuzzy query on the user's behalf, because we offer transliteration and the exact spelling can be hard to get right.
  3. At times, users search for several words that are supposed to form a phrase. query_string searches them individually and not as a phrase. For example, for 'he who will' the top match should be the one where those three words appear in that order, followed by everything else.
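One thing worth noting about point 2: in query_string syntax a trailing ~ applies only to the term it is attached to, so "#{query}~" makes just the last word fuzzy. A minimal sketch of a per-term alternative follows, in Python for illustration only. The function name and the list of characters treated as "the user is writing their own syntax" are assumptions, not part of any Elasticsearch client.

```python
import re

# Boolean operators recognised by query_string; left untouched.
OPERATORS = {"AND", "OR", "NOT", "&&", "||", "!"}

# A token is either a quoted phrase (plus any suffix such as ~2) or a run of non-spaces.
TOKEN = re.compile(r'"[^"]*"\S*|\S+')

# Assumption: any of these characters means the user wrote explicit query syntax.
SPECIAL = set('~*?:^()[]{}"+-\\/')


def add_default_fuzziness(query: str) -> str:
    """Append ~ to each plain term, leaving operators, phrases and explicit syntax alone."""
    out = []
    for token in TOKEN.findall(query):
        if token in OPERATORS or any(ch in SPECIAL for ch in token):
            out.append(token)
        else:
            out.append(token + "~")
    return " ".join(out)


print(add_default_fuzziness("inna alatheena"))
```

With this sketch, "inna alatheena" becomes "inna~ alatheena~", while a query such as inna AND "he who will" keeps its operator and quoted phrase as typed.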

Current query:

{
  "indices_boost": {},
  "aggregations": {
    "by_ayah_key": {
      "terms": {
        "field": "ayah.ayah_key",
        "size": 6236,
        "order": {
          "average_score": "desc"
        }
      },
      "aggregations": {
        "match": {
          "top_hits": {
            "highlight": {
              "fields": {
                "text": {
                  "type": "fvh",
                  "matched_fields": [
                    "text.root",
                    "text.stem_clean",
                    "text.lemma_clean",
                    "text.stemmed",
                    "text"
                  ],
                  "number_of_fragments": 0
                }
              },
              "tags_schema": "styled"
            },
            "sort": [
              {
                "_score": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "include": [
                "text",
                "resource.*",
                "language.*"
              ]
            },
            "size": 5
          }
        },
        "average_score": {
          "avg": {
            "script": "_score"
          }
        }
      }
    }
  },
  "from": 0,
  "size": 0,
  "_source": [
    "text",
    "resource.*",
    "language.*"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "inna alatheena",
            "fuzziness": 1,
            "fields": [
              "text^1.6",
              "text.stemmed"
            ],
            "minimum_should_match": "85%"
          }
        }
      ],
      "should": [
        {
          "match": {
            "text": {
              "query": "inna alatheena",
              "type": "phrase"
            }
          }
        }
      ]
    }
  }
}

Note: searching for alatheena without the ~ returns nothing, although I have allatheena in the index. So I must do a fuzzy search.

Any thoughts?

2 Answers

3

I see that you're doing ES indexing of Qur'anic verses, +1 ...

Much of your problem domain, if I understood it correctly, can be solved simply by storing lots of transliteration variants (and permutations of how they combine) in a separate field on your Aayah documents.

First off, you should make a char filter that replaces all double letters with single letters: [aa] => [a], [ll] => [l].
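A rough sketch of what such a char filter could look like, using Elasticsearch's pattern_replace char filter. The variable and function names are made up for illustration, and restricting the pattern to ASCII letters is an assumption about the transliteration alphabet. The small Python function applies the same regex locally so the effect is easy to check.

```python
import json
import re

# Settings fragment for a pattern_replace char filter (hypothetical name below).
# The pattern matches a letter followed by one or more repeats of itself.
COLLAPSE_DOUBLES_CHAR_FILTER = {
    "type": "pattern_replace",
    "pattern": "([A-Za-z])\\1+",
    "replacement": "$1",
}


def collapse_doubles(text: str) -> str:
    """Apply the same regex locally: 'll' -> 'l', 'ee' -> 'e', and so on."""
    return re.sub(COLLAPSE_DOUBLES_CHAR_FILTER["pattern"], r"\1", text)


print(json.dumps({"char_filter": {"collapse_doubles": COLLAPSE_DOUBLES_CHAR_FILTER}}, indent=2))
print(collapse_doubles("allatheena"), collapse_doubles("alatheena"))
```

Both "allatheena" and "alatheena" come out as "alathena", so they would match each other as long as the filter is part of the analyzer used at index time and at search time.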

Maybe also make a separate field containing all of [a, e, i] (because of their "vocative"/transcribal ambiguity) replaced with or something similar, and do the same while querying in order to get as many matches as possible...
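To make the vowel idea concrete, here is a minimal sketch that folds a, e and i into one placeholder before indexing into the extra field and before querying it. The "_" placeholder and the function name are assumptions chosen for illustration; any character that never occurs in the transliteration would do.

```python
# Assumption: "_" is a hypothetical placeholder, not something the answer specifies.
VOWEL_PLACEHOLDER = "_"

# Map each ambiguous vowel to the placeholder.
FOLD_TABLE = str.maketrans({"a": VOWEL_PLACEHOLDER, "e": VOWEL_PLACEHOLDER, "i": VOWEL_PLACEHOLDER})


def fold_vowels(text: str) -> str:
    """Lowercase the text and replace every a, e or i with the placeholder."""
    return text.lower().translate(FOLD_TABLE)


print(fold_vowels("allatheena"))
print(fold_vowels("allathiina"))
```

The two spellings fold to the same string, which is the point: a query folded the same way would hit the folded field regardless of which vowel the user picked.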

Also, TH in "allatheena" (which as a footnote may really be Dhaal, Thaa, Zhaa, Taa+Haa, Taa+Hhaa, Ttaa+Hhaa transcribed ...) should be replaced by something, or both the Dhaal AND the Thaa should be transcribed multiple times.

Then, because it's Qur'anic script, all Alefs without diacritics, Hamza, Madda, etc., should be treated as Alef (or Hamzat) ul-Wasl, and that should also be considered when indexing / searching, because of Waqf / Wasl in reading Arabic. (Consider all the Wasls in the first Aayah of Surat Al-Alaq, for example.)

Dunno if this is answering your question in any way, but I hope it's of some assistance in implementing your application nonetheless.

1 Comment

Dunno if this counts as self-promotion (but this is such a niche area so who cares), here is some stuff I've done in the same problem domain: github.com/bjorn-ali-goransson/arabic-transliteration github.com/learnarabic/learnarabic.github.io

2

You should use a Dis Max Query to achieve that. The Elasticsearch documentation describes it as:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.

This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost.

A quick example of how to use it:

POST /_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "text": {
              "query": "inna alatheena",
              "type": "phrase",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "text": {
              "query": "inna alatheena",
              "fuzziness": "AUTO",
              "boost": 3
            }
          }
        },
        {
          "query_string": {
            "default_field": "text",
            "query": "inna alatheena"
          }
        }
      ]
    }
  }
}

It runs all of the subqueries, and each document is scored by its best-matching subquery, plus tie_breaker times the scores of any other subqueries that also matched. So just express your rules as subqueries with suitable boosts, and you should get what you wanted.
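To see how tie_breaker and boost affect ranking, the scoring rule can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not Elasticsearch's actual implementation, and the function name is made up.

```python
def dis_max_score(subquery_scores, tie_breaker=0.0, boost=1.0):
    """Combine per-subquery scores for one document the way dis_max is described.

    A score of None means that subquery did not match the document.
    Returns None when no subquery matched.
    """
    matching = [s for s in subquery_scores if s is not None]
    if not matching:
        return None
    best = max(matching)
    others = sum(matching) - best
    # Best score wins; the rest contribute only a tie_breaker-sized fraction.
    return boost * (best + tie_breaker * others)


# Exact phrase matched (5.0), the second subquery did not, query_string matched (1.0).
print(dis_max_score([5.0, None, 1.0], tie_breaker=0.7))
```

With tie_breaker at 0 only the best subquery counts; with tie_breaker at 1 the result is the plain sum of all matching subqueries, which is closer to how bool/should behaves.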

2 Comments

This is awesome! I want to try this tonight and let you know how it works out. One other question: the match_phrase query does not support fuzziness, so how can I make it do so? For example, it will not return anything with just 'inna alatheena' because it's actually 'inna allatheena'.
@MohamedElMahallawy You're right. Fuzzy queries don't work with phrase queries. All I can think of would be using a char_filter to normalize ll into l. Or even better, use the phrase suggester to enable Did You Mean functionality.
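
For the Did You Mean route, a request body for the phrase suggester might look roughly like the sketch below. The suggestion name did_you_mean, the helper function and the choice to run it directly against the text field are assumptions; in practice the phrase suggester is usually pointed at a dedicated shingle sub-field.

```python
import json


def did_you_mean_request(text, field="text", size=1):
    """Build a search body that asks only for phrase suggestions (no hits)."""
    return {
        "size": 0,
        "suggest": {
            "text": text,
            "did_you_mean": {  # hypothetical suggestion name
                "phrase": {
                    "field": field,
                    "size": size,
                    # Candidate terms are drawn from the same field.
                    "direct_generator": [
                        {"field": field, "suggest_mode": "always"}
                    ],
                }
            },
        },
    }


print(json.dumps(did_you_mean_request("inna alatheena"), indent=2))
```

The suggested text could then be shown to the user or fed back into the phrase subquery of the dis_max query.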
