7

I am trying to create a product search for my site, where a user can search for products in multiple languages and (hopefully) get fuzzy search results if there is no exact match.

  • I have a pro_search table that has columns id, pro_id, en, de, es, fr, it.
  • The pro_id column refers to the id of the products in their own table.
  • The en, de, es, fr, it columns have the translated meta of each product in various languages.
  • The meta is just keywords seperated by spaces
  • $term is the search term.
  • $lang refers to the users chosen language

So first I do a basic 'LIKE' SQL query to see if there are matches, if there are no results from this I query all the products and create an array sorted by their similarity using the similar_text() function

For example I search 'shirt' this is fine if the meta for this product just includes the word 'shirt', but if the meta includes 'blue branded tshirt' this is being more descriptive and gives the user a chance to search by brand but means that the search will more than likely go fuzzy rather than be found with a LIKE SQL query.

This is kind of working but I was wondering how this could be improved, is there a better way of searching or how do people normally do it? Should I be splitting the meta into each individual keywords and try to see how many words match rather than matching the term to the whole meta?

    $ids = [];

    $params = ['%'.$term.'%'];
    $sql = "SELECT * FROM pro_search WHERE $lang LIKE ?";
    $stmt = DB::run($sql,$params);

    $count = $stmt->rowCount();
    if($count > 0){

        // product search
        while ($row = $stmt->fetch(PDO::FETCH_ASSOC)){
            $id = $row["pro_id"];
            array_push($ids,$id);
        }
        show_products($ids);

    }else{

        // product fuzzy search
        $sql = "SELECT * FROM pro_search";
        $stmt = DB::run($sql);
        while ($row = $stmt->fetch(PDO::FETCH_ASSOC)){
            $id = $row["pro_id"];
            $result = $row[$lang];
            similar_text($term,$result,$similarity);
            $similar_array[$similarity][] = $id;
        }

        $closest_match = array_keys($similar_array);
        rsort($closest_match);
        $match_count = count($closest_match);

        for($i=0; $i<$match_count; $i++){
            foreach($similar_array[$closest_match[$i]] as $id){
                array_push($ids,$id);
            }
        }
        show_products($ids);
    }

I have asked similar questions before and people have pointed me to different ways of comparing the term against the meta (such as levenshtein), but everything I've seen has been comparing two simple words (like apples and oranges) and this just isn't good enough for a real life application with thousands of products and a user could search for literally anything (as in $term='literally anything';)

Key Questions:

  • Should my meta have just the product name or multiple relevant keywords (too many keywords means an individual word is less similar to the whole)?
  • If I have multiple keywords in the meta should I be taking each individual keyword and comparing it against the search term?
  • Also would it be possible to maybe have negative keywords for individual products.
5
  • Have you thought about using a database only for this requirement? Elasticsearch might be able to handle all this with higher performance than writing custom code. Having multiple concurrent visits on that search will otherwise result in really high loading times Commented Apr 5, 2019 at 8:55
  • 2
    "How do people normally do it?" - the best choice is to build an Elastic index, it has a lot of features you simply cannot achieve in SQL. We've tried SQL in our project, but struggled with typos, aliases, rankings, etc, Elastic has built-in features for natural language processing. Commented Apr 5, 2019 at 9:05
  • I've never used elastic search before but will look into it although I don't like having too many dependencies Commented Apr 5, 2019 at 9:25
  • Which RDBMS are you using? Some engines offer functionalities that can help you out if you don't want to pull in and pay for 3rd parties. SQL Server has Full-Text Search, so does PostgreSQL, and MySQL. Commented Apr 5, 2019 at 9:59
  • I am using mySQL, thanks I will look Full-Text Search as well. I will probably end up testing a combination of things Commented Apr 5, 2019 at 10:03

3 Answers 3

7
+50

You are looking for Full-Text Searches WITH QUERY EXPANSION

MySQL supports text searching by using the LIKE operator and regular expression. However, when the text column is large and the number of rows in a table is increased, using these methods has some limitations:

  • Performance: MySQL has to scan the whole table to find the exact text based on a pattern in the LIKE statement or pattern in the regular expressions.
  • Flexible search: with the LIKE operator and regular expression searches, it is difficult to have a flexible search query e.g., to find product whose description contains car but not classic.
  • Relevance ranking: there is no way to specify which row in the result set is more relevant to the search terms.

Because of these limitations, MySQL extended a very nice feature so-called full-text search. Technically, MySQL creates an index from the words of the enabled full-text search columns and performs searches on this index. MySQL uses a sophisticated algorithm to determine the rows matched against the search query.

To do that, the columns that will be used for search must be in TEXT type and index of type FULLTEXT, index can be given using ALTER TABLE or CREATE INDEX and if you are using phpMyAdmin to manage your databases, you can do that by going to the Structure of that table, then click on More under Action of that column and choose Fulltext.

After that you can performe a search using MATCH AGAINST syntax. MATCH() takes the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform.

Full-Text Searches WITH QUERY EXPANSION:

In some cases, users want to search for information based on the knowledge that they have. Users use their experience to define keywords to search for information, and typically those keywords are too short.

To help users to find information based on the too-short keywords, MySQL full-text search engine introduces a concept called query expansion.

The query expansion is used to widen the search result of the full-text searches based on automatic relevance feedback (or blind query expansion). Technically, MySQL full-text search engine performs the following steps when the query expansion is used:

  • First, MySQL full-text search engine looks for all rows that match the search query.
  • Second, it checks all rows in the search result and finds the relevant words.
  • Third, it performs a search again based on the relevant words instead of the original keywords provided by the users.

The following example shows you how to search for a product whose product name or meta contains at least one word (shirt tshirt).

SELECT * FROM products WHERE MATCH(product_name,product_meta) AGAINST('shirt tshirt' WITH QUERY EXPANSION)

You can read more info in MYSQL document (the link at the beginning of the answer) and here

Also don't miss How Fine-Tuning MySQL Full-Text Search

Sign up to request clarification or add additional context in comments.

Comments

3

You can use SOUNDEX in sql

SELECT * FROM users 
           WHERE SOUNDEX(job) 
LIKE CONCAT('%',SUBSTRING(SOUNDEX('Manual worker'),2),'%');

And things like Manual worka will work. You just need to ajust the value (currently 2) to match your need.

I see that you already tried the Levenshtein algorithm but you should have an eye on this adaptation (which is also compatible with UTF-8 strings)

For my case, the soundex was more efficient, it will depend on how your users will interact with your application.


But as said in the comment, third parties like ElasticSearch or Algolia can be much more efficient.

For ma part I never used it because the company do not allow us to use third parties softwares. That's why I tried both Levensthein and Soundex

3 Comments

I'm not sure you get what I'm asking, I've looked into levenshtein and soundex previously but that article on soundex is like what I said just comparing two simple words (SOUNDEX('Sure') ,SOUNDEX('Shore')). Say I have products: manual worker, electric worker, manual gearbox, instruction manual Each of these has meta which includes that title and other relevant keywords. I want a user to be able to search for manual or worker in a different language and when it is misspelled.
That's what I tried to explain. By combining Levenshtein and Soundex this could work with your use case. But maybe you will have too much false positive. But if you have the possibility, you will have better result by using Algolia or 3rd parties like that. Sorry if my answer was not clear
Sorry, yeah it's not a very clear question to begin with but don't know how I can be clearer. I'm not sure about comparing an individual term against an array of keywords or against all keywords as a string in such a way that I get the best results without completely slowing down my site
3

You can think a bit differently if you are still designing the system. In terms of search just do exact search and do in db as suggested earlier because that is much faster - but "learn from each interaction".

  • User types in some term
  • You do exact search, if found great.
  • if not do fuzzy search for each part of the term entered. Still not found, you do soundex. You are trying to find something! But present a long list to the user to filter through.
  • Eventually user selects one. As soon as they do, you add the term they entered to the product they selected.

So basic idea being that you are learning from each interaction and enriching your search set. Also whenever a term is user and user actually clicks on your item, you keep a count on that term-product association as your confidence in that term-product association improved.

Similarly whenever you are present an option, very easily user should be able to say "Not this" and then guide them through a pre-selected hierarchy of your items and eventually when they select one, you save their search term to that product.

So over a few months if you are getting enough users you will have a rich organic search term dataset for your product category with confidence level on each term.

7 Comments

Hi, thanks for your answer, this sounds great in theory but I'm not sure how well it would work in practice. For example, if I add terms to the meta based on the click through rate, I might end up getting bad spellings etc being added to the meta, this will just mean that the original exact search is more likely to go fuzzy because there's so many random misspellings in the meta. Or how would you do the initial search to begin with? Like I am comparing a term against a, comma, seperated, list, of, keywords is there a better way to compare the term against individual keywords
actually misspellings are a good example. misspelled items I think should go in meta. Your objective is not to be a 5th grade English teacher but that you provide results to customers efficiently. So if enough people misspell the same way - you have helped all of them. I would not even consider this as "meta-data" - more like search-data. meta-data can be more formal categorization of the product. Just see google, amazon, etc. how misspelled searches work quite well too.
Thanks again for your comment, but how would you do the initial search then if what i'm searching contains multiple misspellings? A SQL LIKE query will return less actual products the more extra terms there are in the DB. As in for the initial search before it goes fuzzy?
If your users agree with this "poor man AI approach", then say that misspelled searches will become valid "over time organically". All inventory systems have some categorization of items. Your organization probably has that too - so your default lookup can be simply exact searches on those. So no misspelled items are ok on day 1. As a person like me starts using the system your "alternate spellings" and association with the products will start growing along with "confidence level".
How would you go about storing the search terms and what product the user clicks on. All I can think of is like an associative array where each word is added to the product when they click on it from a search and then a score added to each word depending on how many times it gets clicked, but this brings me back to my original issue of when i do a new search I need to query all the products to get all of the arrays of terms that have matched the products and that include that word and then order them by the score which is not going to be really efficient
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.