I have two tables, tab1 and tab2, with data as follows:

tab1:

(image of the tab1 sample data)

tab2:

(image of the tab2 sample data)

The item descriptions in tab1 and tab2 do not match exactly. Is there any way to join these two tables to fetch the customer IDs?

Thanks

2 Answers

Try the query below:

#standardSQL
CREATE TEMPORARY FUNCTION similarity(Text1 STRING, Text2 STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };
  var Levenshtein = {
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;
      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;
      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }
      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;
        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }
        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }
      return nextCol;
    }
  };

  var the_Text1, the_Text2;
  try {the_Text1 = decodeURI(Text1).toLowerCase();} catch (ex) {the_Text1 = Text1.toLowerCase();}
  try {the_Text2 = decodeURI(Text2).toLowerCase();} catch (ex) {the_Text2 = Text2.toLowerCase();}
  // edit distance relative to the length of the first text: lower means more similar
  return Levenshtein.get(the_Text1, the_Text2) / the_Text1.length;
""";

SELECT *, (
  SELECT t1.Item_description
  FROM `project.dataset.tab1` t1
  ORDER BY similarity(t2.Item_description, REPLACE(t1.Item_description, '|', ', ')) 
  LIMIT 1
  ) matched_description
FROM `project.dataset.tab2` t2

Applied to the sample data from your question, the result is:

Row Customer_id Item_description                                        matched_description
1   1001        Item Lenovo x1 Yoga, i7 14" is delivered                Lenovo x1 Yoga|i7 14"    
2   1002        Lenovo x1 Yoga, i5 13" is delivered to customer         Lenovo x1 Yoga|i5 13"    
3   1003        Lenovo Yoga, i7 14" is delivered to customer@1003       Lenovo Yoga|i7 14"   
4   1004        Item lenovo x1 yoga, i7 14" is delivered successfully   Lenovo x1 Yoga|i7 14"    
5   1005        Item Lenovo x1 Yoga, i7 14" is delivered@1005           Lenovo x1 Yoga|i7 14"    
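
To see how the score behaves outside BigQuery, here is a minimal sketch in plain JavaScript. It is not the UDF above, only a compact re-statement of the same idea: Levenshtein distance divided by the length of the first text, so the lowest value wins. The names `levenshtein` and `distanceRatio` are made up for this sketch, the `decodeURI` step is left out, and the empty-string guard is an addition that the UDF does not have.

```javascript
// Two-row Levenshtein distance (illustrative re-statement, not the UDF above).
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      // deletion, insertion, substitution
      cur.push(Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = cur;
  }
  return prev[b.length];
}

// Distance relative to the first text's length; lower means more similar.
function distanceRatio(text1, text2) {
  const t1 = text1.toLowerCase();
  const t2 = text2.toLowerCase();
  // Guard added for this sketch: avoid dividing by zero on an empty first text.
  if (t1.length === 0) return t2.length === 0 ? 0 : Infinity;
  return levenshtein(t1, t2) / t1.length;
}

// Pick the closest tab1-style description for one tab2-style description.
const target = 'Item Lenovo x1 Yoga, i7 14" is delivered';
const candidates = ['Lenovo x1 Yoga, i7 14"', 'Lenovo x1 Yoga, i5 13"', 'Lenovo Yoga, i7 14"'];
const best = candidates.reduce((a, b) =>
  distanceRatio(target, a) <= distanceRatio(target, b) ? a : b);
console.log(best);
```

Because the score is only relative, the closest candidate is always returned, even when nothing in tab1 is a sensible match. A cutoff on the ratio may be worth adding; the right value would depend on your data.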
1 Comment

There is no guarantee that the tab1 item description always contains a pipe symbol. The item description may look like 'Acer x90 gaming laptop - 16 '.
I would use regex to tokenize the features that make each description unique.

with Tab1x as (
  select 
    Item_description,
    ifnull(regexp_extract(Item_description,r'([x][0-9])'),'none') as xspec,
    ifnull(regexp_extract(Item_description,r'([i][0-9])'), 'none') as ispec,
    ifnull(regexp_extract(Item_description,r'([0-9]{2}\")'), 'none') as size
  from Tab1
),
Tab2x as (
  select 
    Customer_id,
    Item_description,
    ifnull(regexp_extract(Item_description,r'([x][0-9])'),'none') as xspec,
    ifnull(regexp_extract(Item_description,r'([i][0-9])'), 'none') as ispec,
    ifnull(regexp_extract(Item_description,r'([0-9]{2}\")'), 'none') as size
  from Tab2
)
select 
  Tab1x.Item_description as Tab1_Item_description,
  Tab2x.Item_description as Tab2_Item_description,
  Tab2x.Customer_id
from Tab1x
left join Tab2x using(xspec,ispec,size)

Note, I didn't touch Lenovo or Yoga, but if your real data set has multiple brands/models, you will need to take care of that in a similar fashion.
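
To check what the three patterns pull out of a given description before running the join, the same extraction can be sketched in plain JavaScript. The function names `extractSpecs` and `joinKey` are hypothetical and exist only in this sketch; the patterns are the ones from the query above, with 'none' standing in for a missing feature.

```javascript
// Extract the same three features as the regexes in the query above.
// Returns 'none' for a feature that is not found, mirroring the IFNULL calls.
function extractSpecs(description) {
  const first = (re) => {
    const m = description.match(re);
    return m ? m[0] : 'none';
  };
  return {
    xspec: first(/x[0-9]/),   // e.g. x1
    ispec: first(/i[0-9]/),   // e.g. i7
    size: first(/[0-9]{2}"/), // e.g. 14"
  };
}

// Combine the features into one key; two rows join when their keys are equal.
function joinKey(description) {
  const s = extractSpecs(description);
  return [s.xspec, s.ispec, s.size].join('|');
}

console.log(extractSpecs('Item Lenovo x1 Yoga, i7 14" is delivered'));
console.log(joinKey('Lenovo Yoga|i7 14"'));
```

The patterns are case-sensitive and take the first match only, so a description that mentions two sizes or two processor codes would be keyed on whichever comes first.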

1 Comment

Is there any way to split the item description from tab1 into a set of words without special characters, and then check whether all of those words are present in the tab2 description?
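
One way to picture the word-based idea from this comment is the rough JavaScript sketch below. It is an assumption about what such a check could look like, not something taken from either answer: `tokenize` and `containsAllWords` are hypothetical names, and a "word" is assumed to be any run of ASCII letters or digits after lowercasing.

```javascript
// Split a description into lowercase words, dropping every special character.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// True when every word of the short (tab1-style) description
// also occurs in the long (tab2-style) description.
function containsAllWords(shortDescription, longDescription) {
  const words = new Set(tokenize(longDescription));
  return tokenize(shortDescription).every((w) => words.has(w));
}

const delivered = 'Item Lenovo x1 Yoga, i7 14" is delivered';
console.log(containsAllWords('Lenovo x1 Yoga|i7 14"', delivered));
console.log(containsAllWords('Lenovo x1 Yoga|i5 13"', delivered));
console.log(tokenize('Acer x90 gaming laptop - 16 '));
```

Note that a shorter description such as 'Lenovo Yoga|i7 14"' would also pass this check against the x1 rows, since all of its words appear there. If several tab1 rows match, preferring the one with the most words is one possible tie-break.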
