how to join two tables with condition may contains regex condition or array condition

Question

I have two tables tab1 and tab2 and the data like as follows

tab1:

tab2:

Here item description in tab1 and tab2 is not matching is there any way to join these two tables to fetch the customer ids

Thanks

Mikhail Berlyant · Accepted Answer · 2020-02-06 21:34:10Z

Try below

#standardSQL
CREATE TEMPORARY FUNCTION similarity(Text1 STRING, Text2 STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };
  var Levenshtein = {
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;
      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;
      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }
      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;
        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }
        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }
      return nextCol;
    }
  };

  var the_Text1;
  try {the_Text1 = decodeURI(Text1).toLowerCase();} catch (ex) {the_Text1 = Text1.toLowerCase();}
  try {the_Text2 = decodeURI(Text2).toLowerCase();} catch (ex) {the_Text2 = Text2.toLowerCase();}
  return Levenshtein.get(the_Text1, the_Text2) / the_Text1.length;
""";

SELECT *, (
  SELECT t1.Item_description
  FROM `project.dataset.tab1` t1
  ORDER BY similarity(t2.Item_description, REPLACE(t1.Item_description, '|', ', ')) 
  LIMIT 1
  ) matched_description
FROM `project.dataset.tab2` t2

If to apply to sample data from your question - result will be

Row Customer_ld Item_description                                        matched_description  
1   1001        Item Lenovo x1 Yoga, i7 14" is delivered                Lenovo x1 Yoga|i7 14"    
2   1002        Lenovo x1 Yoga, i5 13" is delivered to customer         Lenovo x1 Yoga|i5 13"    
3   1003        Lenovo Yoga, i7 14" is delivered to customer@1003       Lenovo Yoga|i7 14"   
4   1004        Item lenovo x1 yoga, i7 14" is delivered successfully   Lenovo x1 Yoga|i7 14"    
5   1005        Item Lenovo x1 Yoga, i7 14" is delivered@1005           Lenovo x1 Yoga|i7 14"

There is no guarantee that tab1 item decription always contains pipe symbol. The item description may like 'Acer x90 gaming laptop - 16 '

rtenha · Accepted Answer · 2020-02-06 18:02:28Z

0

I would use regex to tokenize the features that make each description unique.

with Tab1x as (
  select 
    Item_description,
    ifnull(regexp_extract(Item_description,r'([x][0-9])'),'none') as xspec,
    ifnull(regexp_extract(Item_description,r'([i][0-9])'), 'none') as ispec,
    ifnull(regexp_extract(Item_description,r'([0-9]{2}\")'), 'none') as size
  from Tab1
),
Tab2x as (
  select 
    Customer_id,
    Item_description,
    ifnull(regexp_extract(Item_description,r'([x][0-9])'),'none') as xspec,
    ifnull(regexp_extract(Item_description,r'([i][0-9])'), 'none') as ispec,
    ifnull(regexp_extract(Item_description,r'([0-9]{2}\")'), 'none') as size
  from Tab2
)
select 
  Tab1x.Item_description as Tab1_Item_description,
  Tab2x.Item_description as Tab2_Item_description,
  Tab2x.Customer_id
from Tab1x
left join Tab2x using(xspec,ispec,size)

Note, I didn't touch Lenovo or Yoga, but if your real data set has multiple brands/models, you will need to take care of that in a similar fashion.

answered Feb 6, 2020 at 18:02

rtenha

3,6281 gold badge8 silver badges20 bronze badges

1 Comment

kalyan4uonly Over a year ago

Is there any way that we can split item description from tab1 into set of words with out special characters and we can search if all the words present in tab2

Collectives™ on Stack Overflow

how to join two tables with condition may contains regex condition or array condition

2 Answers 2

1 Comment

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related