1

I have a generic requirement to group data and get the common string out of it. I am not sure which technology will work for this specific requirement . Hence mentioned it as generic. I currently have access to SQL,Unix commands, SAS , MSTR, Informatica, and Python.

Actual Data:

Field A| Field B | Field c
A      | 001     | Alan J
A      | 001     | Alan Walker 
A      | 001     | Walker Alan 
A      | 001     | Alexander AlanJoseph 

Required Output:

Field A | Field B | Field C 
A       | 001     | Alan

Note: Based on grouping of Field A and Field B, need to find out the common pattern among the data in Field C

Could someone please help me out on this ?

2
  • This is not a suitable problem for SQL. If you are looking for the longest common substring, then there might be appropriate libraries in an analytic tool/programming language such as SAS or Python. Commented Jun 18, 2020 at 10:41
  • This post may give you some direction- stackoverflow.com/questions/49394121/… Commented Jun 18, 2020 at 10:52

2 Answers 2

1

Welcome to SAS. Your question can be done in many ways, I give you the DOW loop version.

data have;
    input A$ 1-3 B$ 12-15 C$ 21-43;
    cards;
    A      | 001     | Alan 
    A      | 001     | Alan Walker 
    A      | 001     | Walker Alan 
    A      | 001     | Alexander AlanJoseph
    B      | 002     | Jay 
    B      | 002     | Jay Zhou
    B      | 002     | JayJay 
;
run;

proc sort;
    by A B;
run;

data want(keep=A B C);
    length string $1024.;

    do until(last.B);
        set have;
        by A B;
        string = catx('@',string,C);
        count = sum(count,1);
    end;

    do until(last.B);
        set have;
        by A B;
        do i = 1 to count;
            if find(scan(string,i,'@'),cats(C)) then match = sum(match,1);
        end;
        if count = match then output;
    end;
run;

proc print;
run;

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the response. I have liked your approach. I have updated my input data set for the first record, in that case what modification is needed in the SAS code to fetch the result? The output has to display the common word for the group of field A and field B. Thanks
@Scott77 This is very diffcult to me, perhaps I can not help. Maybe you can google it for "longest common substring". I had did it but the anwser were all about 2 strings, not for n(n>2). I will take a good think the several days.
1

Below is the solution,

Sql

select field_a, field_b, field_c from
(select distinct d1.*, row_number() over(partition by d1.field_a order by d1.field_c) 
as rnk from data d1
inner join data d2 on 
(instr(d1.field_c, d2.field_c) > 0)) where rnk=1;

https://dbfiddle.uk/?rdbms=oracle_18&fiddle=ccd9837b8e893ed42f8ade3b7e7e9d40

Informatica

Use Expression transformation, find the patter using INSTR function (you can also use regular expressions). Then the pass the ports to rank transformation and sort it be field_c and fetch top 1 values.

1 Comment

To understand the SQL : search each string in each other ((instr(d1.field_c, d2.field_c) > 0) and then keep the one you find most times in other values where rnk=1

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.