Grouping Data and finding common string

Question

I have a generic requirement to group data and extract the common string from each group. I am not sure which technology suits this requirement, so I have kept the question generic. I currently have access to SQL, Unix commands, SAS, MSTR, Informatica, and Python.

Actual Data:

Field A| Field B | Field C
A      | 001     | Alan J
A      | 001     | Alan Walker 
A      | 001     | Walker Alan 
A      | 001     | Alexander AlanJoseph

Required Output:

Field A | Field B | Field C 
A       | 001     | Alan

Note: For each group of Field A and Field B, I need to find the common pattern among the values in Field C.

Could someone please help me with this?

This is not a suitable problem for SQL. If you are looking for the longest common substring, then there might be appropriate libraries in an analytic tool/programming language such as SAS or Python.
– Gordon Linoff, Commented Jun 18, 2020 at 10:41
This post may give you some direction: stackoverflow.com/questions/49394121/…
– mkRabbani, Commented Jun 18, 2020 at 10:52

whymath · Accepted Answer · 2020-06-18 11:36:37Z

1

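Welcome to SAS. This can be done in many ways; here is a version using a DOW loop. The first loop concatenates every C value in the group into one '@'-delimited string and counts the rows. The second loop re-reads the group and outputs each C value that occurs inside every member of the group.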

data have;
    input A$ 1-3 B$ 12-15 C$ 21-43;
    cards;
    A      | 001     | Alan 
    A      | 001     | Alan Walker 
    A      | 001     | Walker Alan 
    A      | 001     | Alexander AlanJoseph
    B      | 002     | Jay 
    B      | 002     | Jay Zhou
    B      | 002     | JayJay 
;
run;

proc sort;
    by A B;
run;

data want(keep=A B C);
    length string $1024.;

    do until(last.B);
        set have;
        by A B;
        string = catx('@',string,C);
        count = sum(count,1);
    end;

    do until(last.B);
        set have;
        by A B;
        match = 0; /* reset for each row so counts do not carry over within the group */
        do i = 1 to count;
            if find(scan(string,i,'@'),cats(C)) then match = sum(match,1);
        end;
        if count = match then output;
    end;
run;

proc print;
run;


2 Comments

Scott77 Over a year ago

Thanks for the response. I like your approach. I have updated the first record of my input data set; in that case, what modification is needed in the SAS code to fetch the result? The output has to display the common word for each group of Field A and Field B. Thanks

whymath Over a year ago

@Scott77 This is very difficult for me, and I may not be able to help. You could search for "longest common substring". I did, but the answers were all about 2 strings, not n strings (n > 2). I will think it over during the next few days.
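For the n-string case mentioned in the comment above, a minimal Python sketch follows. It is not part of either answer; it assumes the rows are already loaded as tuples, and the function name `longest_common_substring` and the `rows` list are made up for illustration. It takes the shortest string in each group and tries its substrings from longest to shortest, which is a brute-force approach that should be fine for short names but may be slow on long strings.

```python
from collections import defaultdict


def longest_common_substring(strings):
    """Return the longest substring present in every string ('' if none)."""
    if not strings:
        return ""
    # Any common substring must fit inside the shortest string.
    shortest = min(strings, key=len)
    # Try candidate lengths from longest to shortest; the first hit wins.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


# Sample rows: group A is from the question, group B from the SAS answer.
rows = [
    ("A", "001", "Alan J"),
    ("A", "001", "Alan Walker"),
    ("A", "001", "Walker Alan"),
    ("A", "001", "Alexander AlanJoseph"),
    ("B", "002", "Jay"),
    ("B", "002", "Jay Zhou"),
    ("B", "002", "JayJay"),
]

# Group Field C values by (Field A, Field B).
groups = defaultdict(list)
for field_a, field_b, field_c in rows:
    groups[(field_a, field_b)].append(field_c.strip())

result = {key: longest_common_substring(values) for key, values in groups.items()}
for (field_a, field_b), common in result.items():
    print(field_a, field_b, common)
# A 001 Alan
# B 002 Jay
```

Because it compares raw characters, the match is case-sensitive and can return a fragment that is not a whole word; whether that is acceptable depends on the real data.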

Jim Macaulay · Accepted Answer · 2020-06-18 11:44:34Z

1

Below is the solution,

SQL

select field_a, field_b, field_c
from (
    select distinct d1.*,
           row_number() over (partition by d1.field_a order by d1.field_c) as rnk
    from data d1
    inner join data d2
        on instr(d1.field_c, d2.field_c) > 0
)
where rnk = 1;

https://dbfiddle.uk/?rdbms=oracle_18&fiddle=ccd9837b8e893ed42f8ade3b7e7e9d40

Informatica

Use an Expression transformation and find the pattern using the INSTR function (you can also use regular expressions). Then pass the ports to a Rank transformation, sort by field_c, and fetch the top 1 value.


1 Comment

Mickaël Bucas Over a year ago

To understand the SQL: it searches each string inside each of the others (instr(d1.field_c, d2.field_c) > 0) and then keeps the one found most often in the other values (where rnk = 1).
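The idea described in this comment (count how many values in the group contain each candidate, then keep the best one) can be sketched in Python. This is an illustration of the comment's description, not a translation of the SQL above; the function name `most_contained` is hypothetical, and the tie-break (prefer the shorter string) is an assumption.

```python
def most_contained(values):
    """Return the value that occurs inside the most members of the group.

    Ties are broken in favour of the shorter value (an assumption).
    """
    def score(candidate):
        # Number of group members that contain the candidate as a substring.
        return sum(candidate in other for other in values)

    return max(values, key=lambda v: (score(v), -len(v)))


# Group A as used in the SAS answer (first value is "Alan").
group_a = ["Alan", "Alan Walker", "Walker Alan", "Alexander AlanJoseph"]
group_b = ["Jay", "Jay Zhou", "JayJay"]

print(most_contained(group_a))  # Alan
print(most_contained(group_b))  # Jay
```

This only works when the common string appears as a complete value somewhere in the group. With the question's updated first row ("Alan J"), no value is contained in all the others, so a substring search across the whole group would be needed instead.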
