1

I have a bunch of user accounts that I need to associate with each other based on date of birth, postcode, etc.

I have the following query:

SELECT DISTINCT CONCAT_WS(' , ' ,a.user_id , GROUP_CONCAT( b.user_id SEPARATOR ' , ' ) )
FROM tbl_users_details a,
tbl_users_details b
WHERE a.user_id != b.user_id
AND a.date_of_birth = b.date_of_birth
AND a.postcode = b.postcode
AND LEVENSHTEIN_RATIO( a.last_name , b.last_name ) > 60
GROUP BY a.user_id

To demonstrate my requirements:

If accounts 1, 5, 9 and 12 meet the criteria (i.e. they are the same person), I get 4 results in the format:

1  , 5 , 9 , 12
5  , 1 , 9 , 12
9  , 1 , 5 , 12
12 , 1 , 5 , 9

Ideally I'd like just a single row: 1 , 5 , 9 , 12

Any pointers would be great.

Thanks, people.

3
  • Did you try to sort the tables? Commented Oct 20, 2011 at 15:20
  • @krankover It's the same table, so sorting doesn't matter. Commented Oct 20, 2011 at 15:22
  • @alinoz - There are MANY records for many people, and all of them are needed, so limiting the result set is not an option in this instance. Commented Oct 20, 2011 at 15:24

4 Answers

2

Can you state your requirement more clearly?

Anyway, try wrapping your query in a subquery, like:

SELECT CONCAT(user.i, ',')
FROM
(SELECT DISTINCT ...... -- your original query goes here
) AS user

Thanks, Shanmugam

1 Comment

Yep, I discovered that while you were posting ;)
1

In general, I would do something like this:

SELECT GROUP_CONCAT( user_id )
FROM tbl_users_details
GROUP BY date_of_birth, postcode, last_name

but the Levenshtein distance check makes this problematic, since there's actually no guarantee that LEVENSHTEIN_RATIO(x, y) > n and LEVENSHTEIN_RATIO(y, z) > n imply LEVENSHTEIN_RATIO(x, z) > n. (For example, what if one of your users was named "Anderson", another "Addison" and a third "Atkinson"?) You might want to consider using some other similarity estimation method that actually maps names into distinct groups, such as soundex or metaphone:

SELECT GROUP_CONCAT( user_id )
FROM tbl_users_details
GROUP BY date_of_birth, postcode, SOUNDEX(last_name)
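
To make the transitivity problem concrete, here is a small stand-alone sketch in Python (rather than SQL, so it runs without the database). It assumes LEVENSHTEIN_RATIO is computed as (1 - distance / length of the longer string) * 100; the user-defined function in the question may well use a different formula, so treat the numbers as illustrative only.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_ratio(s, t):
    # ASSUMPTION: ratio = (1 - distance / longer length) * 100.
    # The real LEVENSHTEIN_RATIO UDF may be defined differently.
    longest = max(len(s), len(t))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(s, t) / longest) * 100


names = ["Anderson", "Addison", "Atkinson"]
for i, x in enumerate(names):
    for y in names[i + 1:]:
        # Print whether each pair would pass the "> 60" test from the question.
        print(x, y, levenshtein_ratio(x, y) > 60)
```

Under that assumed formula, Anderson/Addison and Addison/Atkinson both pass the > 60 threshold while Anderson/Atkinson does not, so pairwise matching alone does not split the names into clean groups.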

1 Comment

That is very helpful. The reason Levenshtein has been used is that occasionally a user makes a minor typo on registration. For example, we have an Anderson and an Adnerson living at the same address with the same DOB, so it is pretty likely this is the same person. I'm working on a massive dataset and, to be honest, it's struggling with queries that attempt these matches en masse. But I am trying a few options. Thanks for this nugget though, it's already proving useful.
0

You can include an ORDER BY clause inside the GROUP_CONCAT function (it must come before SEPARATOR):

... GROUP_CONCAT(b.user_id ORDER BY b.user_id SEPARATOR ' , ')

1 Comment

Yes, but as the GROUP_CONCAT works on table b, this would give the same results. I need some way of ordering the combined CONCAT and GROUP_CONCAT as if it were one entity.
0

Reckon I've got it:

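SELECT GROUP_CONCAT(ida ORDER BY ida ASC SEPARATOR ' , ') ids
FROM
-- For each account, take the lowest user_id among itself and all its matches.
-- MIN() makes this deterministic; a bare LEAST() under GROUP BY picks an arbitrary b row.
(SELECT MIN(LEAST(a.user_id, b.user_id)) idbase, a.user_id ida
FROM apollo.tbl_users_details a,
apollo.tbl_users_details b
WHERE a.user_id != b.user_id
AND a.date_of_birth = b.date_of_birth
AND a.postcode = b.postcode
AND LEVENSHTEIN_RATIO( a.last_name , b.last_name ) > 60
GROUP BY a.user_id) as sub
-- Accounts that share the same lowest id end up in one row.
GROUP BY idbase;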

Running on full data set to test..
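
If the matches can chain (A matches B and B matches C, but A does not match C directly), grouping by the lowest matching id can still split one person across rows. One way around that is to export the matching (a, b) pairs from the inner join and merge them outside SQL. Below is a minimal Python sketch of that idea using union-find; this is a different technique from the query above, `group_pairs` is a hypothetical helper name, and the sample pairs are made up for illustration.

```python
def group_pairs(pairs):
    """Merge (a, b) match pairs into connected groups using union-find."""
    parent = {}

    def find(x):
        # Register unseen ids as their own group, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the representative of the merged group.
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairs: 1-5, 5-9 and 9-12 chain together; 20-21 is a separate person.
pairs = [(1, 5), (5, 9), (9, 12), (20, 21)]
for group in group_pairs(pairs):
    print(" , ".join(str(uid) for uid in group))
```

Each group is printed once, with ids in ascending order, regardless of which side of a pair an id appeared on.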
