String matching in SQL

Question

I'm not sure if this is even possible without writing some advanced algorithm, but is there a way in SQL to compare two strings and get the percentage of characters that match? Someone hand-typed a load of strings and I need to make them less unique. For example, if I have "LOT & SIGN LIGHTING", "SIGN LIGHTING", and "ELECTRICIAN", I want to loop through a list of words ("SIGN", "PLUMBING", "ELECTRIC") and return a match percentage for each, so I can replace the original if, say, it's over 85% similar.

I suspect that you might find Levenshtein distance useful. Some databases have this functionality built-in. Others have user-defined functions for it.
– Gordon Linoff, Commented Nov 5, 2015 at 19:49
Which SQL database are you using? They all have different string functions.
– Schwern, Commented Nov 5, 2015 at 20:28
Is it always the full search phrase? So, if you find an "S" and the next four letters match "SIGN", then it's a hit? Or would "MySiggy" be a 75% hit because of the fitting "Sig"? If the first, the algorithm would not be so complicated. Just find the positions of the first letter and check the next substring...
– Gottfried Lesigang, Commented Nov 5, 2015 at 20:36
@Shnugo I wouldn't trust it one way or the other. The data I'm working with is rife with misspellings and inconsistencies.
– Patrick, Commented Nov 5, 2015 at 21:10

Schwern · Accepted Answer · 2015-11-05 20:52:10Z

The SQL standard contains nothing like what you're asking for. You could write something as a stored procedure, but various SQL databases already include fuzzy matching functions that calculate the similarities and differences between strings.

The PostgreSQL fuzzystrmatch module has levenshtein(), which calculates the Levenshtein distance between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other.
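To make the definition concrete, here is a minimal sketch of the unit-cost edit distance in Python. It is not PostgreSQL's implementation, only the textbook dynamic-programming recurrence, and the function name is simply chosen to mirror levenshtein().

```python
def levenshtein(source: str, target: str) -> int:
    """Unit-cost edit distance: every insert, delete, or substitution costs 1."""
    # previous[j] is the distance between the source prefix handled so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                        # delete s_char
                current[j - 1] + 1,                     # insert t_char
                previous[j - 1] + (s_char != t_char),   # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("SIGN", "SIGN LIGHTING"))    # 9: nine characters to insert
print(levenshtein("ELECTRIC", "ELECTRICIAN"))  # 3: three characters to insert
```

Applied to the strings from the question, the distances come out as follows: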

           "LOT & SIGN LIGHTING", "SIGN LIGHTING", "ELECTRICIAN"
SIGN       15                     9                9
PLUMBING   15                     9                9
ELECTRIC   17                     11               3

As you can see, it's not terribly useful at recognizing the relationship between long and short strings, because every extra character in the longer string counts as an edit. You can weight the costs of inserting, deleting, and substituting characters to make this work better. For example, if the cost of a substitution (a mismatch) is increased to 2...

           "LOT & SIGN LIGHTING", "SIGN LIGHTING", "ELECTRICIAN"
SIGN       15                     9                11
PLUMBING   19                     13               13
ELECTRIC   21                     15               3
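
In PostgreSQL the weights are passed as extra arguments, levenshtein(source, target, ins_cost, del_cost, sub_cost). As a rough illustration of what those weights do, the sketch below reimplements the weighted distance in Python. The function name and keyword defaults are my own choices, not part of any library. With insertions and deletions at cost 1 and substitutions at cost 2, it reproduces the numbers in the table above.

```python
def weighted_levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Edit distance with separate costs for insert, delete, and substitute."""
    # Building target[:j] from an empty source takes j insertions.
    previous = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * del_cost]  # emptying source[:i] takes i deletions
        for j, t_char in enumerate(target, start=1):
            keep_or_sub = previous[j - 1] + (0 if s_char == t_char else sub_cost)
            current.append(min(
                previous[j] + del_cost,     # delete s_char
                current[j - 1] + ins_cost,  # insert t_char
                keep_or_sub,                # keep a match or pay for a substitution
            ))
        previous = current
    return previous[-1]


words = ["SIGN", "PLUMBING", "ELECTRIC"]
phrases = ["LOT & SIGN LIGHTING", "SIGN LIGHTING", "ELECTRICIAN"]
for word in words:
    print(word, [weighted_levenshtein(word, p, sub_cost=2) for p in phrases])
# SIGN [15, 9, 11]
# PLUMBING [19, 13, 13]
# ELECTRIC [21, 15, 3]
```

Because a substitution now costs as much as a delete plus an insert, only genuinely shared characters lower the distance. That is why PLUMBING moves further away from every phrase while ELECTRIC stays at 3 from ELECTRICIAN.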
