I'm not sure if this is even possible without writing some advanced algorithm, but is there a way in SQL to compare two strings and get the percentage of characters that match between them? Someone hand-typed a load of strings and I need to make them less unique. For example, if I have "LOT & SIGN LIGHTING", "SIGN LIGHTING", and "ELECTRICIAN", I want to loop through a list of words ("SIGN", "PLUMBING", "ELECTRIC") and return a % match for each, so I can replace the original if, say, it's over 85% similar.
- I suspect that you might find Levenshtein distance useful. Some databases have this functionality built in. Others have user-defined functions for it. – Gordon Linoff, Nov 5, 2015 at 19:49
- Which SQL database are you using? They all have different string functions. – Schwern, Nov 5, 2015 at 20:28
- Is it always the full search phrase? So, if you find an "S" and the next four letters match "SIGN", then it's a hit? Or would "MySiggy" be a 75% hit because of the fitting "Sig"? If the former, the algorithm would not be so complicated: just find the positions of the first letter and check the next substring... – Gottfried Lesigang, Nov 5, 2015 at 20:36
- @Shnugo I wouldn't trust it one way or the other; the data I'm working with is rife with misspellings and inconsistencies. – Patrick, Nov 5, 2015 at 21:10
1 Answer
The SQL standard contains nothing like what you're asking for. You could write something as a stored procedure, but various SQL databases already include fuzzy-matching functions that calculate the similarities and differences between strings.
PostgreSQL's fuzzystrmatch module has levenshtein(), which calculates the Levenshtein distance between two strings: basically, the number of single-character edits (insertions, deletions, substitutions) you'd need to turn one string into the other.

            "LOT & SIGN LIGHTING"   "SIGN LIGHTING"   "ELECTRICIAN"
SIGN                 15                    9                9
PLUMBING             15                    9                9
ELECTRIC             17                   11                3
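To make the definition concrete, here is a minimal Python sketch of the unit-cost Levenshtein distance. It is not PostgreSQL's implementation, just the textbook dynamic-programming version; the function name and the `phrases` list are chosen here for illustration. It reproduces the SIGN row of the table above.

```python
def levenshtein(source: str, target: str) -> int:
    """Unit-cost Levenshtein distance (illustrative sketch, not PostgreSQL's code)."""
    # previous[j] = distance between the source prefix processed so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


phrases = ["LOT & SIGN LIGHTING", "SIGN LIGHTING", "ELECTRICIAN"]
print([levenshtein("SIGN", p) for p in phrases])  # [15, 9, 9]
```

Note that "SIGN" scores 9 against both "SIGN LIGHTING", which contains it, and "ELECTRICIAN", which does not: the difference in length accounts for most of the distance.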
As the distances show, it's not terribly useful at recognizing the relationship between long and short strings: "SIGN" is no closer to "SIGN LIGHTING", which contains it, than to "ELECTRICIAN". You can weight the costs of inserting, deleting, and substituting characters to make this work better. For example, if the cost of a mismatch (a substitution) is increased to 2...

            "LOT & SIGN LIGHTING"   "SIGN LIGHTING"   "ELECTRICIAN"
SIGN                 15                    9               11
PLUMBING             19                   13               13
ELECTRIC             21                   15                3
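The weighting can be sketched in Python as well. The argument order below mirrors PostgreSQL's five-argument form, levenshtein(source, target, ins_cost, del_cost, sub_cost), but the function itself is an illustration rather than the extension's code. The `similarity_percent` helper is purely an assumption added here to get at the percentage the question asks for: it is one common way to normalize a unit-cost distance, not something fuzzystrmatch provides.

```python
def weighted_levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Levenshtein distance with per-operation costs (illustrative sketch)."""
    # Building the first j target characters from "" takes j insertions.
    previous = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * del_cost]  # erasing i source characters
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + del_cost,       # delete s_char
                current[j - 1] + ins_cost,    # insert t_char
                previous[j - 1] + (0 if s_char == t_char else sub_cost),
            ))
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Hypothetical helper: unit-cost distance scaled by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - weighted_levenshtein(a, b) / longest)


# Substitution cost 2, as in the second table.
print(weighted_levenshtein("SIGN", "ELECTRICIAN", sub_cost=2))      # 11
print(weighted_levenshtein("PLUMBING", "ELECTRICIAN", sub_cost=2))  # 13
print(weighted_levenshtein("ELECTRIC", "ELECTRICIAN", sub_cost=2))  # 3
print(round(similarity_percent("ELECTRIC", "ELECTRICIAN"), 1))      # 72.7
```

Under this normalization even "ELECTRIC" against "ELECTRICIAN" comes in below an 85% cutoff, so a threshold would probably need tuning against the real data.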