0

I'm not sure this is even possible without writing some advanced algorithm, but is there a way in SQL to compare two strings and get the percentage of characters that match? Someone hand-typed a load of strings and I need to make them less unique. For example, if I have "LOT & SIGN LIGHTING", "SIGN LIGHTING", and "ELECTRICIAN", I want to loop through a list of words ("SIGN", "PLUMBING", "ELECTRIC") and return a % for the match, so I can replace the original if, say, it's over 85% similar.

4
  • 2
    I suspect that you might find Levenshtein distance useful. Some databases have this functionality built-in. Others have user-defined functions for it. Commented Nov 5, 2015 at 19:49
  • 1
    Which SQL database are you using? They all have different string functions. Commented Nov 5, 2015 at 20:28
  • Is it always the full search phrase? So, if you find an "S" and the next four letters match "SIGN", then it's a hit? Or would "MySiggy" be a 75% hit because of the matching "Sig"? If the former, the algorithm would not be so complicated: just find the positions of the first letter and check the following substring... Commented Nov 5, 2015 at 20:36
  • @Shnugo I wouldn't trust it one way or the other; the data I'm working with is rife with misspellings and inconsistencies. Commented Nov 5, 2015 at 21:10

1 Answer

2

The SQL standard contains nothing like what you're asking for. You could write something yourself as a stored procedure, but several SQL databases already ship fuzzy-matching functions that measure the similarities and differences between strings.

The PostgreSQL fuzzystrmatch module has levenshtein(), which calculates the Levenshtein distance between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other.
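For anyone without fuzzystrmatch at hand, the distance itself can be sketched in a few lines of Python. This is a minimal dynamic-programming illustration, not PostgreSQL's implementation; the function name merely mirrors levenshtein() for readability.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn source into target."""
    # previous[j] = distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


print(levenshtein("ELECTRIC", "ELECTRICIAN"))  # 3: insert "I", "A", "N"
```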

           "LOT & SIGN LIGHTING", "SIGN LIGHTING", "ELECTRICIAN"
SIGN       15                     9                9
PLUMBING   15                     9                9
ELECTRIC   17                     9                3

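As you can see, it's not terribly useful at recognizing the relationship between long and short strings. You can weight the costs of inserting, deleting, and substituting characters to make this work better; levenshtein() accepts optional insertion, deletion, and substitution costs as extra arguments. For example, if the cost of a mismatch (a substitution) is increased to 2, with insertion and deletion left at 1...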

           "LOT & SIGN LIGHTING", "SIGN LIGHTING", "ELECTRICIAN"
SIGN       15                     9                11
PLUMBING   19                     13               13
ELECTRIC   21                     15               3
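The weighted variant, and the percentage the question asks for, can be sketched the same way. The following Python is an illustration under assumptions: `weighted_levenshtein` and `similarity_percent` are hypothetical names, the cost parameters imitate the optional insert/delete/substitute costs of levenshtein(), and dividing by the length of the longer string is just one common normalization, not something this answer prescribes.

```python
def weighted_levenshtein(source: str, target: str,
                         ins_cost: int = 1, del_cost: int = 1,
                         sub_cost: int = 1) -> int:
    """Edit distance from source to target with a separate cost per operation."""
    # Row 0: building the first j target characters from "" takes j insertions.
    previous = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * del_cost]  # reducing i source characters to "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + del_cost,
                current[j - 1] + ins_cost,
                previous[j - 1] + (0 if s_char == t_char else sub_cost),
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - weighted_levenshtein(a, b) / longest)


# Raising the substitution cost to 2 reproduces the "SIGN" vs "ELECTRICIAN" cell.
print(weighted_levenshtein("SIGN", "ELECTRICIAN", sub_cost=2))  # 11
print(round(similarity_percent("ELECTRIC", "ELECTRICIAN"), 1))  # 72.7
```

Under this normalization "ELECTRIC" against "ELECTRICIAN" scores roughly 73%, below the 85% threshold mentioned in the question, so the threshold or the normalization would likely need tuning for data like this.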