How to find strings which are similar to given string in SQL server?

Question

I have a SQL Server table which contains several string columns. I need to write an application which takes a string and searches for similar strings in that table.

For example, if I give the "مختار" or "مختر" as input string, I should get these from SQL table:

1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری

I've searched the net for a solution but found nothing useful. I've read this question, but it does not help me because:

I am using MS SQL Server, not MySQL.
My table contents are in Persian, so I can't use Levenshtein distance and similar methods.
I prefer a SQL Server-only solution, not an indexing or daemon-based solution.

The best solution would also let us sort the results by similarity, but that is optional.

Do you have any suggestions?

Thanks

keyboardP · Accepted Answer · 2011-12-26 15:16:02Z

4

MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?

SELECT * FROM [table] WHERE input LIKE N'%مختار%'

answered Dec 26, 2011 at 15:16

keyboardP


2 Comments

keyboardP Over a year ago

Yes sorry, wasn't thinking :) Just saw it.

Shahab Over a year ago

Thanks for your reply, but LIKE is not what I'm looking for. I want to match "مختر" for them too!

Gaspa79 · Accepted Answer · 2011-12-26 20:12:31Z

3

Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?

-- This part finds the string you want to search for
-- (nvarchar rather than varchar, so that Persian text is not lost)
DECLARE @MyString nvarchar(max)

SET @MyString = (SELECT [column] FROM [table]
WHERE **LOGIC TO FIND THE STRING GOES HERE**)

-- This part searches for that string and ranks matches by length difference
SELECT searchColumn, ABS(LEN(searchColumn) - LEN(@MyString)) AS Similarity
FROM [table] WHERE searchColumn LIKE N'%' + @MyString + N'%'
ORDER BY Similarity, searchColumn

The similarity part is something like the thing you posted. Strings whose length is closer to the search string's length count as "more similar" and appear higher in the results. The ABS can obviously be dropped, since every row matched by the LIKE is at least as long as the search string, but I added it just in case.

Hope that helps =-)

answered Dec 26, 2011 at 20:12

Gaspa79


8 Comments

Shahab Over a year ago

Yes, it's closer to what I'm looking for. But using length as the similarity factor is not a good idea. For example, "test" and "find" are not similar at all, but their lengths are equal.

Gaspa79 Over a year ago

But wait a second, the query that I wrote won't return "find" if you search for "test". If you search for "test" and there's "test", "find", "tester" and "testing" in your database, the result will be: test, tester, testing, in that order. If you search for "find" instead, only one result will be returned (find). If you search for "in", two results will be returned: find, testing (in that order, because of the length ordering). I don't know Persian, but if you are using an nvarchar column the result will be the same regardless of the language you use.

Shahab Over a year ago

Oh, yes, sorry, I didn't see the WHERE clause. But it still doesn't solve my problem, because I want "find" to be a match for "fnd".

Gaspa79 Over a year ago

Oooh, okay then. I didn't understand that. Here's the thing: first of all, if you want to do this using LIKE, it'll be a performance killer. Well, actually, if you do this for large character strings it'll be a performance killer anyway, due to the immense number of possible words containing a couple of given characters. Anyway, you first need to create a full-text index on your search column (right-click the table in Object Explorer, Full-Text index). After that, you'll be able to use the function CONTAINS. The syntax will be: select SearchColumn from Table where contains(SearchColumn, ' "test" '). (Continues down)

Gaspa79 Over a year ago

That thing above will do the same thing as the LIKE function using "test". Now, if you want to add wildcards, the syntax is like this: contains(SearchColumn, ' "test*" '). That's right, the * in CONTAINS is the same as % in LIKE. Knowing that, you will now need a function that returns a table according to your search needs, given a character string. For example, if you give the function "test", it should return a table with test,test*,test*,test*. You could go on with the wildcards, but your query will take a lot of time for large tables and strings. (Continues down)
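The full-text steps described in these two comments could be sketched roughly as follows. The table dbo.People, the column FullName, the catalog name and the key index PK_People are hypothetical names used only for illustration. Note that CONTAINS matches whole words, and its * wildcard is only supported as a suffix (a prefix term), so it is narrower than % in LIKE.

```sql
-- Assumption: dbo.People has a unique, single-column, non-nullable index
-- named PK_People. All object names here are hypothetical.
CREATE FULLTEXT CATALOG PeopleCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.People (FullName) KEY INDEX PK_People;
GO  -- batch separator understood by SSMS and sqlcmd

-- Rows containing the whole word "test"
SELECT FullName
FROM dbo.People
WHERE CONTAINS(FullName, N'"test"');

-- Rows containing a word that starts with "test" (prefix term)
SELECT FullName
FROM dbo.People
WHERE CONTAINS(FullName, N'"test*"');
```

The index is populated in the background, so queries run immediately after creating it may return no rows until population finishes.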


xQbert · Accepted Answer · 2011-12-26 15:23:44Z

1

Besides the LIKE operator, you can use the condition WHERE instr(columnname, search) > 0; however, this is generally slower. It returns the starting position of one string within another. Thus, searching ABCDEFG for CD returns 3; 3 > 0, so the record is returned. However, in the case you've described, LIKE seems to be the best solution.
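SQL Server itself has no instr function; the position lookup described above is available in T-SQL as CHARINDEX, which takes the string to find first and the string to search second. A minimal sketch, where dbo.People and FullName are hypothetical names:

```sql
-- Position of N'CD' inside N'ABCDEFG' (1-based; 0 means not found)
SELECT CHARINDEX(N'CD', N'ABCDEFG') AS Position;  -- returns 3

-- Same idea as a filter; dbo.People and FullName are assumed names
DECLARE @search nvarchar(100) = N'مختار';

SELECT FullName
FROM dbo.People
WHERE CHARINDEX(@search, FullName) > 0;
```

Like LIKE with a leading wildcard, this only finds exact substrings, so it would not match a misspelled input such as "مختر".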

answered Dec 26, 2011 at 15:23

xQbert


2 Comments

Martin Smith Over a year ago

There is no instr function in SQL Server. Perhaps you meant SUBSTRING? Also, an (authoritative) citation is needed for "generally slower". There is no obvious reason why performance should differ if the LIKE expression has a leading wildcard; both have to do the same work.

Shahab Over a year ago

Thanks for your reply, but LIKE and SUBSTRING are not what I'm looking for. I want to match "مختر" for them too!

Oleg Dok · Accepted Answer · 2011-12-26 17:54:42Z

1

The general problem arises in languages where the same letter has different written forms at the beginning, middle, and end of a word, and thus different codes. We can try to use specific Persian collations, but in general this will not help.

The second option is to use SQL Server's FTS (full-text search) abilities, but again, if it has no special language module for the language, it is much less useful.

The most general way is to do your own language processing, which is a very complex task. The following keywords and Google can help you understand the size of the problem: DLP, words and terms, bi-grams, n-grams, grammar and morphology inflection.
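To give a feel for the bi-gram idea mentioned above, here is a rough sketch that scores rows by the share of two-character sequences they have in common with the input (a Dice coefficient). This is not the answerer's method, only an illustration; dbo.Bigrams, dbo.People and FullName are hypothetical names, and comparisons follow the column's collation.

```sql
-- Hypothetical helper: the distinct 2-character grams of a string.
-- sys.all_objects is used only as a convenient source of row numbers.
CREATE FUNCTION dbo.Bigrams (@s nvarchar(200))
RETURNS TABLE
AS
RETURN
    WITH n AS (
        SELECT TOP (CASE WHEN LEN(@s) > 1 THEN LEN(@s) - 1 ELSE 0 END)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
        FROM sys.all_objects
    )
    SELECT DISTINCT SUBSTRING(@s, i, 2) AS gram
    FROM n;
GO  -- CREATE FUNCTION must be alone in its batch

DECLARE @q nvarchar(200) = N'مختر';

-- Dice coefficient: 2 * shared grams / (grams in input + grams in row)
SELECT p.FullName,
       2.0 * COUNT(*) / (q.n + c.n) AS Dice
FROM dbo.People AS p
CROSS APPLY (SELECT COUNT(*) AS n FROM dbo.Bigrams(p.FullName)) AS c
CROSS JOIN  (SELECT COUNT(*) AS n FROM dbo.Bigrams(@q)) AS q
CROSS APPLY dbo.Bigrams(p.FullName) AS g
JOIN dbo.Bigrams(@q) AS qg ON qg.gram = g.gram
GROUP BY p.FullName, c.n, q.n
ORDER BY Dice DESC;
```

With the question's data, "مختر" and "مختاری" share the grams "مخ" and "خت", so the row is returned even though LIKE would miss it. Rows sharing no gram with the input are dropped by the join. This scans every row, so it is only a starting point for small tables.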

edited Dec 26, 2011 at 17:54

answered Dec 26, 2011 at 15:33

Oleg Dok


4 Comments

Shahab Over a year ago

Thanks Oleg, but I think FTS could help, because the Persian alphabet is very similar to the Arabic alphabet, and FTS should support Arabic. But I have no idea how to use FTS to solve this problem. :(

Oleg Dok Over a year ago

Do you want me to complete the answer with an example of using FTS?

Oleg Dok Over a year ago

No, I tried your strings with FTS's Arabic module, and it simply does not work, sorry. So, welcome to the world of Natural Language Processing 8-)

Shahab Over a year ago

Oh...GOD :( Can you give me an example of using FTS with arabic module please?

Alexander · Accepted Answer · 2012-11-26 11:25:09Z

0

Try the built-in SOUNDEX() and DIFFERENCE() functions. I hope they work fine for Persian.
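As a quick illustration of these two built-ins: SOUNDEX returns a four-character phonetic code, and DIFFERENCE compares the codes of two strings and returns 0 to 4, where 4 is the closest match. The algorithm is defined over Latin letters, so its usefulness for Persian text is an assumption that should be tested first. dbo.People and FullName are hypothetical names.

```sql
-- Both names encode to S530, so DIFFERENCE reports the best score, 4
SELECT SOUNDEX(N'Smith')             AS Code1,
       SOUNDEX(N'Smyth')             AS Code2,
       DIFFERENCE(N'Smith', N'Smyth') AS Score;

-- Ranking rows against an input; table and column names are assumed
DECLARE @q nvarchar(100) = N'Smyth';

SELECT FullName, DIFFERENCE(FullName, @q) AS Score
FROM dbo.People
WHERE DIFFERENCE(FullName, @q) >= 3
ORDER BY Score DESC;
```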

Look at the following reference: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/

The Similarity() function helps you sort results by similarity (as you asked in your question). It can also use algorithms other than Levenshtein edit distance, depending on the value of the @method parameter:

0 The Levenshtein edit distance algorithm

1 The Jaccard similarity coefficient algorithm

2 A form of the Jaro-Winkler distance algorithm

3 Longest common subsequence algorithm
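A call might look roughly like the sketch below. It assumes the function in question is mdq.Similarity from a Master Data Services (MDS) database, as the linked post's title suggests, and that it takes the two strings, the method number, a containment bias and a minimum score hint, in that order. Check the parameter list against the MDS documentation for your version before relying on it.

```sql
-- Assumption: run inside an MDS database, where the mdq schema exists.
-- Arguments: input1, input2, method (0-3 as listed above),
-- containment bias, minimum score hint.
SELECT mdq.Similarity(N'مختر', N'مختاری', 0, 0.0, 0.0) AS LevenshteinScore,
       mdq.Similarity(N'مختر', N'مختاری', 2, 0.0, 0.0) AS JaroWinklerScore;
```

Ordering a query by this score in descending order would give the similarity ranking asked for in the question.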

answered Nov 26, 2012 at 11:25

Alexander


Comments

NWAUKWA JOHNWENDY CHINEDU · Accepted Answer · 2021-05-10 12:04:13Z

0

The LIKE operator may not do what he is asking for. For example, say I have the record value 'please, I want to ask a question' in my database, and in my query I want to find a similar match for 'Can I ask a question, please'. LIKE could attempt this using LIKE '%[your sentence]' or '[your sentence]%', but it is not advisable for string similarity, because sentences vary and your LIKE logic may not fetch the matching records. It is advisable to use naive Bayes text classification for similarity, assigning labels to your sentences, or you can try the semantic search feature in MS SQL Server.

answered May 10, 2021 at 12:04

NWAUKWA JOHNWENDY CHINEDU


1 Comment

TheMixy Over a year ago

You should reformat your answer. As written, it is completely unreadable and therefore not helpful for the problem.
