2

I have a SQL server table which contains several string columns. I need to write an application which gets a string and search for similar strings in SQL server table.

For example, if I give the "مختار" or "مختر" as input string, I should get these from SQL table:

1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری

I've searched the net for a solution but I have found nothing useful. I've read this question , but this will not help me because:

  1. I am using MS SQL Server not MySQL
  2. my table contents are in Persian, so I can't use Levenshtein distance and similar methods
  3. I prefer an SQL Server only solution, not an indexing or daemon based solution.

The best solution would be a solution which help us sort result by similarity, but, its optional.

Do you have any suggestion for that?

Thanks

0

6 Answers 6

4

MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?

SELECT * FROM table WHERE input LIKE '%مختار%'
Sign up to request clarification or add additional context in comments.

2 Comments

Yes sorry, wasn't thinking :) Just saw it.
Thanks for your reply, but LIKE is not what I'm looking for. I want to match "مختر" for them too!
3

Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?

--This part searches for a string you want

declare @MyString varchar(max)

set @MyString = (Select column from table
where **LOGIC TO FIND THE STRING GOES HERE**)


--This part searches for that string

select searchColumn, ABS(Len(searchColumn) - Len(@MyString)) as Similarity
from table where data LIKE '%' + @MyString + '%'
Order by Similarity, searchColumn

The similarity part is something like the thing you posted. If the strings are "more similar" meaning that they have a similar length, they will be higher on the results query. The absolute part can be avoided obviously but I did it just in case.

Hope that helps =-)

8 Comments

Yes, its closer to what I'm looking for. But considering the length as similarity factor is not a good idea. for example "test" and "find" are not similar at all, but their length are equal.
But wait a second, the query that I wrote it won't provide "find" as a result if you search for "test". If you search for "test" and there's "test" "find" "tester" and "testing" in your database, the result will be: test tester testing In that order. If you search for "find" instead only 1 result will be provided (find). If you search for "in" two results will be provided: find testing (in that order, due to the length stuff). I don't know persian, but if you are using an nvarchar column the result will be the same regardless the language you use.
Oh, yes, sorry, I didn't see the where clause. But it still don't solve my problem, because I want find to be a match for fnd.
Oooh, okay then. I didn't understand that. Here's the thing, first of all, if you want to do this using LIKE, it'll be a performance killer. Well, actually, if you do this for large character strings it'll be a performance killer anyway due to the immense posibilities of a word containing a couple of characters.Anyway, you first need to make an index in your search column (right click table in obj explorer, full text indexes). After that, you'll be able to use the function CONTAINS. The sintax will be: select SearchColumn from Table where contains(SearchColumn, ' "test" '). (Continues down)
That thing above will do the same thing as the LIKE function using "test". Now, if you wanna add wildcards the sintax is like this: contains(SearchColumn, ' "test" '). That's right, the * in Contains is the same as % in LIKE. Knowing that, you will now need a function that returns a table according to your search needs given a character string. For example, if you give the function "test", it should return a table with test,test*,test*,test*. You could go on with the wildcards but your query will take a lot of time for large tables and strings. (Continues down)
|
1

Besides like operator, you can use the condition WHERE instr(columnname, search) > 0; however this is generally slower. What it does is return the starting position of a string within another string. thus if searching in ABCDEFG for CD it would return 3. 3>0, so the record would be returned. However in the case you've described, like seems to be the best solution.

2 Comments

There is no instr function in SQL Server. Perhaps you meant SUBSTRING also (authoratitive) citation needed for "generally slower". There is no obvious reason why performance should be different if the LIKE expression has a leading wildcard. Both have to do the same work.
Thanks for your reply, but LIKE and SUBSTRING are not what I'm looking for. I want to match "مختر" for them too!
1

The general problem is that in languages where the same letter has different writing form in the beginning, middle and at the end of word, and thus - different codes - we can try to use specific Persian collations, but in general this will not help.

The second option - is to use SQL FTS abilities, but again - if it has not special language module for the language - it is much less useful.

And most general way - to use your own language processing - which is very complex task at all. The next keywords and google can help to understand the size of the problem: DLP, words and terms, bi-gramms, n-gramms, grammar and morphology inflection

4 Comments

Thanks Oleg, but I think FTS could help, because Persian alphabet is too similar with Arabic alphabet and FTS should support Arabic. But I have no idea on how to use FTS to solve this problem. :(
You want me to fullfil the answer with example of using Fts?
No, I tried your strings with FTS' Arabic module - it simply not working, sorry. So - welcome to the world of Natural Language Processing 8-)
Oh...GOD :( Can you give me an example of using FTS with arabic module please?
0

Try to use the Built-in Soundex() And Difference() functions. I hope they work fine for Persian.

Look at the following reference: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/

Similarity() function helps you to sort result by similarity (as you asked in your question) and it is also possible using algorithms different from Levenshtein edit distance depends on the Value for @method Algorithm:

0 The Levenshtein edit distance algorithm

1 The Jaccard similarity coefficient algorithm

2 A form of the Jaro-Winkler distance algorithm

3 Longest common subsequence algorithm

Comments

0

Like operator may not do what he is asking for. Like for example, if i have a record value "please , i want to ask a question' in my database record. and lets say on my query, i want to find a match similarity like this 'Can i ask a question, please'. like operator may do this using like %[your senttence] or [your sentence]% but it is not advisable to use it for string similarity cos sentences may change and all your like logic may not fetch the matching records. It is advisable to use naive bayes text classification for similarities assigning labels to your sentences or you can try the semantic search function in MSSQL server

1 Comment

you should re-format your answer. This way it is completely unreadeable and therefor not helpful to the problem.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.