Python String Matching exactly equal to PostgreSQL Similarity function

Question

I have been using the similarity function of the pg_trgm module in PostgreSQL, and now I am looking for a word-similarity function in Python that behaves the same way. I have found many methods in Python, e.g. difflib and nltk, but none of them produces results similar to those of PostgreSQL's similarity function.

I have been using the code below for word matching, but its results are very different from those of PostgreSQL's similarity function. Are these results better than PostgreSQL's? Is there any method or library that I can use to produce results similar to the PostgreSQL similarity function?

from difflib import SequenceMatcher

import nltk
from fuzzywuzzy import fuzz


def similar(a, b):
    # Ratio of matching characters, as computed by difflib
    return SequenceMatcher(None, a, b).ratio()


def longest_common_substring(s1, s2):
    # m[x][y] holds the length of the common substring ending at s1[x-1], s2[y-1]
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]


def similarity(s1, s2):
    # Percentage based on the longest common substring
    return 2. * len(longest_common_substring(s1, s2)) / (len(s1) + len(s2)) * 100


print(similarity("New Highway Classic Academy Lahore", "Old Highway Classic Academy"))
print(nltk.edit_distance("This is Your Shop", "This"))
print(fuzz.ratio("ISE-Tower", "UfTowerong,"))

jproffitt · Accepted Answer · 2018-09-18 15:00:04Z

I know this is old, but I had a need for the same thing, and I didn't find anything when Googling for python packages that do trigram similarity the same way that postgres does it.

So I wrote a very basic function to do it. I have tested it on a few strings, and it seems to give the exact same result as postgres does. If you're interested, here it is:

import re


def find_ngrams(text: str, number: int=3) -> set:
    """
    returns a set of ngrams for the given string
    :param text: the string to find ngrams for
    :param number: the length the ngrams should be. defaults to 3 (trigrams)
    :return: set of ngram strings
    """

    if not text:
        return set()

    words = [f'  {x} ' for x in re.split(r'\W+', text.lower()) if x.strip()]

    ngrams = set()

    for word in words:
        for x in range(0, len(word) - number + 1):
            ngrams.add(word[x:x+number])

    return ngrams


def similarity(text1: str, text2: str, number: int=3) -> float:
    """
    Finds the similarity between 2 strings using ngrams.
    0 being completely different strings, and 1 being equal strings
    """

    ngrams1 = find_ngrams(text1, number)
    ngrams2 = find_ngrams(text2, number)

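    num_unique = len(ngrams1 | ngrams2)
    num_equal = len(ngrams1 & ngrams2)

    # Avoid a ZeroDivisionError when neither string contains a word character
    if num_unique == 0:
        return 0.0

    return float(num_equal) / float(num_unique)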

user955340 · Accepted Answer · 2017-09-13 14:09:27Z

From the PostgreSQL documentation: https://www.postgresql.org/docs/9.1/static/pgtrgm.html

A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.

Note: A string is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string "cat" is "  c", " ca", "cat", and "at ".
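To make the padding rule in the note concrete, here is a minimal sketch for a single word. The names trigrams and trigram_similarity are mine, not part of any library, and the formula (shared trigrams divided by all distinct trigrams) is an assumption, since the quoted note does not spell it out. It does not split multi-word strings.

```python
def trigrams(word: str) -> set:
    """Trigrams of one word, padded as the note describes (hypothetical helper)."""
    padded = f"  {word.lower()} "  # two spaces in front, one behind
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Assumed formula: shared trigrams / all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


print(sorted(trigrams("cat")))            # ['  c', ' ca', 'at ', 'cat']
print(trigram_similarity("cat", "cart"))  # 2 shared out of 7 distinct
```

"cat" and "cart" share only "  c" and " ca", and together they have 7 distinct trigrams, so the sketch gives 2/7, roughly 0.29. It is worth checking a few values against SELECT similarity(...) in your own database before relying on it.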

There is no built-in module for this functionality in Python. Libraries such as fuzzyset may help, but either way there is no standard function in Python for this.
