
I'm writing a plagiarism detection algorithm in Python that scores how much of document A has been plagiarised in document B and returns all the sections of B that were copied from A.

Here's my working version:

from typing import Iterator, Optional

from textdistance import LCSStr

DEFAULT_MIN_SUBSTRING_LEN = 30


def get_doc_lcs_matches(
    doc1: str,
    doc2: str,
    min_substring_len: int = DEFAULT_MIN_SUBSTRING_LEN
) -> Iterator[str]:

    get_lcs = LCSStr()

    while True:

        # Find a satisfactory LCS:
        lcs = get_lcs(doc1, doc2)
        lcs_len = len(lcs)
        if lcs_len < min_substring_len:
            break
        yield lcs

        # Remove the found LCS substring from both strings:
        lcs_idx_1 = doc1.find(lcs)
        lcs_idx_2 = doc2.find(lcs)
        doc1 = doc1[:lcs_idx_1] + doc1[lcs_idx_1 + lcs_len:]
        doc2 = doc2[:lcs_idx_2] + doc2[lcs_idx_2 + lcs_len:]


def get_similarity(
    doc1: str,
    doc2: str,
    min_substring_len: int = DEFAULT_MIN_SUBSTRING_LEN,
    stop_score: Optional[float] = None
) -> dict:

    matches = []
    doc1_len = len(doc1)
    count = 0
    score = 0.0

    for match in get_doc_lcs_matches(doc1, doc2, min_substring_len):
        matches.append(match)
        count += len(match)
        score = count / doc1_len
        if stop_score is not None and score >= stop_score:
            break

    return {
        "count": count,
        "score": score,
        "matches": matches
    }

I've added stop_score as a way to stop early if a certain plagiarism threshold has been reached, which helps speed this up considerably for my use case.
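For context, here is roughly how it gets called (the document contents below are made up purely for illustration):

# Toy documents for illustration only:
doc_a = "The quick brown fox jumps over the lazy dog. " * 3
doc_b = "Unrelated opening text. The quick brown fox jumps over the lazy dog. More unrelated text."

report = get_similarity(
    doc_a,
    doc_b,
    min_substring_len=20,
    stop_score=0.5,  # stop as soon as half of doc_a is accounted for
)
print(report["score"], report["matches"])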

My problem, however, is that this is still extremely slow for large documents (1 MB+), and I have quite a few to process!

I'm wondering if anyone can suggest anything that would improve efficiency? I'm also open to alternative solutions to my original problem.

--- UPDATE ---

@Pseudonym suggested tokenising. I considered this, but unfortunately it won't suit my use case:

  • The documents are technical in nature and are about the same subject, so the vocabulary set is rather small.
  • The kind of plagiarism I'm looking for is exclusively the copy-and-paste variety. I don't particularly care about catching slight modifications, so I don't need fuzzy or semantic matching.

Hence my current algorithm focuses on LCSS.


1 Answer


Questions about specific programs or libraries are off-topic here, but let's talk about text similarity algorithms.

I have two suggestions that may help.

First, consider tokenising the text into words. It obviously depends on your goals, but I suspect that a score based on the number of matching words would be more informative than a score based on the number of matching characters, because it ignores whitespace. Case-folding is probably a good idea, and you could also consider stemming if you want to catch near-similarities like a change in a verb's tense.
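To make that concrete, here is a rough sketch of the kind of preprocessing I mean (the helper names and the crude bag-of-words score are mine, just for illustration; a real check would match runs of tokens rather than individual words):

import re


def tokenise(text: str) -> list[str]:
    # Case-fold and split on runs of non-alphanumeric characters.
    # A proper tokeniser, and optionally a stemmer, could be swapped in here.
    return re.findall(r"[a-z0-9]+", text.casefold())


def word_overlap_score(doc1: str, doc2: str) -> float:
    # Crude illustration only: the fraction of doc1's tokens that also appear
    # anywhere in doc2. A plagiarism detector would look for long runs of
    # consecutive matching tokens instead of a bag-of-words overlap.
    tokens1 = tokenise(doc1)
    tokens2 = set(tokenise(doc2))
    if not tokens1:
        return 0.0
    return sum(token in tokens2 for token in tokens1) / len(tokens1)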

Second, if your goal is to find long runs of matches, LCSS is probably not the best way of doing this for a reason that you may not have considered: no matter how you compute it, a longest common substring may not be a cleanest common substring.

Consider two documents: Document A contains 10 'a' characters, and document B contains 20 'a' characters.

What you might hope is that an LCSS algorithm would match the 10 characters in document A with the first or last 10 characters in document B. But it's also a correct answer to match every second character in document B. That is, after all, still a common substring with the maximum possible length of 10 characters.

This specific simple case probably won't happen with most efficient LCSS algorithms, but it happens in practice on more complicated examples. The GNU diff program actually has a post-pass on its LCSS algorithm to make the diff "pretty".

A better approach would be to use suffix arrays. If you sort the suffixes of both documents into a single generalised suffix array, which takes linear time and space, then suffixes that share a long prefix end up adjacent in the sorted order, and finding long common substrings becomes straightforward.
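Here is a naive sketch of the idea, for illustration only: the sorted-by-suffix construction below is nowhere near linear (and far too memory-hungry for 1 MB documents), so in practice you would use a proper construction such as SA-IS or a libdivsufsort binding, plus an LCP array, but the scanning step stays the same.

def find_common_substrings(doc1: str, doc2: str, min_len: int = 30) -> list[str]:
    # Generalised suffix array over both documents, joined by a separator
    # assumed not to occur in either document.
    sep = "\x00"
    text = doc1 + sep + doc2
    boundary = len(doc1)  # suffixes starting before this index come from doc1

    # Naive construction for illustration; replace with SA-IS / libdivsufsort.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])

    def lcp(i: int, j: int) -> int:
        # Length of the longest common prefix of the suffixes starting at i and j.
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n

    matches = []
    # After sorting, similar suffixes sit next to each other; a long common
    # prefix between neighbouring suffixes from *different* documents is a
    # substring shared by doc1 and doc2.
    for a, b in zip(suffix_array, suffix_array[1:]):
        if (a < boundary) != (b < boundary):
            length = lcp(a, b)
            if length >= min_len:
                matches.append(text[a:a + length])
    return matches

The matches it returns will overlap and repeat, so you would still deduplicate and score them, but it avoids recomputing an LCSS over the whole documents on every iteration of your current loop.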

  • Thank you for the thoughtful response! I clarified my use case a bit more in the question, explaining why tokenizing won't be helpful. I'm wondering if your insight about cleanest common substring still applies, given my clarification. I won't be displaying a visual diff; I only need the 3 pieces of info returned from my get_similarity function. (Commented Mar 8, 2024 at 19:30)
  • Your suggestion to look into suffix lists was brilliant - thank you! It led me to find this implementation. However, after trying it out, building a suffix list for a 1MB document quickly exhausted the 32GB RAM on my machine. I rewrote the algorithm to store the suffix list in a pygtrie.StringTrie, but even that caused an OOM. Unless I'm missing something, suffix lists might unfortunately be too memory-inefficient for the size of documents in my use case. (Commented Mar 8, 2024 at 19:31)
  • I don't know about Python specifically, but you might be able to find a binding for libdivsufsort: github.com/y-256/libdivsufsort Or, as I suggested, build a suffix array on words rather than characters. (Commented Mar 9, 2024 at 4:55)
  • One more thing, if space is at a premium... en.wikipedia.org/wiki/Compressed_suffix_array (Commented Mar 13, 2024 at 3:22)
