I'm creating a plagiarism detection algorithm in Python, which scores how much of document A has been plagiarised in document B, and returns all the sections of B that were copied from A.
Here's my working version:
from textdistance import LCSStr

DEFAULT_MIN_SUBSTRING_LEN = 30


def get_doc_lcs_matches(
    doc1: str,
    doc2: str,
    min_substring_len: int = DEFAULT_MIN_SUBSTRING_LEN
):
    # LCSStr() called on two strings returns the longest common
    # substring itself, not just its length.
    get_lcs = LCSStr()
    while True:
        # Find a satisfactory LCS:
        lcs = get_lcs(doc1, doc2)
        lcs_len = len(lcs)
        if lcs_len < min_substring_len:
            break
        yield lcs
        # Remove the found LCS substring from both strings:
        lcs_idx_1 = doc1.find(lcs)
        lcs_idx_2 = doc2.find(lcs)
        doc1 = doc1[:lcs_idx_1] + doc1[lcs_idx_1 + lcs_len:]
        doc2 = doc2[:lcs_idx_2] + doc2[lcs_idx_2 + lcs_len:]

def get_similarity(
    doc1: str,
    doc2: str,
    min_substring_len: int = DEFAULT_MIN_SUBSTRING_LEN,
    stop_score: float | None = None
):
    matches = []
    doc1_len = len(doc1)
    count = 0
    score = 0.0
    for match in get_doc_lcs_matches(doc1, doc2, min_substring_len):
        matches.append(match)
        count += len(match)
        score = count / doc1_len
        # `is not None` so that a stop_score of 0.0 still triggers:
        if stop_score is not None and score >= stop_score:
            break
    return {
        "count": count,
        "score": score,
        "matches": matches
    }
I've added stop_score as a way to stop early if a certain plagiarism threshold has been reached, which helps speed this up considerably for my use case.
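For example, here's roughly how I call it (toy strings purely for illustration; my real inputs are whole files):

doc_a = "Results were obtained using the standard calibration procedure described in section 2."
doc_b = "As noted, results were obtained using the standard calibration procedure described in section 2, then filtered."

result = get_similarity(doc_a, doc_b, min_substring_len=30, stop_score=0.8)
print(result["score"])    # fraction of doc_a found verbatim in doc_b
print(result["matches"])  # the copied substrings themselves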
My problem, however, is that this is still extremely slow for large documents (1 MB+), and I have quite a few to process!
I'm wondering if anyone can suggest anything that would improve efficiency? I'm also open to alternative solutions to my original problem.
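For comparison, the closest standard-library tool I'm aware of is difflib.SequenceMatcher, though I'm not sure it helps at this scale: it's also quadratic in the worst case, and get_matching_blocks() only finds matches that appear in the same relative order in both documents, so reordered copied sections would be missed. A rough sketch of what I mean:

from difflib import SequenceMatcher

def difflib_matches(doc1: str, doc2: str, min_len: int = DEFAULT_MIN_SUBSTRING_LEN):
    # autojunk=False disables the popular-element heuristic, which
    # otherwise silently ignores frequent characters in large inputs.
    sm = SequenceMatcher(None, doc1, doc2, autojunk=False)
    return [
        doc1[block.a:block.a + block.size]
        for block in sm.get_matching_blocks()
        if block.size >= min_len
    ]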
--- UPDATE ---
@Pseudonym suggested tokenising. I considered this, but it won't suit my use case unfortunately:
- The documents are technical in nature and are about the same subject, so the vocabulary set is rather small.
- The kind of plagiarism I'm looking for is exclusively the copy & paste variety. I don't particularly care about catching slight modifications, so I don't need fuzzy or semantic matching.
Hence my current algorithm focuses on exact longest-common-substring (LCS) matching.
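Given that exact copy & paste is all I need to catch, I've been wondering whether fixed-length k-gram fingerprinting would be a better fit than repeated LCS calls. A rough, untested sketch (the function name is mine; a real version would need a verification pass, since built-in hash() can collide and a merged span isn't guaranteed to be contiguous in doc1):

def get_kgram_matches(doc1: str, doc2: str, k: int = DEFAULT_MIN_SUBSTRING_LEN):
    # Fingerprint every k-character window of doc1. Hashing each slice
    # costs O(k), so this is O(len(doc1) * k) overall rather than a true
    # O(len) rolling hash, but still far cheaper than repeated LCS.
    seen = {hash(doc1[i:i + k]) for i in range(len(doc1) - k + 1)}
    # Scan doc2 and merge runs of consecutive matching windows
    # into maximal copied spans:
    spans = []
    start = None
    end = 0
    for i in range(len(doc2) - k + 1):
        if hash(doc2[i:i + k]) in seen:
            if start is None:
                start = i
            end = i + k
        elif start is not None:
            spans.append(doc2[start:end])
            start = None
    if start is not None:
        spans.append(doc2[start:end])
    return spans

Would that be a sound direction, or am I missing something that makes repeated LCS necessary here?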