Pick a random string at a fixed edit distance

Question

I have string \$s\$ of length \$n\$ and some constant integer \$k\$ which is at most \$n\$. Give the fastest algorithm to sample a random string with Levenshtein distance \$k\$ from \$s\$ uniformly.

Your algorithm should output any of the strings with edit distance exactly \$k \leq n\$ from the input string \$s\$ with the same probability. You should assume that only the characters in the string \$s\$ are used.

For example, if \$s\$ = "1111011010" the number of strings with distance \$k\$ is

 k | # of results

 0              1
 1             28
 2            302
 3           1652
 4           5533
 5          14533
 6          34808
 7          80407
 8         180663
 9         395923

If \$k\$ = 9 you should output one of the 395923 possible output strings chosen uniformly at random.

As rejection sampling (do k random edits, see if it is actually at distance k) won't work (longer results have less chance, some strings have multiple edit sequences of length k), the only viable answer is to generate all of them and pick one. The number of strings at distance k does grow exponentially, so if someone finds a strictly better algorithm than \$O(2^{|s|+k}\times |s|(|s|+k))\$ time (enumerate all strings up to length \$|s|+k\$, compute the Levenshtein distance for all of them), that would be very interesting. — Bubbler
– Bubbler, Commented Aug 30, 2023 at 23:11
The Metropolis–Hastings algorithm could be used to sample from this distribution, but the sampling is only approximate. Additionally, rejection sampling could work, if you count the number of different edit sequences which give a particular string (although you need a good bound on it, which might be mathematically challenging to get) — Command Master
– Command Master, Commented Aug 31, 2023 at 6:39
@CommandMaster Then I guess I'm not getting how you'd have the matrix in the first place, because it assumes that you have the second string...? — Bubbler
– Bubbler, Commented Aug 31, 2023 at 9:30
It think the simplest solution would be to build a DFA representing possible solutions. Then sampling is easy. — user1502040
– user1502040, Commented Aug 31, 2023 at 9:54
What about \$ k = 10 \$? The question says that \$ k \$ can go up to \$ n \$ inclusive of \$ n \$ as well — Rhaixer
– Rhaixer, Commented Aug 31, 2023 at 12:02

gsitcia · Accepted Answer · 2023-09-06 01:09:26Z

Python, 3453 bytes

import random

def skip():pass

class State:
    states = []
    def __init__(self, accepting=False):
        self.n = len(State.states)
        State.states.append(self)
        self.accepting = accepting
        self.transitions = {}
    def __repr__(self):
        return f'q_{self.n}'
    def add(self, c, s):
        self.transitions.setdefault(c, set()).add(s)
    def __eq__(self, o):
        return self.n == o.n
    def __hash__(self):
        return hash((State, self.n))
    def epsilon_helper(self):
        z = {self}
        if '' in self.transitions:
            for s in self.transitions.pop(''):
                z.update(s.epsilon_helper())
        self.epsilon_helper = lambda:z
        return z 
    def remove_epsilon(self):
        self.remove_epsilon = skip
        for a in self.epsilon_helper():
            a.remove_epsilon()
            for x, S in a.transitions.items():
                for b in list(S):
                    b.remove_epsilon()
                    for c in b.epsilon_helper():
                        self.add(x, c)
                        c.remove_epsilon()
    def freeze(self):
        self.freeze = lambda:None
        for c, s in self.transitions.items():
            for x in s:
                x.freeze()
            self.transitions[c] = frozenset(s)
    def count_accepting(self):
        x = self.accepting + sum(s.count_accepting() for S in self.transitions.values() for s in S)
        self.count_accepting = lambda:x
        return x
    def accept(self, w):
        if w == '': return self.accepting
        return any(s.accept(w[1:]) for s in self.transitions.get(w[0],()))
    def pick(self, i):
        if self.accepting:
            if i == 0: return ''
            i -= 1
        for j, s in self.transitions.items():
            s = min(s)
            x = s.count_accepting()
            if i < x:
                return j + s.pick(i)
            i -= x
        raise IndexError

def levenshtein(w, n):
    A = set(w)
    W = len(w)
    states = [[State(j <= i) for j in range(len(w),-1,-1)] for i in range(n,-1,-1)]
    for i, c in enumerate(w):
        for j in range(n):
            s = states[j][i]
            for x in A:
                s.add(x, states[j+1][i])
                if x != c:
                    s.add(x, states[j+1][i+1])
            s.add('', states[j+1][i+1])
            s.add(c, states[j][i+1])
        states[n][i].add(c, states[n][i+1])
    for j in range(n):
        s = states[j][W]
        for x in A:
            s.add(x, states[j+1][W])
    s = states[0][0]
    s.remove_epsilon()
    s.freeze()
    return s

def union(states):
    d = {}
    accepting = False
    for s in states:
        for c, S in s.transitions.items():
            d[c] = d.get(c, frozenset()) | S
        if s.accepting:
            accepting = True
    return d, accepting

def difference(q0, q1, D = None):
    if D is None:
        return difference(frozenset([q0]), frozenset([q1]), {})
    if (q0, q1) in D:
        return D[q0, q1]
    d0, a0 = union(q0)
    d1, a1 = union(q1)
    Q0 = State(a0 and not a1)
    D[q0, q1] = Q0
    for i in set(d0) | set(d1):
        Q0.transitions[i] = frozenset([difference(d0.get(i, frozenset()), d1.get(i, frozenset()), D)])
    return Q0

def exact(w, n):
    if n == 0: raise ValueError
    return difference(levenshtein(w, n), levenshtein(w, n-1))

def f(w, n):
    s = exact(w, n)
    return s.pick(random.randrange(s.count_accepting()))

Attempt This Online!

My attempt at implementing user1502040's idea of sampling the DFA.

The algorithm runs in basically three stages:

First it generates NFAs for s,k and s,k-1 (where each accepts words with at most k or k-1 errors respectively)
- The NFAs have \$O(|s|\cdot k)\$ states, and building them takes time proportional to that (also proportional to the size of the alphabet)
Then it uses the algorithm outlined in this cs.se answer to find a DFA for the difference of two NFAs
- This takes the majority of the time, but I'm not sure of the actual complexity. Given that the algorithm is basically the powerset construction, it's worst case exponential, but it appears to do better than that. (I found an algorithm that can find a DFA for a given word with <=k errors that takes time linear with respect to the length of the word (for fixed k). See Schulz and Mihov (2002)).
- I tried graphing the number of states in the DFA for s="1111011012" and k from 1 to 30 and it appeared linear (especially for k>10), but I also tried s="the quick brown fox jumps over the lazy dog" and k from 1 to 7 and it appeared exponential. **
It counts the number of states, and picks a random index and chooses the path with that index in a preorder traversal of the DFA. (am I using my jargon right?)
- This takes time linear in the size of the DFA (using dynamic programming).

In practice, the program appears to run quite quickly; it takes about 5 seconds for s="the quick brown fox jumps over the lazy dog" and k=5.

** For some reason the image feature isn't working for me, so here is the data: [35, 101, 228, 436, 722, 1080, 1464, 1833, 2184, 2519, 2844, 3162, 3480, 3798, 4116, 4434, 4752, 5070, 5388, 5706, 6024, 6342, 6660, 6978, 7296, 7614, 7932, 8250, 8568] and [172, 669, 2579, 10226, 40094, 155722, 606044]

This program doesn’t seem to be able to generate the empty string, which should be a valid string if k = n. — Anders Kaseorg
– Anders Kaseorg, Commented Sep 5, 2023 at 20:01
@AndersKaseorg Thanks for pointing that out! I've changed it so that in the initial NFA, every state that has a chain of epsilon-transitions to an accepting state is also marked accepting. — gsitcia
– gsitcia, Commented Sep 6, 2023 at 1:12
Short of an algorithmic breakthrough, this may be as good as it gets. Thank you — Simd
– Simd, Commented Sep 9, 2023 at 10:03

bsoelch · Accepted Answer · 2023-09-06 07:01:54Z

Python, 932 bytes, \$O(4^k\cdot n^{2k+2})\$

import random
def levenshteinDistance(s,t):
  m=len(s);
  n=len(t);
  d=[(n+1)*[0]for _ in range(m+1)]
  for i in range(m+1):
    d[i][0]=i
  for j in range(n+1):
    d[0][j]=j
  for j in range(n):
    for i in range(m):
      if s[i] == t[j]:
        substitutionCost = 0
      else:
        substitutionCost = 1
      d[i+1][j+1] = min(d[i][j+1] + 1,                # deletion
                         d[i+1][j] + 1,               # insertion
                         d[i][j] + substitutionCost)  # substitution
  return d[m][n]

def build(s,k,i0=0):
  if k==0:
    yield s;return
  for i in range(i0,len(s)+1):
    t=s[:i]+s[i+1:]
    for q in build(t,k-1,i):
     yield q
    for c in set(s):
      t=s[:i]+c+s[i+1:]
      for q in build(t,k-1,i):
       yield q 
      t=s[:i]+c+s[i:]
      for q in build(t,k-1,i):
       yield q

def f(s,k):
  return random.choice([t for t in set(build(s,k))if levenshteinDistance(s,t)==k])

Attempt This Online!

Runs in \$O(n(n+k)(2n(n+k))^k\$

Generates all possible strings reachable with k edits, then pick a random string with edit distance k.

The filtering by edit distance is necessary to filter out combinations of edits the can be expressed by a shorter edit sequence (e.g. Hello -> HHello -> Heello -> Hello), I did not find any generation rule that reliably avoids every possible way to generate a shorter edit sequence.

Running time:

I am not that familiar with estimating the exact running time of recursive functions, so there might be a mistake in my calculations

levenshteinDistance(s,t) runs in \$O(|s||t|)\$

setting \$n=|s|\$

build(s,k,i) runs in \$b(l,k,i)=O(\sum_{j=i}^{l}*(b(l-1,k-1,j)+n(b(l+1,k-1,j)+b(l,k-1,j)))\$ \$=\sum_{j=i}^{l}*O((2n+1)*b(l+1,k-1,j))=O(((n+k)*(2n+1))^k)\$

f runs in \$O(((n+k)*(2n+1))^k)*O(n(n+k))=O(n(n+k)(2n(n+k))^k)\$

or assuming \$k\le n\$ \$O(n(n+n)(2n(n+n))^k)=O(4^k\cdot n^{2k+2})\$

If you're willing to play it fast and loose with the fact that \$k\leq |s|\$ your complexity can be simplified to \$\mathcal{O}(4^k |s|^{2k+2})\$. — Value Ink
– Value Ink, Commented Sep 6, 2023 at 0:33

Value Ink · Accepted Answer · 2023-09-06 00:04:29Z

Python, \$\mathcal{O}((n+k)^{2k})\leq\mathcal{O}(4^k n^{2k})\$

from random import choice
from collections import deque
from sys import stderr

def randomWithFixedDistance(s: str, k: int) -> str:
  if k <= 0: return s

  chars: set[str]          = set(s)
  distances: dict[str,int] = {s: 0}
  queue                    = deque((s,))
  valid: set[str]          = set()

  while len(queue) > 0:
    current = queue.popleft()
    nextdist = distances[current] + 1
    for nxt in insertions(current, chars) | substitutions(current, chars) | removals(current):
      if nxt in distances: continue
      distances[nxt] = nextdist
      if nextdist < k:
        queue.append(nxt)
      else:
        valid.add(nxt)

  print(f"DEBUG: Found {len(valid)} possible strings", file=stderr)
  return choice(list(valid))

def insertions(s: str, chars: set[str]) -> set[str]:
  result = set()
  for i in range(len(s)+1):
    result.update(s[:i] + c + s[i:] for c in chars)
  return result

def substitutions(s: str, chars: set[str]) -> set[str]:
  result = set()
  for i in range(len(s)):
    result.update(s[:i] + c + s[i+1:] for c in chars if c != s[i])
  return result

def removals(s:str) -> set[str]:
  return set(s[:i] + s[i+1:] for i in range(len(s)))

Attempt This Online!

Uses BFS to construct the sampling list, by using the principles of how Levenshtein distance is calculated to walk through the set of insertions, substitutions and deletions necessary. The process used actually looks somewhat similar to bsolech's answer, but without the need to check the Levenshtein distance of every resulting string at the end, since it keeps track of the growing distances while building the list. I believe this makes it faster than their answer by an order of about \$O(n^2)\$, although it's still nowhere near close to gsitcia's answer. Oh well.

Efficiency reasoning is as follows:

\$f(s,1)\$ looks at up to \$n\times(n+1)\$ insertions, which is \$\mathcal{O}(n^2)\$. Substitutions and deletions are always fewer than insertions, being \$n\times(n-1)\$ and \$n\$ respectively, so they get wrapped into this big-O.
\$f(s,k)\$, for each of the strings from \$f(s,k-1)\$, looks at a number of strings on the order of \$\mathcal{O}((n+k)^2)\$, so \$\mathcal{O}((n+k)^2 \times f(s,k-1))\$.
This means the entire function is \$\mathcal{O}(\prod_{i=1}^k (n+i)^2) = \mathcal{O}((n+k)^{2k})\$. And since \$k\leq n\$ this can also be expressed as the (IMO) cleaner \$\mathcal{O}((2n)^{2k}) = \mathcal{O}(2^{2k} n^{2k}) = \mathcal{O}((2^2)^k n^{2k}) = \mathcal{O}(4^k n^{2k})\$.

"although it's still nowhere near close to gsitcia's answer." Isn't it actually faster ? — Simd
– Simd, Commented Sep 6, 2023 at 8:39
@Simd it isn't even close. gsitcia's answer calculates f("1111011010",9) in a quarter of a second while mine takes nearly 25 seconds to complete. Similarly, my code hits the 60-second timeout limit on ATO for f("the quick brown fox jumps over the lazy dog", 4) but they can do it in about 2 seconds. — Value Ink
– Value Ink, Commented Sep 6, 2023 at 18:44

Stack Exchange Network

Pick a random string at a fixed edit distance

3 Answers 3

Python, 3453 bytes

Python, 932 bytes, \$O(4^k\cdot n^{2k+2})\$

Running time:

Python, \$\mathcal{O}((n+k)^{2k})\leq\mathcal{O}(4^k n^{2k})\$

Your Answer

Linked

Hot Network Questions

Pick a random string at a fixed edit distance

3 Answers 3

Python, 3453 bytes

Python, 932 bytes, \$O(4^k\cdot n^{2k+2})\$

Running time:

Python, \$\mathcal{O}((n+k)^{2k})\leq\mathcal{O}(4^k n^{2k})\$

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Hot Network Questions