5

I'm using some Java code to do fast prefix lookups, using java.util.TreeSet, could I be using scala's TreeSet instead? Or a different solution?

/** A class that uses a TreeSet to do fast prefix matching
 */
class PrefixMatcher {
  private val _set = new java.util.TreeSet[String]

  def add(s: String) = _set.add(s)

  def findMatches(prefix: String): List[String] = {
    val matches = new ListBuffer[String]
    val tailSet = _set.tailSet(prefix)
    for ( tail <- tailSet.toArray ) {
      val tailString = tail.asInstanceOf[String]
      if ( tailString.startsWith(prefix) ) 
        matches += tailString
      else
        return matches.toList    
    }

    matches.toList
  }
}
2
  • While tries are tree-structured, I don't see how a TreeSet is going to help you implement a trie. The use of the tree in a TreeSet is just to exploit the ordering on the elements to speed contains testing and insertion. The tree aspect is not a part of its API. Commented Dec 9, 2009 at 16:01
  • The fact that the class is named Trie is probably a mistake, I copied it from somewhere else. So, ignoring the name, I think TreeSet is a reasonable approach for matching prefixes. Commented Dec 9, 2009 at 16:13

5 Answers 5

14

Use a Trie. Nobody's actually posted a Trie here yet, despite the fact that some people have posted sorted TreeMap data structures that they have misnamed as tries. Here is a fairly representative sample of a Trie implementation in Java. I don't know of any in Scala. See also an explanation of Tries on Wikipedia.

Sign up to request clarification or add additional context in comments.

1 Comment

There is a Patricia Trie implementation in scala used by scala-wurfl project. The last source here
6

The from & takeWhile approach:

class PrefixMatcher {
    private val _set = new TreeSet[String]
    def add(s: String) = _set.add(s)
    def findMatches(prefix: String): Iterable[String] =
        _set from prefix takeWhile(_ startsWith prefix)
}

An alternative is to select a subset from prefix to prefix++ (the smallest string after the prefix). This selects only the range of the tree that actually starts with the given prefix. Filtering of entries is not necessary. The subSet method will create a view of the underlying set.

There's still some work (overflow and empty strings won't work) in the increment method but the intent should be clear.

class PrefixMatcher {
  private val _set = new java.util.TreeSet[String]

  def add(s: String) = _set.add(s)

  def findMatches(prefix: String) : Set[String] = {
    def inc(x : String) = { //ignores overflow
       assert(x.length > 0) 
       val last = x.length - 1
       (x take last) + (x(last) + 1).asInstanceOf[Char]
    }
   _set.subSet(prefix, inc(prefix))
  }
}

The same works with the scala jcl.TreeSet#range implementation.

9 Comments

Thanks for the suggestion, I think the from syntax looks simpler for this scenario than subSet.
Iterating over the set is even simpler. :-) The difference is the runtime behavior. Selecting a subset for the prefix is quite selective and will skip most of the tree.
Heh. I'm comparing using the method 'from' to get a subset with no end (so you have to check to see if you've reached the end), to your suggestion of using subSet, both skip most of the tree I think.
Yes you're right. I've added a from version that does abort when the end of prefix is reached (using takeWhile). I suppose this has the same runtime behavior as your implementation and the subSet implementation. It's quite readable.
Great. I tried to upvote you, and turns out I'd already upvoted you, so it downvoted you, and now it says too old to change, sorry!!
|
3

As I understand it, the Scala TreeSet is backed by the Java TreeSet, but using the Scala variant would allow you to shorten up the code using a sequence comprehension (http://www.scala-lang.org/node/111) giving you an implementation that looked something like (for Scala 2.7):

import scala.collection.jcl.TreeSet;

class PrefixMatcher 
{
    private val _set = new TreeSet[String]

    def add(s: String) = _set.add(s)

    def findMatches(prefix: String): Iterable[String] =
        for (s <- _set.from(prefix) if s.startsWith(prefix)) yield s
}

object Main
{
    def main(args: Array[String]): Unit =
    {
        val pm = new PrefixMatcher()

        pm.add("fooBar")
        pm.add("fooCow")
        pm.add("barFoo")

        pm.findMatches("foo").foreach(println)
    }
}

Apologies for any bad Scala style on my part, I'm just getting used to the language myself.

5 Comments

@pyo: It looks like your for loop loops through all strings in the set, is that right? In the java code I call tailSet, which does something like: return an iterator to the nodes that start with the prefix specified in order, allowing us to find the matching items much faster than comparing every item in the set.
@Alex Black: indeed, I missed out a call to .from in the sequence comprehension, edited in now.
@pyo: one more question, will your implementation break when s.startsWith(prefix) returns false as the one I posted does?
Nope, it will test the startsWith on every String 'from' prefix to the end of the set
Truth be told I was unaware of the significance of the term, I merely kept it in from the original question. I shall duly edit it now ;)
2

I blogged about finding matches for a combination of prefixes a while ago. It's a harder problem, as you don't know when one prefix ends and the other begins. It might interest you. I'll even post below the code that I did not blog (yet, hopefully :), though it is stripped of all comments, none of which were made in English:

package com.blogspot.dcsobral.matcher.DFA

object DFA {
  type Matched = List[(String, String)]
  def words(s : String) = s.split("\\W").filter(! _.isEmpty).toList
}

import DFA._
import scala.runtime.RichString

class DFA {
  private val initialState : State = new State(None, "")
  private var currState : State = initialState
  private var _input : RichString = ""
  private var _badInput : RichString = ""
  private var _accepted : Boolean = true

  def accepted : Boolean = _accepted
  def input : String = _input.reverse + _badInput.reverse 

  def transition(c : Char) : List[(String, Matched)] = {
    if (c == '\b') backtrack
    else {
      if (accepted) {
        val newState = currState(c)
        newState match {
          case Some(s) => _input = c + _input; currState = s
          case None => _badInput = c + _badInput; _accepted = false
        }
      } else {
        _badInput = c + _badInput
      }
      optionList
    }
  }

  def transition(s : String) : List[(String, Matched)] = {
    s foreach (c => transition(c))
    optionList
  }

  def apply(c : Char) : List[(String, Matched)] = transition(c)
  def apply(s : String) : List[(String,Matched)] = transition(s)

  def backtrack : List[(String, Matched)] = {
    if(_badInput isEmpty) {
      _input = _input drop 1
      currState.backtrack match {
        case Some(s) => currState = s
        case None =>
      }
    } else {
      _badInput = _badInput drop 1
      if (_badInput isEmpty) _accepted = true
    }
    optionList
  }

  def optionList : List[(String, Matched)] = if (accepted) currState.optionList else Nil

  def possibleTransitions : Set[Char] = if (accepted) (currState possibleTransitions) else Set.empty

  def reset : Unit = {
    currState = initialState
    _input = ""
    _badInput = ""
    _accepted = true
  }

  def addOption(s : String) : Unit = {
    initialState addOption s
    val saveInput = input
    reset
    transition(saveInput)
  }
  def removeOption(s : String) : Unit = {
    initialState removeOption s
    val saveInput = input
    reset
    transition(saveInput)
  }
}

class State (val backtrack : Option[State],
             val input : String) {
  private var _options : List[PossibleMatch] = Nil
  private val transitions : scala.collection.mutable.Map[Char, State] = scala.collection.mutable.Map.empty
  private var _possibleTransitions : Set[Char] = Set.empty

  private def computePossibleTransitions = {
    if (! options.isEmpty)
      _possibleTransitions = options map (_.possibleTransitions) reduceLeft (_++_)
    else
      _possibleTransitions = Set.empty
  }

  private def computeTransition(c : Char) : State = {
    val newState = new State(Some(this), input + c)
    options foreach (o => if (o.possibleTransitions contains c) (o computeTransition (newState, c)))
    newState
  }

  def options : List[PossibleMatch] = _options
  def optionList : List[(String, Matched)] = options map (pm => (pm.option, pm.bestMatch))
  def possibleTransitions : Set[Char] = _possibleTransitions

  def transition(c : Char) : Option[State] = {
    val t = c.toLowerCase
    if (possibleTransitions contains t) Some(transitions getOrElseUpdate (t, computeTransition(t))) else None
  }

  def apply(c : Char) : Option[State] = transition(c)

  def addOption(option : String) : Unit = {
    val w = words(option)
    addOption(option, w.size, List(("", w.head)), w)
  }

  def addOption(option : String, priority : Int, matched : Matched, remaining : List[String]) : Unit = {
    options find (_.option == option) match {
      case Some(pM) => 
        if (!pM.hasMatchOption(matched)) {
          pM.addMatchOption(priority, matched, remaining)
          if (priority < pM.priority) {
            val (before, _ :: after) = options span (_ != pM)
            val (highPriority, lowPriority) = before span (p => p.priority < priority || 
              (p.priority == priority && p.option < option))
            _options = highPriority ::: (pM :: lowPriority) ::: after
          }
          transitions foreach (t => pM computeTransition (t._2, t._1))
        }
      case None => 
        val (highPriority, lowPriority) = options span (p => p.priority < priority || 
              (p.priority == priority && p.option < option))
        val newPM = new PossibleMatch(option, priority, matched, remaining)
        _options = highPriority ::: (newPM :: lowPriority)
        transitions foreach (t => newPM computeTransition (t._2, t._1))
    }
    computePossibleTransitions
  }

  def removeOption(option : String) : Unit = {
    options find (_.option == option) match {
      case Some(possibleMatch) =>
        val (before, _ :: after) = options span (_ != possibleMatch)
        (possibleMatch.possibleTransitions ** Set(transitions.keys.toList : _*)).toList foreach (t => 
          transition(t).get.removeOption(option))
        _options = before ::: after
        computePossibleTransitions
      case None =>
    }
  }
}

class PossibleMatch (val option : String, 
                     thisPriority : Int, 
                     matched : Matched, 
                     remaining : List[String]) {
  private var _priority = thisPriority
  private var matchOptions = List(new MatchOption(priority, matched, remaining))
  private var _possibleTransitions = matchOptions map (_.possibleTransitions) reduceLeft (_++_)
  private def computePossibleTransitions = {
    _possibleTransitions = matchOptions map (_.possibleTransitions) reduceLeft (_++_)
  }

  def priority : Int = _priority
  def hasMatchOption(matched : Matched) : Boolean = matchOptions exists (_.matched == matched)
  def addMatchOption(priority : Int, matched : Matched, remaining : List[String]) : Unit = {
    if (priority < _priority) _priority = priority
    val (highPriority, lowPriority) = matchOptions span (_.priority < priority)
    val newMO = new MatchOption(priority, matched, remaining)
    matchOptions = highPriority ::: (newMO :: lowPriority)
    computePossibleTransitions
  }
  def bestMatch : Matched = matchOptions.head.matched.reverse.map(p => (p._1.reverse.toString, p._2)) ::: 
    remaining.tail.map(w => ("", w))
  def possibleTransitions : Set[Char] = _possibleTransitions

  def computeTransition(s: State, c : Char) : Unit = {
    def computeOptions(state : State,
                       c : Char, 
                       priority : Int, 
                       matched : Matched, 
                       remaining : List[String]) : Unit = {
      remaining match {
        case w :: ws => 
          if (!w.isEmpty && w(0).toLowerCase == c.toLowerCase) {
            val newMatched = (w(0) + matched.head._1, matched.head._2.substring(1)) :: matched.tail
            val newPriority = if (matched.head._1 isEmpty) (priority - 1) else priority

            if (w.drop(1) isEmpty)
              s.addOption(option, newPriority - 1, ("", ws.head) :: newMatched , ws)
            else
              s.addOption(option, newPriority, newMatched, w.substring(1) :: ws)
          }
          if (ws != Nil) computeOptions(s, c, priority, ("", ws.head) :: matched, ws)
        case Nil =>
      }
    }

    if(possibleTransitions contains c)
      matchOptions foreach (mO => computeOptions(s, c, mO.priority, mO.matched, mO.remaining))
  }
}

class MatchOption (val priority : Int,
                   val matched : Matched,
                   val remaining : List[String]) {
  lazy val possibleTransitions : Set[Char] = Set( remaining map (_(0) toLowerCase) : _* )
}

It really needs some refactoring, though. I always do it when I'm start to explain it for the blog.

2 Comments

Interesting stuff, I don't need to match on combinations of prefixes at the moment, I'll keep it in mind.
Be grateful! I don't allow code that ugly on other's sight that easily. ;-)
2

Ok, I just realized what you want is pretty much what a friend of mine suggested for another problem. So, here is his answer, simplified for your needs.

class PrefixMatcher {
  // import scala.collection.Set // Scala 2.7 needs this -- and returns a gimped Set
  private var set = new scala.collection.immutable.TreeSet[String]()
  private def succ(s : String) = s.take(s.length - 1) + ((s.charAt(s.length - 1) + 1)).toChar

  def add(s: String) = set += s

  def findMatches(prefix: String): Set[String] = 
    if (prefix.isEmpty) set else set.range(prefix, succ(prefix))
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.