I just started playing with parser combinators in Scala, but got stuck writing a parser for sentences such as "I like Scala." (a word ends at whitespace or at a period).

I started with the following implementation:

package example

import scala.util.parsing.combinator._

object Example extends RegexParsers {
  override def skipWhitespace = false

  def character: Parser[String] = """\w""".r

  def word: Parser[String] =
    rep(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))

  def sentence: Parser[List[String]] = rep(word) <~ "."
}

object Test extends App {
  val result = Example.parseAll(Example.sentence, "I like Scala.")

  println(result)
}

The idea behind using guard() is to let a period mark the end of a word without consuming it, so that the sentence parser can consume it instead. However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parsers).

If I change the word and sentence definitions as follows, it parses the sentence, but the grammar description doesn't look right and will not work if I try to add a parser for paragraphs (rep(sentence)) etc.

def word: Parser[String] =
  rep(character) <~ (whiteSpace | literal(".")) ^^ (_.mkString(""))

def sentence: Parser[List[String]] = rep(word) <~ opt(".")

Any ideas what may be going on here?

1 Answer

However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser).

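The rep combinator corresponds to * in Perl-style regex notation: it matches zero or more repetitions, so rep(character) can succeed without consuming any input. When the parser reaches the period, rep(character) matches nothing and guard(literal(".")) succeeds without consuming the period, so word succeeds with an empty string and rep(word) keeps applying it at the same position forever. I think you want a word to contain one or more characters. Changing rep(character) to rep1(character) (corresponding to + in Perl-style regex notation) should fix the problem.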
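A minimal sketch of that change applied to the grammar from the question, assuming the scala-parser-combinators library is on the classpath. The names FixedExample and FixedTest are made up for this illustration; only rep(character) has been replaced with rep1(character).

```scala
import scala.util.parsing.combinator._

// Hypothetical name; same grammar as in the question, with rep1 in word.
object FixedExample extends RegexParsers {
  override def skipWhitespace = false

  def character: Parser[String] = """\w""".r

  // rep1 needs at least one character, so word fails at "."
  // instead of succeeding with an empty string.
  def word: Parser[String] =
    rep1(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))

  // rep(word) stops once word fails, and the "." is then consumed here.
  def sentence: Parser[List[String]] = rep(word) <~ "."
}

object FixedTest extends App {
  println(FixedExample.parseAll(FixedExample.sentence, "I like Scala."))
}
```

With this version the parse should terminate and yield the words I, like and Scala, because word can no longer succeed without consuming input.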

However, your definition still seems a little verbose to me. Why are you parsing individual characters instead of just using \w+ as the pattern for a word? Here's how I'd write it:

object Example extends RegexParsers {
  override def skipWhitespace = false

  def word: Parser[String] = """\w+""".r

  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."
}

Notice that I use rep1sep to parse a non-empty list of words separated by whitespace. There's a repsep combinator as well, but I think you'd want at least one word per sentence.
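Since the question mentions adding a paragraph parser later, here is a rough sketch of how the same rep1sep idea might extend to that case. The paragraph rule and the names ParagraphExample and ParagraphTest are assumptions made for illustration, not part of the original code; the sketch also assumes sentences are separated by whitespace.

```scala
import scala.util.parsing.combinator._

// Hypothetical name; word and sentence are taken from the grammar above.
object ParagraphExample extends RegexParsers {
  override def skipWhitespace = false

  def word: Parser[String] = """\w+""".r

  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."

  // Assumed extension: one or more sentences separated by whitespace.
  def paragraph: Parser[List[List[String]]] = rep1sep(sentence, whiteSpace)
}

object ParagraphTest extends App {
  val input = "I like Scala. It is fun."
  ParagraphExample.parseAll(ParagraphExample.paragraph, input) match {
    case ParagraphExample.Success(sentences, _) => println(sentences)
    case failure: ParagraphExample.NoSuccess    => println(s"Parse failed: ${failure.msg}")
  }
}
```

Because each sentence consumes its own period and whitespace is only ever a separator, no guard() is needed in this layout.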

1 Comment

Thanks. As for simplifying word, you are right that your solution makes more sense for this example. The original problem I was trying to solve has a more complex domain, where the equivalent of character is more involved and needs its own parser.
