I'd like to build a structure that links a regex pattern to a description of a feature within some text.

Example: "^.* horses .*$" maps to 'horses', "^.* pigs .*$" maps to 'pigs', and so on.
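
To make the intended structure concrete, here is a minimal sketch in plain Scala with no Spark involved. It assumes the two example patterns were meant to be "^.* horses .*$" and "^.* pigs .*$". The names FeatureRule, rules, and describe are hypothetical and chosen only for illustration.

```scala
import java.util.regex.Pattern

object FeatureLookupSketch {
  // Hypothetical type: pairs a description with its pre-compiled pattern.
  final case class FeatureRule(description: String, pattern: Pattern)

  // Compile each pattern once, up front, so lookups do not recompile.
  val rules: Seq[FeatureRule] = Seq(
    "horses" -> "^.* horses .*$",
    "pigs"   -> "^.* pigs .*$"
  ).map { case (description, regex) =>
    FeatureRule(description, Pattern.compile(regex))
  }

  // Return the description of every rule whose pattern matches the whole text.
  def describe(text: String): Seq[String] =
    rules.collect {
      case rule if rule.pattern.matcher(text).matches() => rule.description
    }

  def main(args: Array[String]): Unit = {
    println(describe("three horses ran"))  // prints List(horses)
    println(describe("no animals here"))   // prints List()
  }
}
```

With thousands of rules this is still a linear scan per input string; the gain comes only from compiling each pattern once.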

There are thousands of possible descriptions for this text, so pairing each compiled regex pattern with its description would let me search efficiently. Below is the key part of my code:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{Encoder, Encoders}
import scala.collection.JavaConverters._
import scala.util.matching.Regex

object GlueApp {
    /* RegexMetadata -- pairs a name with a compiled regex */
    case class RegexMetadata(regexName: String, pattern: scala.util.matching.Regex)

    def main(sysArgs: Array[String]): Unit = {
        val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
        val sc: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(sc)
        val spark = glueContext.getSparkSession
        import spark.implicits._
        Job.init(args("JOB_NAME"), glueContext, args.asJava)

        // Kryo encoder for the Regex type on its own
        implicit val regexEncoder = Encoders.kryo[scala.util.matching.Regex]
        // The job fails when this line is present
        implicit val regexMetadataEncoder = Encoders.product[RegexMetadata]
        Job.commit()
    }
}

When I run this, I get the following error: java.lang.UnsupportedOperationException: No Encoder found for scala.util.matching.Regex

It compiles and runs fine when I don't have the "implicit val regexMetadataEncoder" line. This seems to work on Databricks, but not on AWS Glue.

Searching turned up these similar questions, but I couldn't solve my problem with them:

scala generic encoder for spark case class

Spark 2.x scala 2.1.1 custom encoder class type mismatch

Thank you for your help!

1 Answer

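Got it working. The problem was in how I declared my encoders. The fix was to use a Kryo encoder for the whole RegexMetadata case class, rather than only for the regex field, and to use a product encoder only for RegexConfig, whose fields are plain strings. Below is the key section of my working code: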

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import scala.collection.JavaConverters._
import scala.collection.mutable.WrappedArray
import java.util.regex.Pattern

object GlueApp {
    /* RegexConfig -- maps a regex pattern string to a value */
    case class RegexConfig(value: String, patternRegex: String)
    /* RegexMetadata -- maps a compiled regex pattern to a regex config */
    case class RegexMetadata(config: RegexConfig, pattern: java.util.regex.Pattern)

    def main(sysArgs: Array[String]): Unit = {
        val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
        val sc: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(sc)
        val spark = glueContext.getSparkSession
        import spark.implicits._
        Job.init(args("JOB_NAME"), glueContext, args.asJava)

        // Kryo-encode the whole case class, including its Pattern field
        implicit val regexMetadataEncoder = Encoders.kryo[RegexMetadata]
        // RegexConfig holds only strings, so a product encoder works
        val regexEncoder = Encoders.product[RegexConfig]

        // << read file w/ regex patterns and put into regexConfigArray >>

        // Compile each pattern once and keep it next to its config
        val regexLocal = for (config <- regexConfigArray) yield
            RegexMetadata(config, Pattern.compile(config.patternRegex,
                               Pattern.CASE_INSENSITIVE))
        Job.commit()
    }
}
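
As a rough illustration of what the Kryo encoder implies, the sketch below builds a Dataset of RegexMetadata on a local SparkSession instead of Glue. It is not part of the job above: the object name KryoEncoderSketch, the helper buildDataset, and the two sample configs are assumptions made for the example, and the encoder is passed explicitly so the sketch does not depend on implicit resolution.

```scala
import java.util.regex.Pattern
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

object KryoEncoderSketch {
  case class RegexConfig(value: String, patternRegex: String)
  case class RegexMetadata(config: RegexConfig, pattern: Pattern)

  // Kryo serializes each RegexMetadata object as one opaque binary blob.
  val metadataEncoder: Encoder[RegexMetadata] = Encoders.kryo[RegexMetadata]

  // Hypothetical helper: compiles sample patterns and wraps them in a Dataset.
  def buildDataset(spark: SparkSession): Dataset[RegexMetadata] = {
    val configs = Seq(
      RegexConfig("horses", "^.* horses .*$"),
      RegexConfig("pigs", "^.* pigs .*$")
    )
    val metadata = configs.map { c =>
      RegexMetadata(c, Pattern.compile(c.patternRegex, Pattern.CASE_INSENSITIVE))
    }
    spark.createDataset(metadata)(metadataEncoder)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a Glue job would use glueContext.getSparkSession.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kryo-encoder-sketch")
      .getOrCreate()
    val ds = buildDataset(spark)
    // The schema is a single binary column named "value", not config/pattern columns.
    ds.printSchema()
    spark.stop()
  }
}
```

The trade-off is that a Kryo-encoded Dataset exposes no named columns, so column-based operations such as select or filter on config.value are not available on it; typed operations like map still work if given an encoder for their result type.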