My sbt project has the following dependency configuration:
val scyllaConnector = "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0"
val sparkHadoopCloud = "org.apache.spark" %% "spark-hadoop-cloud" % "3.3.0"
val sparkSql = "org.apache.spark" %% "spark-sql" % "3.3.0"
val sparkSqlKafka = "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.3.0"
val customService = // group, name and version of custom service
The customService includes a critical dependency:
val kubernetesClient = "io.kubernetes" % "client-java" % "19.0.1"
which requires Gson 2.10.1 specifically.
When running in Spark cluster mode, Gson 2.2.4 from Spark's jars is being used instead. I've confirmed this via:
Runtime path checking:
val path = classOf[Gson].getProtectionDomain.getCodeSource.getLocation.getPath
log.info("Gson loaded from: " + path)
returns:
Gson loaded from: /opt/app-root/spark/jars/gson-2.2.4.jar
Error stack traces:
The exceptions reference line numbers and method internals that exist in the 2.2.4 implementation of the Gson methods but not in 2.10.1.
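As a complementary runtime check to the path logging above, the package metadata can also be inspected (a sketch; it may print nothing if the jar manifest carries no version info):
// Optional cross-check: read the Gson version from the jar manifest, if present
val gsonVersion = Option(classOf[Gson].getPackage)
  .flatMap(p => Option(p.getImplementationVersion))
log.info("Gson Implementation-Version: " + gsonVersion.getOrElse("<not in manifest>"))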
So the runtime dependency tree for Gson is presumably something like this:
|- spark-sql
|    |- gson-2.2.4
|
|- customService
|    |- client-java
|         |- gson-2.10.1
(Interestingly, this does not match the dependencyTreeBrowser output, in which all of the dependencies mentioned above resolve Gson to 2.10.1.)
As I understand it, Spark's classloaders replace whatever Gson version the project brings with Spark's own 2.2.4.
I tried to work around it in two ways:
Excluding Gson from all project dependencies and then adding the target version explicitly:
val gson = "com.google.code.gson" % "gson" % "2.10.1" Seq( scyllaConnector, sparkHadoopCloud, sparkSql, sparkSqlKafka, customService ) .map(_=>ExclusionRule(organization = "com.google.code.gson", name = "gson")) ++ Seq(gson)Shade gson artifacts with target version:
Shading the Gson artifacts to the target version:

lazy val shadingSettings = Seq(
  assembly / assemblyShadeRules := Seq(
    ShadeRule
      .rename("com.google.gson.**" -> "my_shaded.gson.@1")
      .inLibrary("com.google.code.gson" % "gson" % "2.10.1")
      .inAll,
    // Delete non-shaded gson
    ShadeRule.zap("com.google.gson.**").inAll
  ),
  // Merge conflicting versions
  assembly / assemblyMergeStrategy := {
    case PathList("META-INF", "services", _*) => MergeStrategy.concat
    case PathList("META-INF", _*)             => MergeStrategy.discard
    case _                                    => MergeStrategy.first
  }
)
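For clarity, the intent of the first snippet is to strip Gson from every module and then pin 2.10.1 explicitly; written per module with excludeAll it would look roughly like this (a sketch of the intent, reusing the module vals from above, not the exact code currently in the build):

// Sketch: exclude Gson from each module, then add the pinned version back
val gson = "com.google.code.gson" % "gson" % "2.10.1"

Seq(
  scyllaConnector,
  sparkHadoopCloud,
  sparkSql,
  sparkSqlKafka,
  customService
).map(_.excludeAll(ExclusionRule(organization = "com.google.code.gson", name = "gson"))) ++ Seq(gson)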
Neither approach works. What am I missing?
Additionally:
Full build.sbt
import com.eed3si9n.jarjarabrams.ShadeRule
import sbtassembly.AssemblyPlugin.autoImport.*
lazy val commonSettings = Seq(
tpolecatScalacOptions ++= Set(
ScalacOptions.release("11"),
ScalacOptions.warnOption("nonunit-statement")
)
)
lazy val shadingSettings = Seq(
assembly / assemblyShadeRules := Seq(
ShadeRule
.rename("com.google.gson.**" -> "my_shaded.gson.@1")
.inLibrary("com.google.code.gson" % "gson" % "2.10.1")
.inAll,
// Delete non-shaded gson
ShadeRule.zap("com.google.gson.**").inAll
),
// Merge conflicting versions
assembly / assemblyMergeStrategy := {
case PathList("META-INF", "services", _*) => MergeStrategy.concat
case PathList("META-INF", _*) => MergeStrategy.discard
case _ => MergeStrategy.first
}
)
inThisBuild(
Seq(
scalaVersion := "2.13.10",
javacOptions ++= Seq("-encoding","UTF-8","-source","11"),
Compile / compile / javacOptions ++= Seq( // need to split this 'cause javadoc :(
"-target",
"11",
"-Xlint:deprecation", // Print deprecation warning details.
),
autoAPIMappings := true, // will use external ScalaDoc links for managed dependencies
updateOptions := updateOptions.value.withCachedResolution(true),
addCompilerPlugin("com.olegpy" %% "better-monadic-for" % "0.3.1")
)
)
lazy val root =
Project(id = "project", base = file("."))
.enablePlugins(GitBranchPrompt, AssemblyPlugin)
.aggregate(
sparkJob
)
.settings(
commonSettings,
shadingSettings,
crossScalaVersions := Nil,
cleanFiles += baseDirectory.value / "dist"
)
lazy val sparkJob = Project(id = "events_loader", base = file("events_loader"))
.configs(IntegrationTest)
.settings(
commonSettings,
shadingSettings,
inConfig(IntegrationTest)(Defaults.testSettings),
libraryDependencies ++= Dependencies.sparkJob.value
)
Dependencies.scala
object Dependencies {
object Version {
val ScyllaConnector = "3.2.0"
val Spark = "3.3.0"
val Gson = "2.10.1"
val HadoopAws = "3.3.2"
}
object Libraries {
val gson = "com.google.code.gson" % "gson" % Version.Gson
val scyllaConnector = "com.datastax.spark" %% "spark-cassandra-connector" % Version.ScyllaConnector
val sparkHadoopCloud =
("org.apache.spark" %% "spark-hadoop-cloud" % Version.Spark).exclude("org.apache.hadoop", "hadoop-aws")
val sparkHadoopAws = "org.apache.hadoop" % "hadoop-aws" % Version.HadoopAws % Provided
val sparkSql = "org.apache.spark" %% "spark-sql" % Version.Spark % Provided
val sparkSqlKafka = "org.apache.spark" %% "spark-sql-kafka-0-10" % Version.Spark
val customService = // group, name and version of custom service
}
val sparkJob = Def.setting {
import Libraries.*
Seq(
scyllaConnector,
sparkHadoopCloud,
sparkSql,
sparkSqlKafka,
sparkHadoopAws,
).map(_=>ExclusionRule(organization = "com.google.code.gson", name = "gson")) ++ Seq(gson)
}
}
Updated:
I also tried to set the classloader priority with userClassPathFirst, as follows:
val conf = new SparkConf()
.setAppName(sparkConfig.appName)
.set("spark.driver.userClassPathFirst", "true")
.set("spark.executor.userClassPathFirst", "true")
But I got an error:
org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config: spark.driver.userClassPathFirst. See also 'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'
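If I read the error correctly, these configs cannot be changed once the session exists, so presumably they would have to be supplied at submit time instead, something like this (a sketch; the main class and jar names below are placeholders):

# main class and assembly jar names are placeholders
spark-submit \
  --deploy-mode cluster \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.EventsLoader \
  events_loader-assembly.jar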