
I am trying to connect to MongoDB Atlas from PySpark, and I am running into the following problem:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder \
        .config("spark.mongodb.input.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
        .config("spark.mongodb.output.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
        .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

The error that this code returns is:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-3-346df2de8d22> in <module>()
----> 1 df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
    170             return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    171         else:
--> 172             return self._df(self._jreader.load())
    173 
    174     @since(1.4)

c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o34.load.
: java.lang.NoClassDefFoundError: com/mongodb/client/model/Collation
    at com.mongodb.spark.config.ReadConfig$.<init>(ReadConfig.scala:50)
    at com.mongodb.spark.config.ReadConfig$.<clinit>(ReadConfig.scala)
    at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:67)
    at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.mongodb.client.model.Collation

How can I solve this problem?

Is it a problem with the code or with the references?

When launching PySpark, I have this configuration:

./bin/pyspark --conf "spark.mongodb.input.uri=mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.1.3

The Spark version is 2.3.1 and the Scala version is 2.11.8.
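
For reference, both versions can be confirmed from inside a PySpark session, which matters because the `_2.11` suffix of the connector coordinate has to match the Scala version Spark was built with. A minimal check (going through `_jvm` relies on py4j internals, not a public API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark version, e.g. "2.3.1"
print(spark.version)

# Scala version Spark was compiled against, e.g. "version 2.11.8";
# _jvm is a py4j internal, so treat this as a debugging trick only.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())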

1 Answer


The problem was missing JAR files in the $SPARK_HOME/jars folder (see the sketch after the list for an alternative that avoids copying them by hand):

  • org.mongodb:mongodb-driver:3.8.1
  • org.mongodb:mongodb-driver-core:3.8.1
  • org.mongodb:bson:3.8.1
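
As an alternative to copying the driver JARs into $SPARK_HOME/jars by hand, Spark can resolve the connector and its transitive driver dependencies (mongodb-driver, mongodb-driver-core, bson) from Maven through the spark.jars.packages option. A minimal sketch, assuming a connector release that matches the question's Spark 2.3.x / Scala 2.11 build:

from pyspark.sql import SparkSession

# spark.jars.packages only takes effect when the session is first created,
# so set it before any SparkSession exists in the process.
spark = SparkSession.builder \
        .config("spark.jars.packages",
                "org.mongodb.spark:mongo-spark-connector_2.11:2.3.1") \
        .config("spark.mongodb.input.uri",
                "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS") \
        .config("spark.mongodb.output.uri",
                "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS") \
        .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

Because Ivy pulls the transitive dependencies automatically, this avoids the missing com.mongodb.client.model.Collation class without placing any JARs manually.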

6 Comments

Can you share how you wrote the references?
You have to add those JARs to your spark/jars/ folder and it'll work.
This helped me a lot! If anyone is facing 'ClassNotFound' or 'NoSuchMethod' errors while using PySpark, do this. If you have any other version of the above files, try using those versions instead.
Thanks a lot, but how did you find out that these were the missing JAR files? Where and how did you get the hint to search for them? It could be helpful for the future.
It's strange that Mongo doesn't ship these dependencies together with the connector jar.
