I am trying to read a MongoDB collection with 234 million records into Spark, and I need only one field from each document.
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

case class Linkedin_Profile(experience: Array[Experience])
case class Experience(company: String)

val rdd = MongoSpark.load(sc, ReadConfig(Map("uri" -> mongo_uri_linkedin)))
val company_DS = rdd.toDS[Linkedin_Profile]()
// Count how often each company appears, ignoring null entries
val count_udf = udf((x: Seq[String]) => x.filter(_ != null).groupBy(identity).mapValues(_.size))
val company_ColCount = company_DS.select(explode(count_udf($"experience.company")))
company_ColCount.rdd.saveAsTextFile("/dbfs/FileStore/chandan/intermediate_count_results.csv")
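Since I only need the one field, I assume the read itself could be narrowed with an aggregation pipeline so that MongoDB returns just experience.company. A rough sketch of what I mean, assuming the connector's withPipeline method and org.bson.Document.parse (not verified on this cluster):

import org.bson.Document
// Sketch only: push a $project stage down to MongoDB so that only
// experience.company travels over the cursor (assumes MongoRDD.withPipeline).
val projectedRdd = MongoSpark
  .load(sc, ReadConfig(Map("uri" -> mongo_uri_linkedin)))
  .withPipeline(Seq(Document.parse("""{ "$project": { "experience.company": 1 } }""")))
val projected_DS = projectedRdd.toDS[Linkedin_Profile]()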
The job runs for about an hour and roughly half of the tasks complete, but then it fails with the following error:
com.mongodb.MongoCursorNotFoundException:
Query failed with error code -5 and error message
'Cursor 8962537864706894243 not found on server cluster0-shard-01-00-i7t2t.mongodb.net:37017'
on server cluster0-shard-01-00-i7t2t.mongodb.net:37017
I tried changing the configuration as shown below, but to no avail.
System.setProperty("spark.mongodb.keep_alive_ms", "7200000")
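For context, this is how I understand additional read options would be passed through ReadConfig, e.g. a smaller partition size so that each cursor is drained sooner. The option names are my reading of the connector documentation, so treat this as a sketch rather than something I know fixes the timeout:

// Sketch only: "partitioner" and "partitionerOptions.partitionSizeMB" are my
// reading of the connector's ReadConfig keys; smaller partitions should mean
// each cursor is consumed faster, but this is unverified.
val smallPartitionConf = ReadConfig(Map(
  "uri" -> mongo_uri_linkedin,
  "partitioner" -> "MongoSamplePartitioner",
  "partitionerOptions.partitionSizeMB" -> "32"
))
val rdd_small = MongoSpark.load(sc, smallPartitionConf)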
Please suggest how to read this large collection.