
I imported data with Sqoop as a sequence file, and I am loading that data in spark-shell. The code Sqoop generated references classes in the com.cloudera.sqoop.lib package, so running the command below in spark-shell produces the following warnings and error:

  val ordersRDD = sc.sequenceFile("/user/pawinder/problem1-seq/orders",classOf[org.apache.hadoop.io.IntWritable],classOf[com.problem1.retaildb.orders])
    warning: Class com.cloudera.sqoop.lib.SqoopRecord not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.LargeObjectLoader not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.LargeObjectLoader not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.DelimiterSet not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.DelimiterSet not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.DelimiterSet not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.RecordParser not found - continuing with a stub.
    error: Class com.cloudera.sqoop.lib.SqoopRecord not found - continuing with a stub.

Can I instruct Sqoop to generate the code without a dependency on the Cloudera package? Do I need to add the jar containing the com.cloudera.sqoop.lib package when starting spark-shell? If so, where can I find that jar? Or should I write the value class myself so that it has no dependency on com.cloudera.sqoop.lib?

I am using the Cloudera QuickStart VM. Many thanks for your help.

EDIT: The issue is resolved by adding sqoop-1.4.6.2.6.5.0-292.jar when launching spark-shell:

 spark-shell --jars problem1/bin/orders.jar,/usr/hdp/2.6.5.0-292/sqoop/sqoop-1.4.6.2.6.5.0-292.jar

I had first tried to resolve this by defining a case class for Orders, but that did not work: the underlying Hadoop job still referenced the com.cloudera.sqoop package classes.

scala> case class Orders(order_id:Int,order_date:java.sql.Timestamp,customer_id:Int,status:String)
defined class Orders
scala> val ordersRDD = sc.sequenceFile("/user/pawinder/problem1-seq/orders",classOf[org.apache.hadoop.io.IntWritable],classOf[Orders])
 ordersRDD: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.IntWritable, Orders)] = /user/pawinder/problem1-seq/orders HadoopRDD[0] at sequenceFile at <console>:26

scala> ordersRDD.count
    19/05/14 14:29:21 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
    java.lang.NoClassDefFoundError: com/cloudera/sqoop/lib/SqoopRecord
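Once the Sqoop jar is on the classpath, one way to get rid of the generated class early is to convert each value to a plain case class right after reading. This is only a sketch: it assumes the Sqoop-generated orders record's toString renders the four fields comma-delimited in schema order (Sqoop's default delimiter set), and the Orders case class and parseOrder helper are hypothetical names, not part of the generated code.

```scala
import java.sql.Timestamp

// Hypothetical case class mirroring the orders schema from the question.
case class Orders(orderId: Int, orderDate: Timestamp, customerId: Int, status: String)

// Parse one record's delimited string form into the case class.
// Assumes Sqoop's default comma delimiter and no embedded commas in fields.
def parseOrder(line: String): Orders = {
  val Array(id, date, cust, status) = line.split(",", -1)
  Orders(id.toInt, Timestamp.valueOf(date), cust.toInt, status)
}

// In spark-shell this would be applied as (not runnable outside Spark):
// val orders = ordersRDD.map { case (_, rec) => parseOrder(rec.toString) }
```

After this map, the RDD holds only plain case-class instances, so later stages no longer need the Sqoop-generated class or its com.cloudera.sqoop.lib dependencies on the executors.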
  • Can you tell me how to do the same in pyspark? Commented Jan 29, 2020 at 14:20
  • Check this link to add jar files to pyspark. stackoverflow.com/questions/27698111/… Commented Jan 31, 2020 at 14:29
  • Even though I am adding the jar files, when I enter the class name in the value position (i.e. while reading the sequence file), pyspark still throws an error that the class name is not defined. Commented Jan 31, 2020 at 15:01
