I use Apache Commons Lang3's SerializationUtils in my code:

SerializationUtils.serialize()

to store instances of a custom class as files on disk, and

SerializationUtils.deserialize(byte[])

to restore them.
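For context, the write side presumably looks something like the sketch below; Block's fields, the Serializable declaration, and the file path are assumptions, not taken from my actual code:

    import java.io.Serializable;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.commons.lang3.SerializationUtils;

    public class Block implements Serializable {
        private static final long serialVersionUID = 1L;
        // ... fields omitted ...

        // Serialize a Block to bytes and dump the bytes to a local file.
        public static void serializeTo(Block b, String path) throws Exception {
            byte[] bytes = SerializationUtils.serialize(b);
            Files.write(Paths.get(path), bytes);
        }
    }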

In my local environment (macOS), all serialized files can be deserialized normally and no error occurs. But when I copy these serialized files into HDFS and read them back from HDFS using Spark/Scala, a SerializationException is thrown.

The Apache Commons Lang3 version is:

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.9</version>
    </dependency>

The deserialization code looks like this:

import org.apache.commons.lang3.SerializationException;
import org.apache.commons.lang3.SerializationUtils;

public static Block deserializeFrom(byte[] bytes) {
    try {
        Block b = SerializationUtils.deserialize(bytes);
        System.out.println("b=" + b);
        return b;
    } catch (ClassCastException e) {
        System.out.println("ClassCastException");
        e.printStackTrace();
    } catch (IllegalArgumentException e) {
        System.out.println("IllegalArgumentException");
        e.printStackTrace();
    } catch (SerializationException e) {
        System.out.println("SerializationException");
        e.printStackTrace();
    }
    return null;
}

The Spark code is:

val fis = spark.sparkContext.binaryFiles("/folder/abc*.file")
val RDD = fis.map(x => {
  val content = x._2.toArray()
  val b = Block.deserializeFrom(content)
  ...
})

All of the code above runs successfully in Spark local mode, but when I run it in YARN cluster mode, an error occurs. The stack trace is below:

org.apache.commons.lang3.SerializationException: java.lang.ClassNotFoundException: com.XXXX.XXXX
    at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:227)
    at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:265)
    at com.com.XXXX.XXXX.deserializeFrom(XXX.java:81)
    at com.XXX.FFFF$$anonfun$3.apply(BXXXX.scala:157)
    at com.XXX.FFFF$$anonfun$3.apply(BXXXX.scala:153)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.XXXX.XXXX
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:686)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:223)

I've checked the length of the loaded byte[]: the bytes read locally and the bytes read from HDFS have the same length. So why can't the data be deserialized when it comes from HDFS?
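(Length alone doesn't prove the bytes are identical; comparing a digest of the local file against a digest of the HDFS-read byte[] would. A quick diagnostic sketch, with a hypothetical local path:)

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class DigestCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical local path; print the same digest for the byte[]
            // obtained from binaryFiles inside the Spark map() and compare.
            byte[] local = Files.readAllBytes(Paths.get("/local/abc1.file"));
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(local);
            System.out.println(Arrays.toString(digest));
        }
    }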

  • It's not reproducible. Moreover, I strongly suspect that the object you are serializing does not properly support serialization, so its class can't be resolved. Commented Jun 26, 2019 at 3:49
  • I've serialized objects to files (stored on local disk) and deserialized them successfully. The code is the same for the local and HDFS cases, and I even get a byte[] of the same length, but the results differ. Commented Jun 26, 2019 at 4:05
  • Interesting. I have used that API many times and never faced an issue; I serialized into HBase and read it back the same way. I believe something mysterious is going on in your HDFS serialization. Commented Jun 26, 2019 at 4:13
  • Actually, the serialized files in HDFS were copied from local disk, not serialized to HDFS directly. I think the key point is the java.lang.ClassNotFoundException; maybe something is wrong with the Spark job. Commented Jun 26, 2019 at 4:19
  • Try serializing directly to HDFS; since the file system semantics are different, it may work then. Commented Jun 26, 2019 at 4:27

1 Answer


This may be a classloader issue. Suppose your application is deployed to a Java server. The server will have loaded its own classes including library code it may need, for example SerializationUtils from Apache commons-lang3. When your application is deployed to it, the server may provide it with a separate classloader which inherits from the server's classloader. Let's call the server's classloader Cl-S and the deployed application's classloader Cl-A.

At some point the application wishes to deserialize an object from a byte[]. So it uses org.apache.commons.lang3.SerializationUtils. Cl-A is asked to provide that class. The first time around Cl-A won't have it so it has to load it in. But a classloader will commonly first ask its parent for a class before trying to load it by itself. Cl-A asks Cl-S if it happens to have SerializationUtils. If it does, it returns the class. Now the application can use it.
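One quick way to check whether this delegation is in play is a diagnostic sketch like the one below; it assumes your application's Block class is on the classpath:

    import org.apache.commons.lang3.SerializationUtils;

    public class ClassLoaderCheck {
        public static void main(String[] args) {
            // If these print different classloaders, SerializationUtils was
            // loaded by a parent classloader (Cl-S) that cannot see
            // application classes such as Block.
            System.out.println("SerializationUtils loaded by: "
                    + SerializationUtils.class.getClassLoader());
            System.out.println("Block loaded by: "
                    + Block.class.getClassLoader());
        }
    }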

Things go wrong when you then perform the deserialization. The deserialize method is generic. This line

Block b = SerializationUtils.deserialize(bytes);

has its type, Block, inferred. The method will internally try to cast the deserialized Object to Block. But of course, to do so it must know the class Block. When executing the method, Java goes looking for that class, and for this it queries the classloader that loaded SerializationUtils. That is Cl-S, the server's classloader, which has no knowledge of your application's Block class, so you get a ClassNotFoundException.
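If that is what's happening, one common workaround is to skip SerializationUtils and deserialize with an ObjectInputStream that resolves classes against the thread context classloader, which on Spark executors knows about the application jar's classes. A sketch using only the standard java.io API (this is not part of the commons-lang3 API):

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectStreamClass;

    public final class ContextClassLoaderDeserializer {

        public static Object deserialize(byte[] bytes)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes)) {
                @Override
                protected Class<?> resolveClass(ObjectStreamClass desc)
                        throws IOException, ClassNotFoundException {
                    ClassLoader ctx = Thread.currentThread().getContextClassLoader();
                    try {
                        // Resolve against the context classloader first.
                        return Class.forName(desc.getName(), false, ctx);
                    } catch (ClassNotFoundException e) {
                        // Fall back to default resolution (primitives etc.).
                        return super.resolveClass(desc);
                    }
                }
            }) {
                return in.readObject();
            }
        }
    }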

The classloader assigned to the application has access to your application's classes and to its parent classloader's classes. The server's classloader can't go in the other direction; it can't get classes from your application. Application servers, such as Java EE ones (WildFly, GlassFish, etc.), typically use this to let multiple applications run in the same server while remaining separated, or to implement a module system so certain modules can be shared across applications to reduce their size and memory footprint.

Serializing and deserializing objects in Java is simple. Just do it yourself, or write a couple of methods for it, rather than pulling in a library that opens you up to opaque issues like this one, version conflicts, and bloat.
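In plain JDK terms that might look like the following minimal sketch:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    public final class PlainSerialization {

        // Roughly what SerializationUtils.serialize does, minus the dependency.
        public static byte[] serialize(Object obj) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(obj);
            }
            return buf.toByteArray();
        }

        // Roughly what SerializationUtils.deserialize does; the caller casts.
        public static Object deserialize(byte[] bytes)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return in.readObject();
            }
        }
    }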
