1

I want to call the main method of the FAT jar from Pyspark.

Here is the entrypoint of the main method of the jar (Scala):

object Main {
  def main(args: Array[String]): Unit = {
    // codes
  }
}

In order to invoke above method, I need to create an Array[String] using pyspark's py4j:

str_array = sc._jvm.java.lang.reflect.Array.newInstance(sc._jvm.java.lang.String, 3)
str_array[0] = "228"

loaded_class = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("com.mycompany.Main")
loaded_class.main(str_array)

And this is the error I get:

Py4JError: java.lang.String._get_object_id does not exist in the JVM

With plain Py4j, I could have created string array using:

from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
gateway.new_array(gateway.jvm.java.lang.String, 4)

I tried to pass object array to main, but that didn't work:

ob = sc._jvm.java.lang.Object()
ob_array = sc._jvm.java.lang.reflect.Array.newInstance(ob.getClass(), 3)
ob_array[0] = "228"
loaded_class = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("com.mycompany.Main")
loaded_class.main(ob_array)

fails with error:

Py4JError: An error occurred while calling o516.main. Trace:
py4j.Py4JException: Method main([class [Ljava.lang.Object;]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

How do I create the string array to invoke main method in PySpark ?

2 Answers 2

2

To convert some python list of strings (arr) to java string array in pyspark you can use the function toJStringArray bellow:

def toJStringArray(arr):
    jarr = sc._gateway.new_array(sc._jvm.java.lang.String, len(arr))
    for i in range(len(arr)):
        jarr[i] = arr[i]
    return jarr

# Usage example:
java_arr = toJStringArray(['string1'])

I run this code in the azure databricks. So the sc is a predefined global for the spark context.

Sign up to request clarification or add additional context in comments.

Comments

1

Similar to the plain Py4j, can you try creating it using :

str_array = sc._jvm._gateway.new_array(sc._jvm.java.lang.String, 4)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.