I get the error below both in a Jupyter notebook running Python 3.6.5 and in a Python shell running 3.7.2, on Windows 10. I installed PySpark with pip install pyspark in both environments. Both use Spark 2.4.0, and my Java is Oracle JDK 8 (jdk1.8.0_201). This is the code I'm running in both cases:

>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setAppName("app")
>>> sc = SparkContext(conf=conf)
>>> import os

>>> os.chdir("C:/Users/theca/Desktop/school_folders/Big Data")
>>> data = sc.textFile("post_codes.txt")
>>> data.take(1)

I was using JRE version 8, and I verified JAVA_HOME:

C:\Python\Python37\Scripts>echo %JAVA_HOME%

C:\ProgramData\Oracle\Java\javapath\java.exe

I changed to the JDK, thinking that would fix the issue:

C:\Program Files\Java\jdk1.8.0_201>setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_201"

SUCCESS: Specified value was saved.

C:\Program Files\Java\jdk1.8.0_201>setx PATH "%PATH%;%JAVA_HOME%\bin";

WARNING: The data being saved is truncated to 1024 characters.

I exited cmd, reopened it, and verified JAVA_HOME again:

C:\WINDOWS\system32>echo %JAVA_HOME%

C:\Program Files\Java\jdk1.8.0_201
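The same check can be made from inside Python, which is the environment PySpark actually launches Java from. Note that the first echo above printed a path ending in java.exe, which is a file rather than a JDK directory. The following is only a minimal sketch: java_home_problems is a hypothetical helper written for this post, not part of PySpark, and it assumes that a usable JAVA_HOME is a directory containing bin\java.exe (or bin/java on other systems).

```python
import os
from pathlib import Path


def java_home_problems(java_home):
    """Return a list of problems found with a JAVA_HOME value (empty if none).

    Hypothetical helper for illustration; not part of PySpark.
    """
    if not java_home:
        return ["JAVA_HOME is not set"]
    home = Path(java_home)
    if home.is_file():
        # e.g. ...\javapath\java.exe instead of the JDK folder
        return ["JAVA_HOME points at a file; it should be the JDK directory"]
    if not home.is_dir():
        return [f"JAVA_HOME directory does not exist: {home}"]
    # The launcher is java.exe on Windows and java elsewhere.
    if not any((home / "bin" / name).is_file() for name in ("java.exe", "java")):
        return ["no java launcher under JAVA_HOME/bin"]
    return []


if __name__ == "__main__":
    # Check the value the current Python process actually sees.
    print(java_home_problems(os.environ.get("JAVA_HOME")) or "JAVA_HOME looks fine")
```

Running this in both the Jupyter kernel and the Python shell shows whether each process has picked up the new value; a process started before setx ran will still see the old one.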

I have tried solutions here: PySpark exception: Java gateway process exited before sending its port number

and here: Pyspark: SparkContext definition in Spyder throws Java gateway error

I have also tried a few other answers on this board. I am wondering whether I need to use an earlier version of Spark. Here is the entire error message:

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    data.take(1)
  File "C:\Python\Python37\lib\site-packages\pyspark\rdd.py", line 1327, in take
    totalParts = self.getNumPartitions()
  File "C:\Python\Python37\lib\site-packages\pyspark\rdd.py", line 391, in getNumPartitions
    return self._jrdd.partitions().size()
  File "C:\Python\Python37\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Python\Python37\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Python/Python37/post_codes.txt
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
  • Hmm, this is weird: Input path does not exist: file:/C:/Python/Python37/post_codes.txt. I assume you've tried with data = sc.textFile("absolute/path/to/post_codes.txt")? Commented Mar 5, 2019 at 6:22
  • @mkaran I'm not sure what you mean, but if you are asking if I have tried the direct path as opposed to os.chdir, then yes I have and the outcome is the same. Commented Mar 5, 2019 at 6:24
  • Yes, that's what I meant, because it seems like this is the main exception. Thanks for the answer. Commented Mar 5, 2019 at 6:30
  • Well, it looks like os.chdir is ignored anyway, or you might have an invalid path. I don't like the space in "Big Data", for instance :) I would try an absolute path first, and a different folder, e.g. sc.textFile("C:/Users/theca/Desktop/school_folders/Big_Data/post_codes.txt") Commented Mar 5, 2019 at 16:35
  • @AlexandrosBiratsis, I just created an RDD by entering manually: data_heterogeneous = sc.parallelize([('Ferrari','fast'), {"Porche": 100000}, ["Spain", "visited", 4504]]).collect() I still get a py4jjava error after doing sc.take(1). Commented Mar 5, 2019 at 21:02

2 Answers

Try the following:

data = sc.textFile("file:///path to the file/")

This should work.
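To make the placeholder concrete, here is a minimal sketch of building such a string from an ordinary Windows path. The helper name to_file_uri is hypothetical, not a PySpark function. It deliberately leaves characters such as spaces unencoded instead of percent-encoding them, on the assumption that Spark's path handling expects the plain form; that assumption has not been checked against the setup in the question.

```python
from pathlib import PureWindowsPath


def to_file_uri(path):
    """Turn an absolute Windows path into a file:/// string.

    Hypothetical helper for illustration; not part of PySpark.
    Backslashes become forward slashes; nothing is percent-encoded.
    """
    p = PureWindowsPath(path)
    if not p.is_absolute():
        raise ValueError(f"expected an absolute path, got {path!r}")
    return "file:///" + p.as_posix()


uri = to_file_uri("C:/Users/theca/Desktop/school_folders/Big Data/post_codes.txt")
print(uri)

# With the SparkContext from the question (not created here):
# data = sc.textFile(uri)
# data.take(1)
```

Requiring an absolute path sidesteps the question of which working directory a relative name is resolved against, which is what the file:/C:/Python/Python37/post_codes.txt in the error message suggests went wrong.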

1 Comment

It doesn't work for me. I get the Py4JJavaError on an RDD I create myself as well, i.e., not on a file I read in. Thanks anyway, though.

Also on Windows 10, this worked for me:

sc = SparkContext(conf = conf)
sc.textFile("file:///C:/SparkCourse/file.csv")  # file:/// scheme followed by the absolute path
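Because sc.textFile is lazy, a wrong path only surfaces later, at the first action such as take(1), and it arrives wrapped in a Py4JJavaError. A quick check in plain Python before calling Spark can separate "the file is not where I think it is" from a genuine Java-side problem. The sketch below is an assumption-laden illustration: resolve_input is a hypothetical helper, and it only inspects the local filesystem as seen by the Python process.

```python
import os


def resolve_input(path, base_dir=None):
    """Return the absolute path of an existing file, or raise FileNotFoundError.

    Hypothetical helper for illustration; not part of PySpark.
    Relative paths are resolved against base_dir (default: Python's cwd).
    """
    base = base_dir if base_dir is not None else os.getcwd()
    full = os.path.abspath(os.path.join(base, path))
    if not os.path.isfile(full):
        raise FileNotFoundError(f"{path!r} resolved to {full!r}, which is not a file")
    return full


# Usage with the SparkContext from the question (not created here):
# full = resolve_input("post_codes.txt",
#                      base_dir="C:/Users/theca/Desktop/school_folders/Big Data")
# data = sc.textFile(full)
```

If resolve_input succeeds and Spark still reports that the input path does not exist, the two processes disagree about the location, which points at how the path string is being passed rather than at the file itself.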
