
I'm trying to create a view for Spark SQL, but I'm having trouble creating it from a list of strings.

So I decided to follow the pyspark.sql documentation verbatim, and it still doesn't work:

testd = [{'name': 'Alice', 'age': 1}]
spark.createDataFrame(testd).collect()

Error trace:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-55-d4321f74b607> in <module>()
      1 testd = [{'name': 'Alice', 'age': 1}]
      2 
----> 3 spark.createDataFrame(testd).collect()

/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/sql/dataframe.pyc in collect(self)
    389         """
    390         with SCCallSiteSync(self._sc) as css:
--> 391             port = self._jdf.collectToPython()
    392         return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    393 

/opt/app/anaconda2/python27/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/opt/app/anaconda2/python27/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o896.collectToPython.
....
TypeError: range() integer end argument expected, got list.
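
(Editorial aside: the final line of the trace is an ordinary Python `TypeError`, not a JVM error; somewhere in the worker/serializer path, `range()` received a list instead of an integer. A minimal sketch of the same error class, unrelated to Spark itself; the function name is illustrative only, and the exact message in the trace above is Python 2's wording:)

```python
def bottoms_out_like_the_trace():
    """Reproduce the class of error at the bottom of the Spark trace."""
    try:
        range([1])  # a list where an int is expected, as in the trace
    except TypeError:
        return "TypeError"
    return "no error"

print(bottoms_out_like_the_trace())  # TypeError
```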

Meanwhile, this example from the tutorial:

l = [('Alice', 1)]
spark.createDataFrame(l, ['name', 'age']).collect()

gives essentially the same error: 'range() integer end argument expected, got list.'

What is going on here???

Here's how I initialize my Spark session:

os.environ['SPARK_HOME']='/path/to/spark2-client'
os.environ['PY4JPATH']='/path/to/spark2-client/python/lib/py4j-0.10.4-src.zip'
sys.path.insert(0, os.path.join(os.environ['SPARK_HOME'],'python'))
sys.path.insert(1, os.path.join(os.environ['SPARK_HOME'],'python/lib'))
os.environ['HADOOP_CONF_DIR']='/etc/hadoop/conf'
os.environ['MASTER']="yarn"
os.environ['SPARK_MAJOR_VERSION']="2"
spark = (SparkSession
            .builder
            .appName('APPNAME')
            .config("spark.executor.instances","8")
            .config("spark.executor.memory","32g")
            .config("spark.driver.memory","64g")
            .config("spark.driver.maxResultSize","32g")
            .enableHiveSupport()
            .getOrCreate())

All other Spark functions work fine, including Hive queries, DataFrame joins, etc. It only fails when I try to create a DataFrame from local memory.
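
(Editorial aside: note that the trace above imports pyspark from `/opt/app/anaconda2/python27/lib/python2.7/site-packages/`, not from the `$SPARK_HOME/python` inserted into `sys.path` — a stray pip/conda-installed pyspark shadowing the cluster's copy is a classic cause of this kind of mismatch. A minimal sketch of a sanity check; the helper name is hypothetical, not a Spark API:)

```python
import os

def pyspark_from_spark_home(pyspark_file, spark_home):
    """True if the imported pyspark module lives under $SPARK_HOME/python.

    `pyspark_file` would be `pyspark.__file__` after `import pyspark`.
    """
    expected = os.path.join(os.path.abspath(spark_home), 'python') + os.sep
    return os.path.abspath(pyspark_file).startswith(expected)

# Usage after `import pyspark`:
#   pyspark_from_spark_home(pyspark.__file__, os.environ['SPARK_HOME'])
print(pyspark_from_spark_home(
    '/path/to/spark2-client/python/pyspark/__init__.py',
    '/path/to/spark2-client'))  # True

print(pyspark_from_spark_home(
    '/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/__init__.py',
    '/path/to/spark2-client'))  # False
```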

Thanks for any insights.

  • That looks like a version mismatch, probably (but not necessarily) related to all the funky path manipulation. I'd start by confirming that you actually use the versions you think you do, both locally (driver) and remotely. For the former you can use the technique I described here. Commented Mar 15, 2019 at 14:37
  • @user10465355 Perhaps, but something has to be right for every other function to work just fine; it's only createDataFrame in particular that fails. Commented Mar 15, 2019 at 14:38
  • No version mismatch fully breaks compatibility; the DataFrame API, with its minimal dependency on Python code, has a negligible failure surface. That's at least my best guess, as the error is not reproducible on proper deployments. Commented Mar 15, 2019 at 14:41

1 Answer


spark.createDataFrame(['Alice',1], ['name', 'age']).collect()

According to the documentation: https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html and https://spark.apache.org/docs/2.3.1/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.unionByName


1 Comment

This does not work -- my question is about why it throws that error trace when I try examples taken straight from the documentation.
