I'm trying to create a view for Spark SQL, but I'm having trouble creating it from a list of strings. So I followed the pyspark.sql documentation verbatim, and it still doesn't work:
testd = [{'name': 'Alice', 'age': 1}]
spark.createDataFrame(testd).collect()
Error trace:
Py4JJavaError Traceback (most recent call last)
<ipython-input-55-d4321f74b607> in <module>()
1 testd = [{'name': 'Alice', 'age': 1}]
2
----> 3 spark.createDataFrame(testd).collect()
/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/sql/dataframe.pyc in collect(self)
389 """
390 with SCCallSiteSync(self._sc) as css:
--> 391 port = self._jdf.collectToPython()
392 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
393
/opt/app/anaconda2/python27/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/app/anaconda2/python27/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o896.collectToPython.
....
TypeError: range() integer end argument expected, got list.
Meanwhile, this example from the tutorial:
l = [('Alice', 1)]
spark.createDataFrame(l, ['name', 'age']).collect()
fails with essentially the same error: TypeError: range() integer end argument expected, got list.
What is going on here???
Here's how I initialize my Spark session:
os.environ['SPARK_HOME']='/path/to/spark2-client'
os.environ['PY4JPATH']='/path/to/spark2-client/python/lib/py4j-0.10.4-src.zip'
sys.path.insert(0, os.path.join(os.environ['SPARK_HOME'],'python'))
sys.path.insert(1, os.path.join(os.environ['SPARK_HOME'],'python/lib'))
os.environ['HADOOP_CONF_DIR']='/etc/hadoop/conf'
os.environ['MASTER']="yarn"
os.environ['SPARK_MAJOR_VERSION']="2"
spark = (SparkSession
.builder
.appName('APPNAME')
.config("spark.executor.instances","8")
.config("spark.executor.memory","32g")
.config("spark.driver.memory","64g")
.config("spark.driver.maxResultSize","32g")
.enableHiveSupport()
.getOrCreate())
All other Spark functionality works fine, including Hive queries, DataFrame joins, etc. It's only when I try to create a DataFrame from local data that it fails.
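To illustrate the pattern (the table and column names below stand in for our real Hive objects):
# These all run without a problem:
df = spark.sql("SELECT * FROM some_db.some_table")
df.limit(10).show()
df.join(df, ['some_key']).count()

# Only calls like this one fail, with the TypeError above:
spark.createDataFrame([('Alice', 1)], ['name', 'age']).collect()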
Thanks for any insights.
This doesn't look like a problem with createDataFrame in particular. The DataFrame API, with its minimal dependency on Python code, has a negligible failure surface. My best guess is an environment or deployment misconfiguration, as the error is not reproducible on proper deployments.
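Since your traceback shows an Anaconda Python 2.7 driver, one common culprit for exactly this TypeError is the driver and the YARN executors resolving different Python interpreters. That is an assumption about your cluster, not something the trace proves, but it is cheap to check. A minimal sketch that compares the two sides (sc.range runs through the Python workers, so on a broken setup this may itself fail, which would be telling in its own right):
import sys
print(sys.version)                   # interpreter running the driver
print(spark.sparkContext.pythonVer)  # Python version PySpark was started with

def worker_python(_):
    import sys
    return sys.version

# Trivial one-partition job that reports the executors' interpreter:
print(spark.sparkContext.range(1, numSlices=1).map(worker_python).collect())
If the versions differ, pinning the same interpreter on both sides before the session is created usually resolves it. Another sketch, assuming the Anaconda install from your traceback exists at the same path on every node (adjust the path to your environment):
import os

# Must be set before the SparkSession/SparkContext is created.
# The path below is an assumption taken from your traceback --
# verify it actually exists on every cluster node:
python_bin = '/opt/app/anaconda2/python27/bin/python'
os.environ['PYSPARK_PYTHON'] = python_bin         # used by executors
os.environ['PYSPARK_DRIVER_PYTHON'] = python_bin  # used by the driver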