I'm trying to create a view for Spark SQL, but I'm having trouble creating it from a list of strings. So I followed the pyspark.sql documentation verbatim, and it still doesn't work:
testd = [{'name': 'Alice', 'age': 1}]
spark.createDataFrame(testd).collect()
Error trace:
Py4JJavaError Traceback (most recent call last)
<ipython-input-55-d4321f74b607> in <module>()
1 testd = [{'name': 'Alice', 'age': 1}]
2
----> 3 spark.createDataFrame(testd).collect()
/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/sql/dataframe.pyc in collect(self)
389 """
390 with SCCallSiteSync(self._sc) as css:
--> 391 port = self._jdf.collectToPython()
392 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
393
/opt/app/anaconda2/python27/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/app/anaconda2/python27/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o896.collectToPython.
....
TypeError: range() integer end argument expected, got list.
Meanwhile, this example from the tutorial:
l = [('Alice', 1)]
spark.createDataFrame(l, ['name', 'age']).collect()
fails with essentially the same error: TypeError: range() integer end argument expected, got list.
What is going on here???
Here's how I initialize my Spark session:
os.environ['SPARK_HOME']='/path/to/spark2-client'
os.environ['PY4JPATH']='/path/to/spark2-client/python/lib/py4j-0.10.4-src.zip'
sys.path.insert(0, os.path.join(os.environ['SPARK_HOME'],'python'))
sys.path.insert(1, os.path.join(os.environ['SPARK_HOME'],'python/lib'))
os.environ['HADOOP_CONF_DIR']='/etc/hadoop/conf'
os.environ['MASTER']="yarn"
os.environ['SPARK_MAJOR_VERSION']="2"
spark = (SparkSession
.builder
.appName('APPNAME')
.config("spark.executor.instances","8")
.config("spark.executor.memory","32g")
.config("spark.driver.memory","64g")
.config("spark.driver.maxResultSize","32g")
.enableHiveSupport()
.getOrCreate())
All other Spark functionality works fine, including Hive queries, DataFrame joins, etc. It's only when I try to create a DataFrame from local data that it fails.
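To illustrate the pattern (the table and column names below stand in for our real Hive objects):
# These all run without a problem:
df = spark.sql("SELECT * FROM some_db.some_table")
df.limit(10).show()
df.join(df, ['some_key']).count()

# Only calls like this one fail, with the TypeError above:
spark.createDataFrame([('Alice', 1)], ['name', 'age']).collect()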
Thanks for any insights.
This doesn't look like a problem with createDataFrame in particular. The DataFrame API, with its minimal dependency on Python code, has a negligible failure surface. My best guess is an environment or deployment misconfiguration, as the error is not reproducible on proper deployments.
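Since your traceback shows an Anaconda Python 2.7 driver, one common culprit for exactly this TypeError is the driver and the YARN executors resolving different Python interpreters. That is an assumption about your cluster, not something the trace proves, but it is cheap to check. A minimal sketch that compares the two sides (sc.range runs through the Python workers, so on a broken setup this may itself fail, which would be telling in its own right):
import sys
print(sys.version)                   # interpreter running the driver
print(spark.sparkContext.pythonVer)  # Python version PySpark was started with

def worker_python(_):
    import sys
    return sys.version

# Trivial one-partition job that reports the executors' interpreter:
print(spark.sparkContext.range(1, numSlices=1).map(worker_python).collect())
If the versions differ, pinning the same interpreter on both sides before the session is created usually resolves it. Another sketch, assuming the Anaconda install from your traceback exists at the same path on every node (adjust the path to your environment):
import os

# Must be set before the SparkSession/SparkContext is created.
# The path below is an assumption taken from your traceback --
# verify it actually exists on every cluster node:
python_bin = '/opt/app/anaconda2/python27/bin/python'
os.environ['PYSPARK_PYTHON'] = python_bin         # used by executors
os.environ['PYSPARK_DRIVER_PYTHON'] = python_bin  # used by the driver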