
I wanted to convert the Spark data frame to an RDD using the code below:

from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

The detailed error message is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
      1 from pyspark.mllib.clustering import KMeans
      2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
      4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
    842         if name not in self.columns:
    843             raise AttributeError(
--> 844                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
    845         jc = self._jdf.apply(name)
    846         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'map'

Does anyone know what I did wrong here? Thanks!

3 Comments

  • Keep in mind that MLlib is built around RDDs while ML is generally built around DataFrames. Since you appear to be using Spark 2.0, I would suggest you look up KMeans from ML (see the sketch after these comments): spark.apache.org/docs/latest/ml-clustering.html Commented Sep 16, 2016 at 16:33
  • @JeffL: I checked ml, and I noticed that the input has to be a Dataset, not a DataFrame. So do we need another layer of conversion from DataFrame to Dataset in order to use ml? Commented Sep 16, 2016 at 16:53
  • I'm not 100% clear on the distinction any more, though in Python I believe it's nearly moot. In fact, if you browse the GitHub code, in 1.6.1 the various DataFrame methods are in a dataframe module, while in 2.0 those same methods are in a dataset module and there is no dataframe module. So I don't think you would face any conversion issues between DataFrame and Dataset, at least in the Python API. Commented Sep 16, 2016 at 17:01
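As the first comment suggests, the DataFrame-based KMeans from pyspark.ml accepts a DataFrame directly. A minimal sketch, assuming spark_df has two hypothetical numeric columns x and y:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# pyspark.ml expects a single vector column; VectorAssembler builds it.
# "x" and "y" are hypothetical column names for illustration.
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features_df = assembler.transform(spark_df)

model = KMeans(k=2, maxIter=10).fit(features_df)
clustered = model.transform(features_df)  # adds a "prediction" column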

2 Answers


You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that by calling spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map aliased spark_df.rdd.map(); with Spark 2.0, you must call .rdd explicitly first.
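Applied to the code from the question, a minimal sketch of the fix (this also adds the pyspark.mllib.linalg import that the original snippet was missing; note that runs has no effect since Spark 2.0.0, so it is omitted):

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors  # needed for Vectors.dense

spark_df = sqlContext.createDataFrame(pandas_df)
# Spark 2.0+: convert to an RDD explicitly before calling map
rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
model = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random")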


4 Comments

  • Right on, this is one of the main changes to DataFrames in Spark 2.0.
  • 'RDD' object has no attribute 'collectAsList'
  • This has major downsides: "Converting to RDD breaks DataFrame lineage, there is no predicate pushdown, no column pruning, no SQL plan and less efficient PySpark transformations." See my answer for more details and alternatives.
  • Fair point that you should stick with DataFrames if possible, and it should be possible nearly all of the time. Depending on what you're trying to do, there will be different techniques to accomplish that. But there still may be use cases for converting to RDDs.

You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:

Converting to RDD breaks DataFrame lineage, there is no predicate pushdown, no column pruning, no SQL plan and less efficient PySpark transformations.

What should you do instead?

Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.
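For instance, a per-row arithmetic transformation that might tempt you toward map can be written directly against columns (a sketch assuming a hypothetical numeric column named value):

from pyspark.sql.functions import col

df.select((col("value") * 2).alias("doubled")).show()  # Column-expression form
df.selectExpr("value * 2 AS doubled").show()           # SQL-expression form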

Another example is using explode instead of flatMap (which exists on RDDs):

df.select($"name",explode($"knownLanguages"))
    .show(false)

Result:

+-------+------+
|name   |col   |
+-------+------+
|James  |Java  |
|James  |Scala |
|Michael|Spark |
|Michael|Java  |
|Michael|null  |
|Robert |CSharp|
|Robert |      |
+-------+------+
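Since the question uses PySpark, here is the same example as a sketch in Python (assuming the same df with name and knownLanguages columns):

from pyspark.sql.functions import explode

df.select("name", explode("knownLanguages")).show(truncate=False)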

You can also use withColumn or a UDF, depending on the use case, or another option from the DataFrame API.
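For example, an arbitrary Python function can be wrapped as a UDF and applied with withColumn (a sketch with a hypothetical temp_f column; prefer built-in column expressions over UDFs when they suffice, since UDFs are opaque to the optimizer):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Wrap a plain Python function as a UDF that returns a double
to_celsius = udf(lambda f: (f - 32.0) * 5.0 / 9.0, DoubleType())
df.withColumn("temp_c", to_celsius(df["temp_f"])).show()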

