
I wanted to convert the Spark data frame to an RDD using the code below:

from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

The detailed error message is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
      1 from pyspark.mllib.clustering import KMeans
      2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
      4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
    842         if name not in self.columns:
    843             raise AttributeError(
--> 844                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
    845         jc = self._jdf.apply(name)
    846         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'map'

Does anyone know what I did wrong here? Thanks!

3 Comments

  • Keep in mind that MLlib is built around RDDs while ML is generally built around DataFrames. Since you appear to be using Spark 2.0, I would suggest you look up KMeans from ML (see the sketch after these comments): spark.apache.org/docs/latest/ml-clustering.html Commented Sep 16, 2016 at 16:33
  • @JeffL: I checked ml, and I noticed that the input has to be a Dataset, not a DataFrame. So do we need another layer of conversion from DataFrame to Dataset in order to use ml? Commented Sep 16, 2016 at 16:53
  • I'm not 100% clear on the distinction any more, though in Python I believe it's nearly moot. In fact, if you browse the GitHub code, in 1.6.1 the various DataFrame methods are in a dataframe module, while in 2.0 those same methods are in a dataset module and there is no dataframe module. So I don't think you would face any conversion issues between DataFrame and Dataset, at least in the Python API. Commented Sep 16, 2016 at 17:01
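As the first comment suggests, the DataFrame-based KMeans from pyspark.ml accepts a DataFrame directly. A minimal sketch, assuming spark_df has two hypothetical numeric columns x and y:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# pyspark.ml expects a single vector column; VectorAssembler builds it.
# "x" and "y" are hypothetical column names for illustration.
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features_df = assembler.transform(spark_df)

model = KMeans(k=2, maxIter=10).fit(features_df)
clustered = model.transform(features_df)  # adds a "prediction" column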

2 Answers


You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that by calling spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map aliased spark_df.rdd.map(); with Spark 2.0, you must call .rdd explicitly first.
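Applied to the code from the question, a minimal sketch of the fix (this also adds the pyspark.mllib.linalg import that the original snippet was missing; note that runs has no effect since Spark 2.0.0, so it is omitted):

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors  # needed for Vectors.dense

spark_df = sqlContext.createDataFrame(pandas_df)
# Spark 2.0+: convert to an RDD explicitly before calling map
rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
model = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random")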


4 Comments

  • Right on, this is one of the main changes to DataFrames in Spark 2.0.
  • 'RDD' object has no attribute 'collectAsList'
  • This has major downsides: "Converting to RDD breaks DataFrame lineage, there is no predicate pushdown, no column pruning, no SQL plan and less efficient PySpark transformations." See my answer for more details and alternatives.
  • Fair point that you should stick with DataFrames if possible, and it should be possible nearly all of the time. Depending on what you're trying to do, there will be different techniques to accomplish that. But there still may be use cases for converting to RDDs.

You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:

Converting to RDD breaks DataFrame lineage, there is no predicate pushdown, no column pruning, no SQL plan and less efficient PySpark transformations.

What should you do instead?

Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.
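For instance, a per-row arithmetic transformation that might tempt you toward map can be written directly against columns (a sketch assuming a hypothetical numeric column named value):

from pyspark.sql.functions import col

df.select((col("value") * 2).alias("doubled")).show()  # Column-expression form
df.selectExpr("value * 2 AS doubled").show()           # SQL-expression form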

Another example is using explode instead of flatMap (which exists on RDDs):

df.select($"name",explode($"knownLanguages"))
    .show(false)

Result:

+-------+------+
|name   |col   |
+-------+------+
|James  |Java  |
|James  |Scala |
|Michael|Spark |
|Michael|Java  |
|Michael|null  |
|Robert |CSharp|
|Robert |      |
+-------+------+
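Since the question uses PySpark, here is the same example as a sketch in Python (assuming the same df with name and knownLanguages columns):

from pyspark.sql.functions import explode

df.select("name", explode("knownLanguages")).show(truncate=False)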

You can also use withColumn or a UDF, depending on the use case, or another option from the DataFrame API.
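For example, an arbitrary Python function can be wrapped as a UDF and applied with withColumn (a sketch with a hypothetical temp_f column; prefer built-in column expressions over UDFs when they suffice, since UDFs are opaque to the optimizer):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Wrap a plain Python function as a UDF that returns a double
to_celsius = udf(lambda f: (f - 32.0) * 5.0 / 9.0, DoubleType())
df.withColumn("temp_c", to_celsius(df["temp_f"])).show()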

