
I want to perform k-fold cross-validation with pyspark.ml to tune the model's parameters, but I'm getting an AttributeError:

AttributeError: 'DataFrame' object has no attribute '_jdf'

I initially tried pyspark.mllib but could not get k-fold cross-validation working there either.

import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.ml.classification import DecisionTreeClassifier

data=pd.read_csv("file:///SparkCourse/wdbc.csv", header=None)
type(data)
print(data)

conf = SparkConf().setMaster("local").setAppName("SparkDecisionTree")
sc = SparkContext(conf = conf)

# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                            maxDepth=3)

# Train model with Training Data
dtModel = dt.fit(data)

# I expect the model to be trained but I'm getting the following error 
AttributeError: 'DataFrame' object has no attribute '_jdf'

Note: I'm able to print the data; the error occurs at dt.fit(data).

  • You will need to convert the pandas dataframe to a Spark dataframe. Commented Apr 10, 2019 at 6:22
  • I'll try doing that. Thank you. Commented Apr 11, 2019 at 19:03
  • In case it helps someone. This error can also be thrown if you've converted the DataFrame to pandas for display after loading it. For example, by using df.limit(5).toPandas(). Commented Jan 26, 2022 at 20:11

4 Answers


Convert Pandas to Spark

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

spark_df = sqlContext.createDataFrame(pandas_df)


I just want to share my experience with this error. In my case I had a loop, and in some iterations the dataset was just a string because it was empty. Handling the empty datasets with an 'if' check solved my problem. Thanks.
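A sketch of that guard (function and variable names are hypothetical; the point is just to skip non-DataFrame placeholders before calling fit):

```python
def fit_if_present(estimator, dataset):
    # In some iterations the dataset was just a string because it was empty;
    # skip those instead of calling fit() on them
    if dataset is None or isinstance(dataset, str):
        return None
    return estimator.fit(dataset)
```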


If you hit this as a metric-evaluation error, you probably:

  1. Transformed the test set with Spark correctly, then peeked at the result as a pandas DataFrame:
# Spark model, transformed test, converted to pandas df
predictions = model.transform(test)
predDF = predictions.toPandas()
predDF.head()
  2. Then tried:
eval_acc = MulticlassClassificationEvaluator(
            labelCol='Label_index',
            predictionCol='prediction',
            metricName='accuracy'
)

# Evaluate Performance
acc = eval_acc.evaluate(predDF) # Error
print(f"accuracy: {acc}")

I forgot predDF is a pandas DataFrame. The evaluator needs predictions because it's a Spark DataFrame.

acc = eval_acc.evaluate(predictions) # Works
print(f"accuracy: {acc}")



I think it's because you need to use spark.read to get a Spark DataFrame in the first place. Try this:

data = spark.read.option("header", True).csv(
 "file:///SparkCourse/wdbc.csv"
)

