
I have a DataFrame df with the columns ['name', 'age']. I saved it as an RDD using df.rdd.saveAsTextFile(".."). When I load the saved file back, collect() gives me the following result:

a = sc.textFile("\mee\sample")
a.collect()
Output:
    [u"Row(name=u'Alice', age=1)",
     u"Row(name=u'Alice', age=2)",
     u"Row(name=u'Joe', age=3)"]

This is not an RDD of Rows, so accessing a column by attribute fails:

a.map(lambda g:g.age).collect()
AttributeError: 'unicode' object has no attribute 'age'

Is there any way to save the DataFrame as a plain RDD, without the column names and the Row keyword? I want to save the DataFrame so that loading the file and calling collect() gives me the following:

a.collect()   
[(Alice,1),(Alice,2),(Joe,3)]
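The root of the problem is that saveAsTextFile writes the string representation of each Row. The same effect can be reproduced in plain Python with a namedtuple standing in for pyspark.sql.Row (which is essentially a tuple subclass):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; the field names come from the question.
Row = namedtuple("Row", ["name", "age"])

row = Row(name="Alice", age=1)
text = str(row)  # roughly what saveAsTextFile writes out

print(text)                  # Row(name='Alice', age=1)
print(hasattr(text, "age"))  # False: once it is a string, the attribute is gone
```

This is exactly why `g.age` raises AttributeError after the round trip through a text file.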

2 Answers


It is a normal RDD[Row]. The problem is that when you save with saveAsTextFile and load with textFile, what you get back is a bunch of strings. If you want to save objects, you should use some form of serialization. For example, pickleFile:

from pyspark.sql import Row

df = sqlContext.createDataFrame(
   [('Alice', 1), ('Alice', 2), ('Joe', 3)],
   ("name", "age")
)

df.rdd.map(tuple).saveAsPickleFile("foo")
sc.pickleFile("foo").collect()

## [('Joe', 3), ('Alice', 1), ('Alice', 2)]
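The same principle can be illustrated outside Spark with the standard pickle module, which PySpark's default serializer is built on: pickled tuples come back as real tuples, not their string representations. (The file name foo.pkl is just an example; a single local file stands in for Spark's per-partition output.)

```python
import os
import pickle
import tempfile

data = [("Alice", 1), ("Alice", 2), ("Joe", 3)]

# Write the tuples with pickle, then read them back.
path = os.path.join(tempfile.mkdtemp(), "foo.pkl")
with open(path, "wb") as f:
    pickle.dump(data, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)  # [('Alice', 1), ('Alice', 2), ('Joe', 3)]
```

Because real objects survive the round trip, `restored[0][1]` is still an int, unlike the text-file case.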

2 Comments

Yeah, but how would you load that pickle file back into a Spark df?
@bluerubez OP doesn't want a DataFrame back. There are better formats if you want to serialize a DataFrame, although tuples can work as well.
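That said, if you do want the DataFrame back, one possible sketch (untested here; it assumes "foo" was written from tuples as in the answer above, and that you resupply the column names by hand) is toDF:

    # Rebuild a DataFrame from the pickled tuples.
    # Assumes "foo" was written by df.rdd.map(tuple).saveAsPickleFile("foo").
    df2 = sc.pickleFile("foo").toDF(["name", "age"])
    df2.show()

This requires a live SparkContext/SQLContext, so treat it as a sketch rather than a drop-in snippet.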

I think you can do it like this:

a.map(lambda x:(x[0],x[1])).collect()
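Note that this positional indexing works on the original df.rdd of Row objects (Row supports tuple-style access), not on the strings loaded back from the text file. In plain Python, with a namedtuple standing in for Row and the sample data from the question:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; sample rows from the question.
Row = namedtuple("Row", ["name", "age"])
rows = [Row("Alice", 1), Row("Alice", 2), Row("Joe", 3)]

# Positional access works because Row is tuple-like.
pairs = [(x[0], x[1]) for x in rows]
print(pairs)  # [('Alice', 1), ('Alice', 2), ('Joe', 3)]
```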

Comments
