
My DataFrame output is as below:
DF.show(2)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  10|  20|  30|
|  11|  21|  31|
+----+----+----+

After saving it as a text file with DF.rdd.saveAsTextFile("path"), the output looks like:

Row(col1=u'10', col2=u'20', col3=u'30')  
Row(col1=u'11', col2=u'21', col3=u'31')  

The DataFrame has millions of rows and 20 columns. How can I save it as a text file in the format below, i.e., without column names and Python unicode markers?

10|20|30  
11|21|31 

While creating the initial RDD I used the code below to remove the unicode markers, though I am still getting them:

data = sc.textFile("file.txt")
# encode("ascii", "ignore") drops non-ASCII bytes before splitting on '|'
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))

Thanks in advance!

2 Answers


I think you can just do:

DF.rdd.map(lambda l: l[0] + '|' + l[1] + '|' + l[2]).saveAsTextFile(...)
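
For all 20 columns, a minimal sketch that joins every field instead of hard-coding indices (assuming DF is the DataFrame above; the output path is illustrative):

# str() drops the u'' prefix for ASCII values in Python 2
DF.rdd.map(lambda row: '|'.join(str(v) for v in row)).saveAsTextFile("output/path")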


3 Comments

Thank you @PeterK, this is working for this example DF, but my actual DF contains millions of rows and 20 columns. How can I do this for the actual DF?
Sorry, I am able to run this for my actual DF. When trying it initially I was facing the issue SyntaxError: Non-ASCII character '\xe2' in file. This link helped me.
@hadoop491 if you don't want to specify all columns you can try: .map(lambda x: '|'.join(map(str,x)))

In Spark 2.0 you can write DataFrames out directly to CSV, which is all I think you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

So in your case, you could just do something like:

df.write.option("sep", "|").option("header", "false").csv("some/path/")
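
Note that, like saveAsTextFile, this writes a directory of pipe-delimited part files rather than a single file.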

There is a Databricks plugin that provides this functionality in Spark 1.x:

https://github.com/databricks/spark-csv
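
With that package available, a minimal Spark 1.x sketch (option names follow the spark-csv README):

df.write \
    .format("com.databricks.spark.csv") \
    .option("delimiter", "|") \
    .option("header", "false") \
    .save("some/path/")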

As for converting your unicode strings to ASCII, see this question: Convert a Unicode string to a string in Python (containing extra symbols)
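
For reference, a minimal Python 2 illustration of that kind of conversion (the sample string is illustrative):

s = u'caf\xe9'
print(s.encode('ascii', 'ignore'))  # prints 'caf'; non-ASCII characters are dropped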

4 Comments

Thank you @Bradley Kaiser. Is there any possibility for Spark 1.x?
There is a Databricks plugin for Spark 1.x that provides the same functionality. Oops, I meant to mention that above.
I tried that as ./pyspark --packages com.databricks:spark-csv_2.11:1.5.0 but it is unable to fetch it, failing with the error "Java gateway process exited before sending the driver its port number". I think it is some sort of organisation network blocking. Can I download it and place it in some library folder?
Yeah, you could definitely do that.
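
If the network blocks --packages, a sketch of the manual route (local paths and jar file names are illustrative; spark-csv also needs its own dependencies, e.g. commons-csv, on the classpath):

./pyspark --jars /local/jars/spark-csv_2.11-1.5.0.jar,/local/jars/commons-csv-1.1.jar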
