
My DataFrame output is as below:
DF.show(2)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  10|  20|  30|
|  11|  21|  31|
+----+----+----+

After saving it as a text file with DF.rdd.saveAsTextFile("path"), the output looks like:

Row(col1=u'10', col2=u'20', col3=u'30')  
Row(col1=u'11', col2=u'21', col3=u'31')  

The DataFrame has millions of rows and 20 columns. How can I save it as a text file in the format below, i.e., without column names and Python unicode markers?

10|20|30  
11|21|31 

While creating the initial RDD I used the code below to remove the unicode markers, though I am still getting them:

data = sc.textFile("file.txt")
# encode("ascii", "ignore") drops non-ASCII bytes before splitting on '|'
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))

Thanks in advance!

2 Answers


I think you can just do:

DF.rdd.map(lambda l: l[0] + '|' + l[1] + '|' + l[2]).saveAsTextFile(...)
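
For all 20 columns, a minimal sketch that joins every field instead of hard-coding indices (assuming DF is the DataFrame above; the output path is illustrative):

# str() drops the u'' prefix for ASCII values in Python 2
DF.rdd.map(lambda row: '|'.join(str(v) for v in row)).saveAsTextFile("output/path")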


3 Comments

Thank you @PeterK, this is working for this example DF, but my actual DF contains millions of rows and 20 columns. How can I do this for the actual DF?
Sorry, I am able to run this for my actual DF. When trying it initially I was facing the issue SyntaxError: Non-ASCII character '\xe2' in file. This link helped me.
@hadoop491 if you don't want to specify all columns you can try: .map(lambda x: '|'.join(map(str,x)))

In Spark 2.0 you can write DataFrames out directly to CSV, which is all I think you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

So in your case, you could just do something like:

df.write.option("sep", "|").option("header", "false").csv("some/path/")
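
Note that, like saveAsTextFile, this writes a directory of pipe-delimited part files rather than a single file.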

There is a Databricks plugin that provides this functionality in Spark 1.x:

https://github.com/databricks/spark-csv
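
With that package available, a minimal Spark 1.x sketch (option names follow the spark-csv README):

df.write \
    .format("com.databricks.spark.csv") \
    .option("delimiter", "|") \
    .option("header", "false") \
    .save("some/path/")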

As for converting your unicode strings to ASCII, see this question: Convert a Unicode string to a string in Python (containing extra symbols)
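
For reference, a minimal Python 2 illustration of that kind of conversion (the sample string is illustrative):

s = u'caf\xe9'
print(s.encode('ascii', 'ignore'))  # prints 'caf'; non-ASCII characters are dropped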

4 Comments

Thank you @Bradley Kaiser. Is there any possibility for Spark 1.x?
There is a Databricks plugin for Spark 1.x that provides the same functionality. Oops, I meant to mention that above.
I tried that as ./pyspark --packages com.databricks:spark-csv_2.11:1.5.0 but it is unable to fetch it, failing with the error "Java gateway process exited before sending the driver its port number". I think it is some sort of organisation network blocking. Can I download it and place it in some library folder?
Yeah, you could definitely do that.
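
If the network blocks --packages, a sketch of the manual route (local paths and jar file names are illustrative; spark-csv also needs its own dependencies, e.g. commons-csv, on the classpath):

./pyspark --jars /local/jars/spark-csv_2.11-1.5.0.jar,/local/jars/commons-csv-1.1.jar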
