
I have a DataFrame with 1000+ columns. I need to save it as a .txt file (not .csv) with no header, and the write mode should be "append".

I used the command below, which does not work:

df.coalesce(1).write.format("text").option("header", "false").mode("append").save("<path>")

The error I got:

pyspark.sql.utils.AnalysisException: 'Text data source supports only a single column,

Note: I should not use an RDD to save, because I need to save files multiple times to the same path.

  • In addition to what you tried, you could mention what error you get. Commented Mar 23, 2018 at 11:07
  • I have updated the question. Commented Mar 23, 2018 at 11:17
  • What is your desired output? Do you want spaces instead of commas? Commented Mar 23, 2018 at 14:46

3 Answers


If you want to write out a text file for a multi-column DataFrame, you will have to concatenate the columns yourself. In the example below I separate the column values with a space and replace null values with a *:

import pyspark.sql.functions as F

df = sqlContext.createDataFrame([("foo", "bar"), ("baz", None)], 
                            ('a', 'b'))

def myConcat(*cols):
    concat_columns = []
    for c in cols[:-1]:
        concat_columns.append(F.coalesce(c, F.lit("*")))
        concat_columns.append(F.lit(" "))  
    concat_columns.append(F.coalesce(cols[-1], F.lit("*")))
    return F.concat(*concat_columns)

df_text = df.withColumn("combined", myConcat(*df.columns)).select("combined")

df_text.show()

df_text.coalesce(1).write.format("text").option("header", "false").mode("append").save("output.txt")

The df_text.show() call prints:

+--------+
|combined|
+--------+
| foo bar|
|   baz *|
+--------+

And your output file should look like this:

foo bar
baz *
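
With 1000+ columns, it may be simpler to build the same space-separated line as a SQL expression string and hand it to selectExpr. The following is only a sketch and not part of the answer above: line_expr and append_as_text are hypothetical helper names, and it assumes that column names contain no backticks and that the separator and null marker contain no single quotes. It uses concat_ws with the same * placeholder for nulls.

```python
def line_expr(columns, sep=" ", null_marker="*"):
    """Build a Spark SQL expression that joins all columns into one string.

    Each column is cast to string and nulls are replaced by null_marker,
    so concat_ws never silently skips a value.
    """
    parts = [
        "coalesce(cast(`{0}` as string), '{1}')".format(name, null_marker)
        for name in columns
    ]
    return "concat_ws('{0}', {1})".format(sep, ", ".join(parts))


def append_as_text(df, path, sep=" ", null_marker="*"):
    """Append df (assumed to be a pyspark.sql.DataFrame) to path as text."""
    single_column = df.selectExpr(
        line_expr(df.columns, sep, null_marker) + " AS value"
    )
    # The text writer accepts exactly one string column.
    single_column.coalesce(1).write.mode("append").text(path)
```

Calling append_as_text(df, "<path>") repeatedly adds a new part file to the same directory each time, since the path given to the writer is a directory rather than a single file.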

1 Comment

Thank you for this! What about concatenating the column names, though?

You can concatenate the columns with the following line. This assumes you want a positional file rather than a delimited one; for a delimited file you would need to place a delimiter between each pair of data columns:

from pyspark.sql.functions import concat

dataFrameWithOnlyOneColumn = dataFrame.select(concat(*dataFrame.columns).alias('data'))

After concatenating the columns, your previous line should work just fine:

dataFrameWithOnlyOneColumn.coalesce(1).write.format("text").option("header", "false").mode("append").save("<path>")
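
One caveat with plain concat: it returns NULL for the whole row if any input column is NULL, and the values are not padded, so field positions shift with the data. A possible workaround for a truly positional file is to pad every column to a fixed width before concatenating. The sketch below is an illustration under assumptions, not the answer's method: fixed_width_expr is a hypothetical helper, the widths are made up, and column names are assumed to contain no backticks.

```python
def fixed_width_expr(widths):
    """Build a Spark SQL expression for one fixed-width line.

    widths is a list of (column_name, width) pairs. Nulls become empty
    strings, and rpad pads with spaces (or truncates) to the given width.
    """
    parts = [
        "rpad(coalesce(cast(`{0}` as string), ''), {1}, ' ')".format(name, width)
        for name, width in widths
    ]
    return "concat({0})".format(", ".join(parts))


# Example with made-up widths: column a takes 5 characters, column b takes 3.
expr = fixed_width_expr([("a", 5), ("b", 3)])
# Usage, assuming dataFrame is a pyspark.sql.DataFrame with columns a and b:
# dataFrame.selectExpr(expr + " AS data").coalesce(1).write.mode("append").text("<path>")
```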


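You could also convert the PySpark DataFrame to pandas and then save it to a file. Note that toPandas() collects all rows onto the driver, so this only works when the data fits in memory. Something like this: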

df_pyspark = spark.createDataFrame(data, schema=columns)

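# Convert the Spark DataFrame created above (not an undefined df)
pandas_df = df_pyspark.toPandas()

# header=False drops the column names, index=False drops the row index
string_representation = pandas_df.to_string(index=False, header=False)

# "a" appends, so repeated runs add to the same file
with open("file_name.txt", "a") as file:
    file.write(string_representation + "\n")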
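
If pandas is not available, a similar driver-side approach can be sketched with plain Python file I/O. This is an assumption-based variant, not what the answer proposes: format_row and append_rows are hypothetical helpers, and the space separator and * null marker are arbitrary choices. It writes to the local filesystem of the driver, so it suits only modest data sizes.

```python
def format_row(values, sep=" ", null_marker="*"):
    """Join one row's values into a line, replacing None with null_marker."""
    return sep.join(null_marker if v is None else str(v) for v in values)


def append_rows(rows, path, sep=" ", null_marker="*"):
    """Append each row as one line to a local text file, with no header."""
    with open(path, "a") as out:
        for row in rows:
            out.write(format_row(row, sep, null_marker) + "\n")


# Usage, assuming df_pyspark is a pyspark.sql.DataFrame:
# append_rows(df_pyspark.toLocalIterator(), "file_name.txt")
```

toLocalIterator() streams rows to the driver one partition at a time, which avoids holding the whole DataFrame in memory at once.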
