
I'm used to programming in Python. My company now has a Hadoop cluster with Jupyter installed. Until now I have never used Spark / PySpark for anything.

I am able to load files from HDFS as easily as this:

text_file = sc.textFile("/user/myname/student_grades.txt")

And I'm able to write output like this:

text_file.saveAsTextFile("/user/myname/student_grades2.txt")

What I'm trying to achieve is to use a simple for loop to read the text files one by one and write their content into one HDFS file. So I tried this:

list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']

for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile("/user/myname/all.txt")

So this works for the first element of the list, but then gives me this error message:

Py4JJavaError: An error occurred while calling o714.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
XXXXXXXX/user/myname/all.txt already exists

To avoid confusion, I blurred out the IP address with XXXXXXXX.


What is the right way to do this? I will have tons of datasets (like 'text1', 'text2', ...) and want to apply a Python function to each of them before saving them to HDFS, but I would like to have the results all together in one output file.

Thanks a lot!
MG

EDIT: It seems my final goal was not really clear. I need to apply a function to each text file separately, and then I want to append the output to the existing output directory. Something like this:

for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file = really_cool_python_function(text_file)
    text_file.saveAsTextFile("/user/myname/all.txt")

4 Answers


I wanted to post this as a comment but could not do so, as I do not have enough reputation.

You have to convert your RDD to a DataFrame and then write it in append mode. To convert an RDD to a DataFrame, please look into this answer:
https://stackoverflow.com/a/39705464/3287419
or this link: http://spark.apache.org/docs/latest/sql-programming-guide.html
To save a DataFrame in append mode, the link below may be useful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
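
A minimal sketch of that approach, assuming a SparkContext sc and a SQLContext/SparkSession are already available in the notebook; the column name and the output path all_out are placeholders:

from pyspark.sql import Row

for name in ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']:
    rdd = sc.textFile("/user/myname/" + name)
    # wrap each line in a Row so the RDD can become a single-column DataFrame
    df = rdd.map(lambda line: Row(value=line)).toDF()
    # append mode adds new part files instead of failing because the directory already exists
    df.write.mode("append").text("/user/myname/all_out")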

Almost the same question is here as well: Spark: Saving RDD in an already existing path in HDFS. But the answer provided is for Scala; I hope something similar can be done in Python as well.

There is yet another (but ugly) approach. Convert your RDD to a string; let the resulting string be resultString. Then use subprocess to append that string to the destination file, i.e.:

import subprocess

# pipe the string through stdin to `hdfs dfs -appendToFile`, which appends it to the destination file
subprocess.call("echo " + resultString + " | hdfs dfs -appendToFile - <destination>", shell=True)



You can read multiple files and save them like this:

textfile = sc.textFile(','.join(['/user/myname/'+f for f in list]))
textfile.saveAsTextFile('/user/myname/all')

You will get all the part files within the output directory.

Comments

It seems my final goal was not really clear. I need to apply a function to each text file separately, and then I want to append the output to the existing output directory. See the EDIT.
The same function for all text files?
Yes, the same function for all files, but I can't join the text files beforehand because each file needs to be treated separately.
Will the columns in all the files be similar or different?
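
To connect this exchange back to the edited question, a minimal sketch of that idea: apply the per-file function (the question's really_cool_python_function placeholder) to each file's RDD separately, union the results, and save once. The output path all_out is a placeholder and must not exist yet.

files = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']

# apply the per-file function to each file's RDD separately
processed = [really_cool_python_function(sc.textFile("/user/myname/" + f)) for f in files]

# union the processed RDDs into one and write them out in a single save
sc.union(processed).saveAsTextFile("/user/myname/all_out")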

If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.
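
A rough sketch of that idea, assuming a HiveContext is available as sqlContext; the table name, column name, and paths below are made up for illustration:

# register the whole input folder as one external table; every file in it contributes rows
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS student_grades (line STRING)
    ROW FORMAT DELIMITED
    LOCATION '/user/myname/grades_input/'
""")

# read the folder as a single table and write the combined result in one pass
sqlContext.sql("SELECT line FROM student_grades") \
    .write.mode("overwrite").text("/user/myname/all_out")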



I would try this; it should be fine:

list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']

for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    # write each input to its own output directory (a suffix is added so it does not collide with the input file)
    text_file.saveAsTextFile(f"/user/myname/{i}_out")

