
I have a lot of line-delimited JSON files in S3 and want to read all of them in Spark, then read each line in the JSON and output a Dict/Row for that line with the filename as a column. How would I go about doing this in Python in an efficient manner? Each JSON file is approximately 200 MB.

Here is an example of a file (there would be 200,000 rows like this), call this file class_scores_0219:

{"name": "Maria C", "class":"Math", "score":"80", "student_identification":22}
{"name": "Maria F", "class":"Physics", "score":"90", "student_identification":12}
{"name": "Fink", "class":"English", "score":"75", "student_identification":7}

The output DataFrame would be (for simplicity just showing one row):

+-------------------+---------+-------+-------+------------------------+
|     file_name     |  name   | class | score | student_identification |
+-------------------+---------+-------+-------+------------------------+
| class_scores_0219 | Maria C | Math  |    80 |                     22 |
+-------------------+---------+-------+-------+------------------------+

I have set the S3 secret key / access key using this: sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY) (same thing for the access key), but I can connect in a different way if need be.
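
In full, that setup looks like this (a sketch only; ACCESS_KEY and SECRET_KEY are placeholders for my credentials, and I know the newer s3a connector uses fs.s3a.access.key / fs.s3a.secret.key instead):

# Sketch of the Hadoop configuration described above (legacy s3n property names).
# ACCESS_KEY and SECRET_KEY are placeholders.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)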

I am open to whatever option is the most efficient: I can supply the list of files and feed that in, or I can connect with boto3 and supply a prefix. I am new to Spark, so I appreciate any assistance.

1 Answer


You can achieve this with Spark itself.

Just add a new column with input_file_name() and you will get the required result:

from pyspark.sql.functions import input_file_name
df = spark.read.json(path_to_your_folder_containing_multiple_files)
df = df.withColumn('fileName', input_file_name())
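
Note that input_file_name() returns the full file URI (e.g. s3a://bucket/prefix/class_scores_0219.json). If you only want the base name, as in your example output, you can strip the path afterwards; a small sketch, assuming the files end in .json:

from pyspark.sql.functions import regexp_extract
# Keep just the part after the last '/' and before the '.json' extension
df = df.withColumn('fileName', regexp_extract('fileName', r'([^/]+)\.json$', 1))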

If you want to read specific files, you can pass them in as a list:

files = [file1, file2, file3]
df = spark.read.json(files)

Or, if your files match a wildcard pattern, you can use it like below:

df = spark.read.json('path/to/file/load2020*.json')

Or you can use boto3 to list all the objects under the prefix, build a list of only the files you need, and pass that list to spark.read.json, as sketched below.
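
A rough sketch of that approach; BUCKET, PREFIX, and the .json filter are placeholders to adapt, and the paths are built with the s3a scheme (use whichever connector you actually configured):

import boto3

# Sketch only: BUCKET, PREFIX and the filter condition below are placeholders.
s3 = boto3.client('s3')
paths = []
for page in s3.get_paginator('list_objects_v2').paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('.json'):  # skip whatever files you want to ignore
            paths.append(f's3a://{BUCKET}/{key}')

df = spark.read.json(paths)
df = df.withColumn('fileName', input_file_name())

Since spark.read.json accepts a list of paths, you never have to move the unwanted files to a different bucket.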

Hope it helps.


2 Comments

Thanks so much for your answer. The one thing that doesn't work with this is that there are certain files in our S3 bucket we ignore, so we would have to move only the files we want to use to a different S3 bucket to use this option. Do you know if there's a workaround for this?
What's with the *files here? This causes an error for me.
