
I have a lot of line-delimited JSON files in S3 and want to read all of them in Spark, then read each line in the JSON and output a Dict/Row for that line with the filename as a column. How would I go about doing this in Python in an efficient manner? Each JSON file is approximately 200 MB.

Here is an example of a file (there would be 200,000 rows like this), call this file class_scores_0219:

{"name": "Maria C", "class":"Math", "score":"80", "student_identification":22}
{"name": "Maria F", "class":"Physics", "score":"90", "student_identification":12}
{"name": "Fink", "class":"English", "score":"75", "student_identification":7}

The output DataFrame would be (for simplicity just showing one row):

+-------------------+---------+-------+-------+------------------------+
|     file_name     |  name   | class | score | student_identification |
+-------------------+---------+-------+-------+------------------------+
| class_scores_0219 | Maria C | Math  |    80 |                     22 |
+-------------------+---------+-------+-------+------------------------+

I have set the S3 secret key / access key using this: sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY) (same thing for the access key), but I can connect in a different way if need be.
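
In full, that setup looks like this (a sketch only; ACCESS_KEY and SECRET_KEY are placeholders for my credentials, and I know the newer s3a connector uses fs.s3a.access.key / fs.s3a.secret.key instead):

# Sketch of the Hadoop configuration described above (legacy s3n property names).
# ACCESS_KEY and SECRET_KEY are placeholders.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)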

I am open to whatever option is the most efficient: I can supply the list of files and feed that in, or I can connect with boto3 and supply a prefix. I am new to Spark, so I appreciate any assistance.

1 Answer


You can achieve this with Spark itself.

Just add a new column with input_file_name() and you will get the required result:

from pyspark.sql.functions import input_file_name
df = spark.read.json(path_to_your_folder_containing_multiple_files)
df = df.withColumn('fileName', input_file_name())
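
Note that input_file_name() returns the full file URI (e.g. s3a://bucket/prefix/class_scores_0219.json). If you only want the base name, as in your example output, you can strip the path afterwards; a small sketch, assuming the files end in .json:

from pyspark.sql.functions import regexp_extract
# Keep just the part after the last '/' and before the '.json' extension
df = df.withColumn('fileName', regexp_extract('fileName', r'([^/]+)\.json$', 1))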

If you want to read specific files, you can pass them in as a list:

files = [file1, file2, file3]
df = spark.read.json(files)

Or, if your files match a wildcard pattern, you can use it like below:

df = spark.read.json('path/to/file/load2020*.json')

Or you can use boto3 to list all the objects under the prefix, build a list of only the files you need, and pass that list to spark.read.json, as sketched below.
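
A rough sketch of that approach; BUCKET, PREFIX, and the .json filter are placeholders to adapt, and the paths are built with the s3a scheme (use whichever connector you actually configured):

import boto3

# Sketch only: BUCKET, PREFIX and the filter condition below are placeholders.
s3 = boto3.client('s3')
paths = []
for page in s3.get_paginator('list_objects_v2').paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('.json'):  # skip whatever files you want to ignore
            paths.append(f's3a://{BUCKET}/{key}')

df = spark.read.json(paths)
df = df.withColumn('fileName', input_file_name())

Since spark.read.json accepts a list of paths, you never have to move the unwanted files to a different bucket.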

Hope it helps.


2 Comments

Thanks so much for your answer. The one thing that doesn't work with this is that there are certain files in our S3 bucket we ignore, so we would have to move only the files we want to use to a different S3 bucket to use this option. Do you know if there's a workaround for this?
What's with the *files here? This causes an error for me.
