I have a lot of line-delimited JSON files in S3 and want to read them all in Spark, parse each line, and output a Dict/Row for that line with the file name as a column. How would I go about doing this efficiently in Python? Each JSON file is approx. 200 MB.
Here is an example of a file (there would be 200,000 rows like this); call this file class_scores_0219:
{"name": "Maria C", "class":"Math", "score":"80", "student_identification":22}
{"name": "Maria F", "class":"Physics", "score":"90", "student_identification":12}
{"name": "Fink", "class":"English", "score":"75", "student_identification":7}
The output DataFrame would be (for simplicity just showing one row):
+-------------------+---------+-------+-------+------------------------+
| file_name | name | class | score | student_identification |
+-------------------+---------+-------+-------+------------------------+
| class_scores_0219 | Maria C | Math | 80 | 22 |
+-------------------+---------+-------+-------+------------------------+
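To make the per-line transformation concrete, here is a plain-Python sketch (no Spark, no S3) of what I want to happen to each line. The function name `rows_from_lines` is just a placeholder of mine, not an existing API, and the sample lines are copied from the file above.

```python
import json


def rows_from_lines(file_name, lines):
    """Yield one dict per non-empty JSON line, tagged with its source file name."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Put file_name first, then the fields parsed from the line.
        yield {"file_name": file_name, **record}


sample = [
    '{"name": "Maria C", "class":"Math", "score":"80", "student_identification":22}',
    '{"name": "Maria F", "class":"Physics", "score":"90", "student_identification":12}',
]
rows = list(rows_from_lines("class_scores_0219", sample))
print(rows[0])
```

What I am after is the Spark equivalent of this, run across all the files in parallel.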
I have set the S3 secret key/access key using this: sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
(same thing for the access key), but I can connect in a different way if need be.
I am open to whatever option is the most efficient: I can supply the list of files and feed that in, or I can connect with boto3 and supply a prefix. I am new to Spark, so I appreciate all assistance.
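For reference, one direction I imagine might work is sketched below. It is untested against S3 and rests on assumptions. It assumes `spark.read.json` accepts a glob or a list of paths and reads one record per line. It also assumes `input_file_name()` from `pyspark.sql.functions` returns the full path of the file each row came from. The helper names, the regex, and the bucket/prefix in the usage comment are hypothetical.

```python
import re

# Captures the last path segment without its extension (illustrative pattern).
FILE_STEM_PATTERN = r"([^/]+?)(?:\.[^/.]+)?$"


def file_stem(path):
    """Pure-Python check of the same pattern, handy for trying it locally."""
    match = re.search(FILE_STEM_PATTERN, path)
    return match.group(1) if match else ""


def read_json_with_file_name(spark, paths):
    """Read line-delimited JSON and add a file_name column.

    `spark` is an existing SparkSession; `paths` is a glob string or a list of paths.
    """
    # Imported lazily so this sketch can be loaded without pyspark installed.
    from pyspark.sql import functions as F

    df = spark.read.json(paths)  # one row per JSON line
    return df.withColumn(
        "file_name",
        F.regexp_extract(F.input_file_name(), FILE_STEM_PATTERN, 1),
    )


# Hypothetical usage; the bucket and prefix are placeholders:
# df = read_json_with_file_name(spark, "s3n://my-bucket/some/prefix/*")
# df.select("file_name", "name", "class", "score", "student_identification").show()
```

I do not know whether this is the efficient way to do it for many 200 MB files, which is really what I am asking.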