
As the question suggests, I have a list of S3 paths:

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

I'm using PySpark and want to know how I can load all of these XML files into a single DataFrame, something similar to the example shown below.

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "head").load(s3_paths)

I'm able to read a single file but want to find the best way to load all files.

  • How do you start pyspark to execute the spark.read.format code? Commented Nov 19, 2020 at 18:07
  • concatenate all file paths into a comma-delimited string: df = spark.read....load(','.join(s3_paths)) Commented Nov 19, 2020 at 21:24
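The comma-join suggested in the comment above works because Spark accepts a single path string containing multiple comma-separated paths. A plain-Python sketch of the string being built (no Spark required):

```python
s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

# Spark treats a comma-delimited path string as multiple input paths,
# so joining the list produces one argument that load() can consume:
joined = ','.join(s3_paths)
print(joined)
# s3a://somebucket/1/file1.xml,s3a://somebucket/3/file2.xml
```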

3 Answers


@jxc's answer in the comments to the question is the best solution:

df = spark.read.format("com.databricks.spark.xml")\
               .option("rowTag", "head")\
               .load(','.join(s3_paths))
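As a side note (a sketch, not part of the original answer): PySpark's `DataFrameReader.load` is documented to accept either a string or a list of strings for its `path` parameter, so on recent versions the list can likely be passed without joining:

```python
s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

# DataFrameReader.load(path=...) accepts a list of strings directly,
# so this should be equivalent to the comma-join above (requires a
# running SparkSession, hence commented out here):
# df = spark.read.format("com.databricks.spark.xml") \
#                .option("rowTag", "head") \
#                .load(s3_paths)
```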

Here is an example using a toy dataset:

fnames = ['books_part1.xml','books_part2.xml'] # part1 -> ids bk101-bk106, part2 -> ids bk107-bk112

df = spark.read.format('xml') \
               .option('rowTag', 'book') \
               .load(','.join(fnames))

df.show()

# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |  _id|              author|         description|          genre|price|publish_date|               title|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |bk101|Gambardella, Matthew|An in-depth look ...|       Computer|44.95|  2000-10-01|XML Developer's G...|
# |bk102|          Ralls, Kim|A former architec...|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
# |bk103|         Corets, Eva|After the collaps...|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
# |bk104|         Corets, Eva|In post-apocalyps...|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
# |bk105|         Corets, Eva|The two daughters...|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
# |bk106|    Randall, Cynthia|When Carla meets ...|        Romance| 4.95|  2000-09-02|         Lover Birds|
# |bk107|      Thurman, Paula|A deep sea diver ...|        Romance| 4.95|  2000-11-02|       Splish Splash|
# |bk108|       Knorr, Stefan|An anthology of h...|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
# |bk109|        Kress, Peter|After an inadvert...|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
# |bk110|        O'Brien, Tim|Microsoft's .NET ...|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
# |bk111|        O'Brien, Tim|The Microsoft MSX...|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
# |bk112|         Galos, Mike|Microsoft Visual ...|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+

You can check the following GitHub repo.


Just unpack the list:

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "head").load(*s3_paths)

1 Comment

Tried it - it doesn't work. Exception: java.lang.ClassNotFoundException: Failed to find data source: s3a://somebucket/3/file2.xml.
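The exception in the comment is consistent with `load`'s signature, `DataFrameReader.load(path=None, format=None, schema=None, **options)`: unpacking the list binds the second path to the `format` parameter, so Spark tries to resolve it as a data-source class and fails. A toy sketch of the mis-binding, using a hypothetical stand-in function rather than Spark itself:

```python
# Hypothetical stand-in that mirrors DataFrameReader.load's parameter order:
def load(path=None, format=None, schema=None, **options):
    return {"path": path, "format": format}

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

# Unpacking sends the second path into the `format` slot, which is why
# Spark reports it as a data source it cannot find:
print(load(*s3_paths))
# {'path': 's3a://somebucket/1/file1.xml', 'format': 's3a://somebucket/3/file2.xml'}
```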
