
As the question suggests, I have a list of S3 paths:

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

I'm using PySpark and want to know how I can load all of these XML files into a single DataFrame, something similar to the example shown below.

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "head").load(s3_paths)

I'm able to read a single file but want to find the best way to load all files.

  • How do you start pyspark to execute the spark.read.format code? Commented Nov 19, 2020 at 18:07
  • concatenate all file paths into a comma-delimited string: df = spark.read....load(','.join(s3_paths)) Commented Nov 19, 2020 at 21:24
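The comma-join suggested in the comment above works because Spark accepts a single path string containing multiple comma-separated paths. A plain-Python sketch of the string being built (no Spark required):

```python
s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

# Spark treats a comma-delimited path string as multiple input paths,
# so joining the list produces one argument that load() can consume:
joined = ','.join(s3_paths)
print(joined)
# s3a://somebucket/1/file1.xml,s3a://somebucket/3/file2.xml
```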

3 Answers


@jxc's answer in the comments to the question is the best solution:

df = spark.read.format("com.databricks.spark.xml")\
               .option("rowTag", "head")\
               .load(','.join(s3_paths))
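As a side note (a sketch, not part of the original answer): PySpark's `DataFrameReader.load` is documented to accept either a string or a list of strings for its `path` parameter, so on recent versions the list can likely be passed without joining:

```python
s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

# DataFrameReader.load(path=...) accepts a list of strings directly,
# so this should be equivalent to the comma-join above (requires a
# running SparkSession, hence commented out here):
# df = spark.read.format("com.databricks.spark.xml") \
#                .option("rowTag", "head") \
#                .load(s3_paths)
```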

Here is an example using a toy dataset:

fnames = ['books_part1.xml','books_part2.xml'] # part1 -> ids bk101-bk106, part2 -> ids bk107-bk112

df = spark.read.format('xml') \
               .option('rowTag', 'book') \
               .load(','.join(fnames))

df.show()

# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |  _id|              author|         description|          genre|price|publish_date|               title|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |bk101|Gambardella, Matthew|An in-depth look ...|       Computer|44.95|  2000-10-01|XML Developer's G...|
# |bk102|          Ralls, Kim|A former architec...|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
# |bk103|         Corets, Eva|After the collaps...|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
# |bk104|         Corets, Eva|In post-apocalyps...|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
# |bk105|         Corets, Eva|The two daughters...|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
# |bk106|    Randall, Cynthia|When Carla meets ...|        Romance| 4.95|  2000-09-02|         Lover Birds|
# |bk107|      Thurman, Paula|A deep sea diver ...|        Romance| 4.95|  2000-11-02|       Splish Splash|
# |bk108|       Knorr, Stefan|An anthology of h...|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
# |bk109|        Kress, Peter|After an inadvert...|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
# |bk110|        O'Brien, Tim|Microsoft's .NET ...|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
# |bk111|        O'Brien, Tim|The Microsoft MSX...|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
# |bk112|         Galos, Mike|Microsoft Visual ...|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+

You can check the following GitHub repo.


Just unpack the list:

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "head").load(*s3_paths)

1 Comment

Tried it - it doesn't work. Exception: java.lang.ClassNotFoundException: Failed to find data source: s3a://somebucket/3/file2.xml.
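The exception in the comment is consistent with `load`'s signature, `DataFrameReader.load(path=None, format=None, schema=None, **options)`: unpacking the list binds the second path to the `format` parameter, so Spark tries to resolve it as a data-source class and fails. A toy sketch of the mis-binding, using a hypothetical stand-in function rather than Spark itself:

```python
# Hypothetical stand-in that mirrors DataFrameReader.load's parameter order:
def load(path=None, format=None, schema=None, **options):
    return {"path": path, "format": format}

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

# Unpacking sends the second path into the `format` slot, which is why
# Spark reports it as a data source it cannot find:
print(load(*s3_paths))
# {'path': 's3a://somebucket/1/file1.xml', 'format': 's3a://somebucket/3/file2.xml'}
```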
