
My batch processing pipeline in Azure has the following scenario: I am using the Copy activity in Azure Data Factory to unzip thousands of zip files stored in a blob storage container. The zip files sit in a nested folder structure inside the container, e.g.

zipContainer/deviceA/component1/20220301.zip

The resulting unzipped files are written to another container, preserving the hierarchy via the sink's copy behavior option, e.g.

unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv

I enabled logging on the copy activity and provided a folder path for the generated logs (in txt format). The logs have the following structure:

Timestamp Level OperationName OperationItem Message
2022-03-01 15:14:06.9880973 Info FileWrite "deviceA/component1/2022.zip/measurements_01.csv" "Complete writing file. File is successfully copied."

I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:

Logs <- read.df(log_path, source = "csv", header="true", delimiter=",")

The following exception is returned:

Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

The logs generated by the copy activity are of append blob type; read.df() can read block blobs without any issue.

Given this scenario, how can I read these logs into my R session in Databricks?

  • What Databricks runtime is being used? Commented May 19, 2022 at 14:46
  • I am using 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12). Commented May 19, 2022 at 14:48

2 Answers


According to this Microsoft documentation, the Azure Databricks and Hadoop Azure WASB implementations do not support reading append blobs:

https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types

When you try to read a log file of append blob type, you get the error: Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

So you cannot read the append-blob log file from the blob storage account. A solution is to use an Azure Data Lake Storage Gen2 container for logging instead: when you run the pipeline with ADLS Gen2 as the log destination, the log file is created as a block blob, which Databricks can read without any issue (see the SparkR sketch after the screenshots below).

Using blob storage for logging:

[screenshot: the generated log file when logging to blob storage]

Using ADLS Gen2 for logging:

[screenshot: the generated log file when logging to ADLS Gen2]
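
For example, once the pipeline writes its logs to ADLS Gen2, a minimal SparkR sketch for pulling the written CSV paths out of the log could look like this (the abfss path and folder names are placeholders, not the actual locations from the question):

library(SparkR)

# Placeholder ADLS Gen2 location where the copy activity writes its logs
log_path <- "abfss://logs@<storage-account-name>.dfs.core.windows.net/copy-activity-logs/"

# The copy-activity log is a comma-delimited text file with a header row
Logs <- read.df(log_path, source = "csv", header = "true", delimiter = ",")

# Keep only the FileWrite entries; OperationItem holds the path of each written file
written <- select(filter(Logs, Logs$OperationName == "FileWrite"), "OperationItem")
head(written)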




As mentioned earlier, WASB does not support reading append blobs. However, ABFSS does support it.

Standard operations such as DataFrame reads still go through WASB by default, but you can force ABFSS by providing an abfss:// path.

In a Python notebook, you could do the following to read an append blob:

df = spark.read.format("csv").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

Don't forget to grant access to your storage account first, e.g. with spark.conf.set; a SparkR version of the same read is sketched below.
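
Since the original question is about an R notebook, a SparkR equivalent of the read above might look like the following sketch. It assumes access to the storage account is already configured for the session (for example, the account key set through the cluster's Spark configuration or via spark.conf.set in a Python cell, as mentioned above); the container, account and path placeholders mirror the ones in the PySpark example.

library(SparkR)

# Assumes access to the storage account is already configured
# (e.g. fs.azure.account.key.<storage-account-name>.dfs.core.windows.net
# set in the cluster's Spark config), as noted above.
Logs <- read.df(
  "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>",
  source = "csv", header = "true", delimiter = ","
)
head(Logs)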

Hope this helps.

