
My batch processing pipeline in Azure has the following scenario: I am using the Copy activity in Azure Data Factory to unzip thousands of zip files stored in a blob storage container. The zip files sit in a nested folder structure inside the container, e.g.

zipContainer/deviceA/component1/20220301.zip

The resulting unzipped files are written to another container, preserving the hierarchy via the sink's copy behavior option, e.g.

unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv

I enabled logging on the copy activity and provided a folder path for the generated logs (in txt format). The logs have the following structure:

Timestamp Level OperationName OperationItem Message
2022-03-01 15:14:06.9880973 Info FileWrite "deviceA/component1/2022.zip/measurements_01.csv" "Complete writing file. File is successfully copied."

I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:

Logs <- read.df(log_path, source = "csv", header="true", delimiter=",")

The following exception is returned:

Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

The logs generated by the copy activity are of append blob type; read.df() can read block blobs without any issue.

Given this scenario, how can I read these logs into my R session in Databricks?

  • What Databricks runtime is being used? Commented May 19, 2022 at 14:46
  • I am using 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12). Commented May 19, 2022 at 14:48

2 Answers


According to this Microsoft documentation, the Azure Databricks and Hadoop Azure WASB implementations do not support reading append blobs:

https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types

When you try to read a log file of append blob type, you get the error: Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

So you cannot read the append-blob log file from the blob storage account. A solution is to use an Azure Data Lake Storage Gen2 container for logging instead: when you run the pipeline with ADLS Gen2 as the log destination, the log file is created as a block blob, which Databricks can read without any issue (see the SparkR sketch after the screenshots below).

Using blob storage for logging:

[screenshot: the generated log file when logging to blob storage]

Using ADLS Gen2 for logging:

[screenshot: the generated log file when logging to ADLS Gen2]
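
For example, once the pipeline writes its logs to ADLS Gen2, a minimal SparkR sketch for pulling the written CSV paths out of the log could look like this (the abfss path and folder names are placeholders, not the actual locations from the question):

library(SparkR)

# Placeholder ADLS Gen2 location where the copy activity writes its logs
log_path <- "abfss://logs@<storage-account-name>.dfs.core.windows.net/copy-activity-logs/"

# The copy-activity log is a comma-delimited text file with a header row
Logs <- read.df(log_path, source = "csv", header = "true", delimiter = ",")

# Keep only the FileWrite entries; OperationItem holds the path of each written file
written <- select(filter(Logs, Logs$OperationName == "FileWrite"), "OperationItem")
head(written)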




As mentioned earlier, WASB does not support reading append blobs. However, ABFSS does support it.

Standard operations such as DataFrame reads still go through WASB by default, but you can force ABFSS by providing an abfss:// path.

In a Python notebook, you could do the following to read an append blob:

df = spark.read.format("csv").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

Don't forget to grant access to your storage account first, e.g. with spark.conf.set; a SparkR version of the same read is sketched below.
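
Since the original question is about an R notebook, a SparkR equivalent of the read above might look like the following sketch. It assumes access to the storage account is already configured for the session (for example, the account key set through the cluster's Spark configuration or via spark.conf.set in a Python cell, as mentioned above); the container, account and path placeholders mirror the ones in the PySpark example.

library(SparkR)

# Assumes access to the storage account is already configured
# (e.g. fs.azure.account.key.<storage-account-name>.dfs.core.windows.net
# set in the cluster's Spark config), as noted above.
Logs <- read.df(
  "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>",
  source = "csv", header = "true", delimiter = ","
)
head(Logs)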

Hope this helps.

