
Recently, Databricks launched Databricks Connect, which allows you to write jobs using native Spark APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session.
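For context, a minimal sketch of what that looks like, assuming the client has already been set up with databricks-connect configure (cluster details are placeholders supplied during that step):

from pyspark.sql import SparkSession

# With Databricks Connect installed and configured, getOrCreate()
# returns a session whose jobs run on the remote Databricks cluster.
spark = SparkSession.builder.getOrCreate()

# This count() executes on the remote cluster, not locally.
print(spark.range(10).count())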

It works fine except when I try to access files in Azure Data Lake Storage Gen2. When I execute this:

spark.read.json("abfss://...").count()

I get this error:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)

Does anybody know how to fix this?


2 Answers


If you mount the storage with a service principal rather than accessing it directly, you should find this works: https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake-gen2.html
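For reference, a minimal sketch of the mount pattern from that doc. It runs once from a notebook on the cluster (where dbutils is predefined), and every <...> value is a placeholder for your own tenant, secret scope, and storage details:

# Mount ADLS Gen2 with a service principal (OAuth), per the linked doc.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs)

# Databricks Connect jobs can then read through the mount point:
spark.read.json("/mnt/<mount-name>/path/to/data").count()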

I posted some notes on the limitations of Databricks Connect here: https://datathirst.net/blog/2019/3/7/databricks-connect-limitations


1 Comment

This is a useful workaround, thanks! Do you know if there is any way to report this issue (and/or other limitations you found) to Databricks?

Likely too late, but for completeness' sake: there's one issue to look out for here. If you have this Spark conf set, you'll see that exact error (which is pretty hard to unpack):

fs.abfss.impl org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem

So double-check your Spark configs for that setting, and make sure you have permission to access ADLS Gen2 directly using the storage account access key.
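A minimal sketch of that check and of the direct access-key approach; the storage account, container, and key values are placeholders:

# Check whether the problematic override is set (prints None if not).
print(spark.conf.get("fs.abfss.impl", None))

# With the override removed, access ADLS Gen2 directly via the
# storage account access key instead.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-access-key>")

spark.read.json(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data"
).count()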

