
Running Azure Databricks on Runtime 8.4.

I have a csv file containing data from AdventureWorks. I have placed it in a blob container in Azure Data Lake Storage Gen2.

I have mounted the container using dbutils.fs.mount, with configs set to use OAuth and a Databricks secret scope. The mount succeeds and I can list the csv files in it.
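
For reference, the mount was set up roughly like this (container, storage account, secret scope, and key names here are placeholders, not the real values):

%python

# Sketch of the mount: OAuth via a service principal, with the client secret pulled from a Databricks secret scope
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<client-secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/csv_source",
    extra_configs=configs)

display(dbutils.fs.ls("/mnt/csv_source"))  # DimAccount.csv shows up here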

I have created a database, and then created a table.

create database if not exists work_db;

CREATE OR REPLACE TABLE dim_account
(
  `AccountKey` STRING,
  `ParentAccountKey` STRING,
  `AccountCodeAlternateKey` STRING,
  `ParentAccountCodeAlternateKey` STRING,
  `AccountDescription` STRING,
  `AccountType` STRING,
  `Operator` STRING,
  `CustomMembers` STRING,
  `ValueType` STRING,
  `CustomMemberOptions` STRING,
  `corrupt` STRING
);

I have run all kinds of variations of the following (trying different format options):

use work_db;

truncate table dim_account;

copy into dim_account
  from (
    select AccountKey, ParentAccountKey, AccountCodeAlternateKey, ParentAccountCodeAlternateKey,
    AccountDescription, AccountType, Operator, CustomMembers, ValueType, CustomMemberOptions
    from 'dbfs:/mnt/csv_source'
  )
  FILEFORMAT = csv
  FILES = ('DimAccount.csv')
  FORMAT_OPTIONS('header'='true','columnNameOfCorruptRecord'='corrupt')
;

select * from dim_account;

I believe there was a point where it pulled data from the csv file, but it now does not. Running the COPY INTO (without the final select) gives the following output:

num_affected_rows    num_inserted_rows
0                    0

But, if I do something like the following (also tried a number of variations):

%python

dataSchema = "AccountKey STRING, ParentAccountKey STRING, AccountCodeAlternateKey STRING, ParentAccountCodeAlternateKey STRING, AccountDescription STRING, AccountType STRING, Operator STRING, CustomMembers STRING, ValueType STRING, CustomMemberOptions STRING, corrupt STRING"

# read the csv directly, applying the explicit schema and sending malformed rows to the 'corrupt' column
diamonds = spark.read.csv('/mnt/csv_source/DimAccount.csv', \
    header=True, schema=dataSchema, enforceSchema=True, columnNameOfCorruptRecord='corrupt').cache()

diamonds.createOrReplaceTempView('dim_account_error_on_load')

I have no problem retrieving data.

I really don't know what's going on with the COPY INTO (I'm new to all this). Work would prefer to use SQL (or at most something like spark.sql()), so I'm trying to make COPY INTO work. COPY INTO can also operate on multiple files with a given structure, which we are interested in, but for now I'm just trying to get one file to work, let alone multiple.
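
For the multi-file case, my understanding from the docs is that COPY INTO can take a PATTERN glob in place of the explicit FILES list, something along these lines (the pattern here is just an illustration):

copy into dim_account
  from (
    select AccountKey, ParentAccountKey, AccountCodeAlternateKey, ParentAccountCodeAlternateKey,
    AccountDescription, AccountType, Operator, CustomMembers, ValueType, CustomMemberOptions
    from 'dbfs:/mnt/csv_source'
  )
  FILEFORMAT = csv
  PATTERN = 'Dim*.csv'
  FORMAT_OPTIONS('header'='true','columnNameOfCorruptRecord'='corrupt')
;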

  • Have you tried using the full path for reading and writing, e.g. delta.`abfss://<container>@<storage-account>.dfs.core.windows.net/deltaTables/target`? Commented Oct 11, 2021 at 9:48
  • @AliHasan No, I had not. If I change the path in the "from" I get an error that suggests an access issue; I'm assuming I'd have to modify security to have Databricks use that route rather than OAuth, secret scopes, and the mount. However, I can prove that COPY INTO sees the file: if I use header=True and don't reference the correct column names, I get an error, and if I set one of the target table's columns to INT instead of STRING, I get an error about implicitly converting the source column to INT (which will fail). So COPY INTO is in touch with both sides. Commented Oct 11, 2021 at 19:54
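
(For what it's worth, my understanding is that using the full abfss:// path directly would mean configuring the same service-principal OAuth credentials on the Spark session instead of on a mount; a rough sketch, with placeholder names:)

%python

# Rough sketch (placeholder names): set the same OAuth credentials as session config
# so abfss:// paths can be read directly, without the mount
acct = "<storage-account>.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{acct}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{acct}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{acct}", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{acct}",
               dbutils.secrets.get(scope="<scope-name>", key="<client-secret-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{acct}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = spark.read.csv(f"abfss://<container>@{acct}/DimAccount.csv", header=True)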

1 Answer


I got an answer from work. The issue is very simple: COPY INTO keeps track of the files it has already processed, and by default, if you attempt to process the same file again (at least by name), it won't load the data. There is an option to force the load of such a file.
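
If I've read the docs right, the option lives in COPY_OPTIONS; adding 'force' = 'true' to my original statement makes it re-load a file it has already processed:

copy into dim_account
  from (
    select AccountKey, ParentAccountKey, AccountCodeAlternateKey, ParentAccountCodeAlternateKey,
    AccountDescription, AccountType, Operator, CustomMembers, ValueType, CustomMemberOptions
    from 'dbfs:/mnt/csv_source'
  )
  FILEFORMAT = csv
  FILES = ('DimAccount.csv')
  FORMAT_OPTIONS('header'='true','columnNameOfCorruptRecord'='corrupt')
  COPY_OPTIONS('force'='true')
;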

Sigh... it's hard being a noob.
