
I want to process large tables stored in Azure Data Lake Storage (Gen 1) by first running a U-SQL script on them, then a Python script, and finally outputting the result.

Conceptually this is pretty simple:

  1. Run a .usql script to generate intermediate data (two tables, intermediate_1 and intermediate_2) from a large initial_table
  2. Run a Python script over the intermediate data to generate the final result final

What Azure Machine Learning Pipeline steps should I use to do this?

I thought the following plan would work:

  1. Run the .usql script on an adla_compute using an AdlaStep like

    from azureml.data.data_reference import DataReference
    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import AdlaStep
    
    # initial_table is a DataReference to the source table on the ADLS datastore
    initial_table = DataReference(datastore=adls_datastore,
                                  data_reference_name="initial_table",
                                  path_on_datastore="initial_table")
    
    int_1 = PipelineData("intermediate_1", datastore=adls_datastore)
    int_2 = PipelineData("intermediate_2", datastore=adls_datastore)
    
    adla_step = AdlaStep(script_name='script.usql',
                         source_directory=sample_folder,
                         inputs=[initial_table],
                         outputs=[int_1, int_2],
                         compute_target=adla_compute)
    
  2. Run a Python step on a compute target aml_compute like

    from azureml.pipeline.steps import PythonScriptStep
    
    final = PipelineData("final", datastore=adls_datastore)
    
    python_step = PythonScriptStep(script_name="process.py",
                                   arguments=["--input1", int_1, "--input2", int_2, "--output", final],
                                   inputs=[int_1, int_2],
                                   outputs=[final],
                                   compute_target=aml_compute,
                                   source_directory=source_directory)
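
For reference, process.py just reads the mounted input paths and writes its result under the output path, along these lines (a sketch, with the actual processing elided):

    # process.py (sketch)
    import argparse
    import os
    
    parser = argparse.ArgumentParser()
    parser.add_argument("--input1")
    parser.add_argument("--input2")
    parser.add_argument("--output")
    args = parser.parse_args()
    
    # the PipelineData arguments resolve to mounted paths at run time
    os.makedirs(args.output, exist_ok=True)
    # ... read from args.input1 and args.input2, write the result under args.output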
    

This, however, fails at the Python step with an error of the kind:

    StepRun(process.py) Execution Summary
    ======================================
    StepRun(process.py) Status: Failed
    
    Unable to mount data store mydatastore because it does not specify a storage account key.

I don't really understand the error complaining about 'mydatastore', which is the name tied to the adls_datastore Azure Data Lake datastore against which I am running the U-SQL queries.
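
For context, adls_datastore was registered roughly as follows (a sketch; the datastore name matches the one in the error, but the account name and service-principal credentials below are placeholders):

    from azureml.core import Datastore
    
    adls_datastore = Datastore.register_azure_data_lake(
        workspace=ws,                       # ws is the AzureML workspace
        datastore_name="mydatastore",
        store_name="myadlsaccount",         # name of the ADLS Gen1 account
        tenant_id="<tenant-id>",            # service principal credentials
        client_id="<client-id>",
        client_secret="<client-secret>")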

Can someone tell whether I am doing something wrong here? Should I move the intermediate data (intermediate_1 and intermediate_2) to a storage account, e.g. with a DataTransferStep, before the PythonScriptStep?

2 Answers


ADLS does not support mounting on AML compute. So you are right: you will have to use a DataTransferStep to move the data to blob storage first.
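
Note that a DataTransferStep runs on an Azure Data Factory compute, so one needs to be attached to the workspace first. A minimal sketch of provisioning it (the workspace variable ws and the factory name are placeholders):

    from azureml.core.compute import ComputeTarget, DataFactoryCompute
    from azureml.exceptions import ComputeTargetException
    
    data_factory_name = "adftransfer"   # placeholder name
    
    try:
        # reuse the compute if it is already attached to the workspace
        data_factory_compute = DataFactoryCompute(ws, data_factory_name)
    except ComputeTargetException:
        # otherwise provision a new Data Factory compute
        provisioning_config = DataFactoryCompute.provisioning_configuration()
        data_factory_compute = ComputeTarget.create(ws, data_factory_name, provisioning_config)
        data_factory_compute.wait_for_completion()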




Azure Data Lake Store is not supported as a datastore for AML compute. This table lists the different compute targets and their level of support for the different datastores: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#compute-and-datastore-matrix

You can use a DataTransferStep to copy the data from ADLS to blob storage and then use that blob as the input for the PythonScriptStep. Sample notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb


    from azureml.core import Datastore
    from azureml.data.data_reference import DataReference
    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import DataTransferStep, PythonScriptStep
    
    # register a blob datastore to hold the intermediate copies
    # (see the linked notebook; the account details below are placeholders)
    blob_datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name="blob_datastore",
        container_name="intermediate",
        account_name="<storage-account-name>",
        account_key="<storage-account-key>")
    
    int_1_blob = DataReference(
        datastore=blob_datastore,
        data_reference_name="int_1_blob",
        path_on_datastore="int_1")
    
    copy_int_1_to_blob = DataTransferStep(
        name='copy int_1 to blob',
        source_data_reference=int_1,
        destination_data_reference=int_1_blob,
        compute_target=data_factory_compute)
    
    int_2_blob = DataReference(
        datastore=blob_datastore,
        data_reference_name="int_2_blob",
        path_on_datastore="int_2")
    
    copy_int_2_to_blob = DataTransferStep(
        name='copy int_2 to blob',
        source_data_reference=int_2,
        destination_data_reference=int_2_blob,
        compute_target=data_factory_compute)
    
    # update the PythonScriptStep to use the blob data references;
    # the output must also live on a mountable (blob) datastore
    final = PipelineData("final", datastore=blob_datastore)
    
    python_step = PythonScriptStep(script_name="process.py",
                                   arguments=["--input1", int_1_blob, "--input2", int_2_blob, "--output", final],
                                   inputs=[int_1_blob, int_2_blob],
                                   outputs=[final],
                                   compute_target=aml_compute,
                                   source_directory=source_directory)
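
Putting it together, the steps can then be assembled into a pipeline and submitted as usual (a sketch, assuming a workspace ws; the experiment name is a placeholder):

    from azureml.core import Experiment
    from azureml.pipeline.core import Pipeline
    
    pipeline = Pipeline(workspace=ws,
                        steps=[adla_step, copy_int_1_to_blob, copy_int_2_to_blob, python_step])
    run = Experiment(ws, "adls_to_blob_pipeline").submit(pipeline)
    run.wait_for_completion()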

