
I'm using Airflow, Databricks, and PySpark. I would like to know whether it is possible to pass more parameters when executing a Databricks notebook through Airflow.

I have the following Python code in a notebook named MyETL:

def main(**kwargs):
    table = kwargs["table"]
    columns = kwargs["columns"]
    spark.sql("CREATE TABLE {0} {1}".format(table, columns))
    print("Running my ETL!")

if __name__ == "__main__":
    main(table=arg1, columns=arg2)

I want to define task params that run the Databricks notebook with more parameters: I want to add the name of the method to run and the parameters of that method. For example, when I register tasks in a DAG in Airflow:

notebook_task_params = {
    'new_cluster': new_cluster,
    'notebook_task': {
        'notebook_path': '/Users/[email protected]/MyETL',
        'method_name': 'main',
        'params': "[{'table': 'A'}, {'columns': ['a', 'b']}]"
    },
}

I don't know if that is possible because I didn't find similar examples.

# Example of using the JSON parameter to initialize the operator.
notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    dag=dag,
    json=notebook_task_params)

In other words, I want to execute a notebook with parameters through Airflow. How can I do that?

1 Answer


You can pass the method_name as one of the params as well, then branch on it inside the notebook.

However, the more common pattern here is to make sure the method is already installed on your cluster.

params = "[{'table': 'A'}, {'columns': ['a', 'b']}]"
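For reference, here is a minimal sketch (not your exact DAG; new_cluster and dag are assumed to be defined elsewhere in the file) of how the operator JSON could be shaped so the extra values actually reach the notebook. The Runs Submit API that DatabricksSubmitRunOperator wraps has no method_name field, so everything custom goes under base_parameters, which is a flat map of strings:

import json

# Import path for recent Airflow versions; older releases used
# airflow.contrib.operators.databricks_operator instead.
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

notebook_task_params = {
    'new_cluster': new_cluster,  # assumed to be defined elsewhere in the DAG file
    'notebook_task': {
        'notebook_path': '/Users/[email protected]/MyETL',
        'base_parameters': {
            'method_name': 'main',              # passed as an ordinary parameter
            'table': 'A',
            'columns': json.dumps(['a', 'b']),  # non-string values are JSON-encoded
        },
    },
}

notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    dag=dag,  # assumed to exist
    json=notebook_task_params,
)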

Then, in your notebook on Databricks:

# getArgument reads a notebook parameter (widget) by name, with a fallback default
table = getArgument("table", "DefaultValue")
columns = getArgument("columns", "DefaultValue")

# 'method' here is whatever function is already available on the cluster
result = method(table, columns)
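And if you do pass method_name through as well, a rough sketch of dispatching on it inside the notebook could look like the following (the methods dict and the JSON decoding of columns are just illustrative assumptions):

import json

# All base_parameters arrive as strings
method_name = getArgument("method_name", "main")
table = getArgument("table", "DefaultValue")
columns = json.loads(getArgument("columns", "[]"))  # decode the JSON-encoded list

# Hypothetical lookup table mapping parameter values to functions defined in the notebook
methods = {"main": main}
result = methods[method_name](table=table, columns=columns)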

You'll also know whether the params are accessible with getArgument() by checking that they show up on the job run page for your notebook in Databricks.



2 Comments

Is getArgument an Airflow method? So the keyword "params" is okay? How would you write the notebook_task_params?
getArgument is a special Databricks function that grabs a variable from the environment. It is apparently deprecated in favour of get now: docs.databricks.com/user-guide/dev-tools/…. Your DAG seems fine, apart from method_name. Try submitting a job run and calling print(get('table')) in your notebook to test whether the params were passed through properly. You should also see your params in the Databricks notebook of the job run. I'll attach an image to my previous answer.
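A minimal sketch of the newer widgets API mentioned in the comment above, assuming a Databricks notebook context where dbutils is available:

# dbutils.widgets.get raises an error if the parameter was not passed,
# whereas getArgument("table", "DefaultValue") falls back to a default.
table = dbutils.widgets.get("table")
print(table)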
