
I am trying to pass data from one component to the next in an Azure ML pipeline. I am able to do it in a simple script.

I have 2 components and I am defining them as below:

components_dir = "."
prep = load_component(source=f"{components_dir}/preprocessing_config.yml")
middle = load_component(source=f"{components_dir}/middle_config.yml")

Then I am defining a pipeline as below:

@pipeline(
    display_name="test_pipeline3",
    tags={"authoring": "sdk"},
    description="test pipeline to test things just like all other test pipelines."
)
def data_pipeline(
    # raw_data: Input,
    compute_train_node: str,
):
   
    prep_node = prep()

    prep_node.outputs.Y_df = Output(type="uri_folder", mode="rw_mount", path="path/testing/")
    prep_node.outputs.S_df = Output(type="uri_folder", mode="rw_mount", path="path/testing/")


    transform_node = middle(Y_df=prep_node.outputs.Y_df,
                            S_df=prep_node.outputs.S_df)

The prep node has a script that uses Hydra to pull in parameters from a config file. The component also has a config file that kicks off the script on the command line as below:

  python preprocessing_script.py
  --Y_df ${{outputs.Y_df}} 
  --S_df ${{outputs.S_df}}

I try to get the values of Y_df.path and S_df.path in the main function of the prep script as below:

@hydra.main(version_base=None, config_path=".", config_name="config_file")
def main(cfg: DictConfig):

    parser = argparse.ArgumentParser("prep")
    parser.add_argument("--Y_df", type=str, help="Path of prepped data")
    parser.add_argument("--S_df", type=str, help="Path of prepped data")
    args = parser.parse_args()
   
    # Call the preprocessing function with Hydra configurations
    df1,df2 = processing_func(cfg.data_name,cfg.prod_filter)
    df1.to_csv(Path(cfg.Y_df) / "Y_df.csv")
    df2.to_csv(Path(cfg.S_df) / "S_df.csv")

When I run all of this, I get an error in the prep component itself saying

Execution failed. User process 'python' exited with status code 2. Please check log file 'user_logs/std_log.txt' for error details. Error: /bin/bash: /azureml-envs/azureml_bbh34278yrnrfuehn78340/lib/libtinfo.so.6: no version information available (required by /bin/bash)
usage: data_processing.py [--help] [--hydra-help] [--version]
                          [--cfg {job,hydra,all}] [--resolve]
                          [--package PACKAGE] [--run] [--multirun]
                          [--shell-completion] [--config-path CONFIG_PATH]
                          [--config-name CONFIG_NAME]
                          [--config-dir CONFIG_DIR]
                          [--experimental-rerun EXPERIMENTAL_RERUN]
                          [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
                          [overrides ...]
data_processing.py: error: unrecognized arguments: --Y_df --S_df /mnt/azureml/cr/j/ffyh7fs984ryn8f733ff3/cap/data-capability/wd/S_df

The code runs fine and data is transferred between the components when Hydra is not involved, but when Hydra is involved I get this error. Why is that so?

Edit: Below is the data component config file for prep:

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command

name: preprocessing24
display_name: preprocessing24


outputs:
  Y_df:
    type: uri_folder

  S_df:
    type: uri_folder



code: ./preprocessing_final


environment: azureml:datapipeline-environment:4

command: >-
  python data_processing.py

The data preprocessing config file just contains a bunch of variables, but I have added 2 more:

Y_df:
  random_txt

S_df:
  random_txt

the main function of the data processing script is mentioned above.

Comments:
  • add the config file and code you are running in data_processing.py.
  • ok, editing the question
  • Pass the args like this: python data_processing.py --Y_df ${{outputs.Y_df}} --S_df ${{outputs.S_df}}. Also, only defining the main function doesn't work; you need to call it inside the Python file.
  • When I pass the args like that, I get the error I mentioned above.

2 Answers


Hydra and argparse are natively not compatible, as Hydra handles the parsing itself.

If you want to combine both, it's easiest not to use @hydra.main but Hydra's Compose API, which takes care of some but not all of the setup features; IIRC, the custom logger output was not included the last time I used it.

The arguments for the Compose API align with those of hydra.main; for the argparser, use ArgumentParser.parse_known_args:

import sys
import argparse
from hydra import compose, initialize
from omegaconf import OmegaConf   # optional for printing

def main():
    # Parse only the arguments we know; leave the rest for Hydra
    parser = argparse.ArgumentParser("prep")
    parser.add_argument("--Y_df", type=str, help="Path of prepped data")
    parser.add_argument("--S_df", type=str, help="Path of prepped data")
    args, unparsed_args = parser.parse_known_args()  # <- ignore unknown args

    # Before running Hydra, remove the already-parsed arguments
    sys.argv[1:] = unparsed_args

    initialize(version_base=None, config_path="conf", job_name="test_app")
    cfg = compose(config_name="config", overrides=["db=mysql", "db.user=me"])
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()

Alternatively, you can also parse the args before invoking @hydra.main, in a similar way.

import sys
import argparse
import hydra
from omegaconf import DictConfig

# guard with if __name__ == "__main__": if needed
parser = argparse.ArgumentParser("prep")
parser.add_argument("--Y_df", type=str, help="Path of prepped data")
parser.add_argument("--S_df", type=str, help="Path of prepped data")
args, unparsed_args = parser.parse_known_args()
sys.argv[1:] = unparsed_args  # leave only Hydra's arguments

@hydra.main(version_base=None, config_path=".", config_name="config_file")
def main(cfg: DictConfig):
    # work with cfg and args, or merge them
    ...

if __name__ == "__main__":
    main()
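To see what parse_known_args does with this kind of invocation, here is a small self-contained sketch (the paths and the db=mysql override are made up for illustration):

```python
import argparse

# Simulated command line: argparse-style flags mixed with a Hydra override
argv = ["--Y_df", "/mnt/out/Y", "--S_df", "/mnt/out/S", "db=mysql"]

parser = argparse.ArgumentParser("prep")
parser.add_argument("--Y_df", type=str)
parser.add_argument("--S_df", type=str)

# parse_known_args consumes the flags it knows and returns the rest
args, unparsed = parser.parse_known_args(argv)

print(args.Y_df)   # /mnt/out/Y
print(unparsed)    # ['db=mysql'] -> after sys.argv[1:] = unparsed, this is all Hydra sees
```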



OK, here is what was happening.

This notation in the component's command did not work:

  python preprocessing_script.py
  --Y_df ${{outputs.Y_df}} 
  --S_df ${{outputs.S_df}}

That's because Hydra does not like that notation (I think).

Instead, this notation worked:

  python data_processing.py '+Y_df=${{outputs.Y_df}}' '+S_df=${{outputs.S_df}}'

What this does is add those 2 new variables, Y_df and S_df, to the config. These variables can then be accessed in the program just like all other variables in the config file, via cfg.Y_df or cfg.S_df.
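For reference, in the component YAML this would look roughly as follows (a sketch of the prep component's command field; the single quotes keep the shell from mangling the overrides, and Azure ML substitutes ${{outputs.*}} before the command runs):

```yaml
command: >-
  python data_processing.py
  '+Y_df=${{outputs.Y_df}}'
  '+S_df=${{outputs.S_df}}'
```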

