I've designed a data transformation in Dataprep and am now attempting to run it using the template in Dataflow. My flow has several inputs and outputs; the Dataflow template provides them as a JSON object with a key/value pair for each input and its location. They look like this (line breaks added for easy reading):

{
    "location1": "project:bq_dataset.bq_table1",
    #...
    "location10": "project:bq_dataset.bq_table10",
    "location17": "project:bq_dataset.bq_table17"
}

I have 17 inputs (mostly lookups) and 2 outputs (one CSV, one BigQuery). I'm passing these to the gcloud CLI like this:

gcloud dataflow jobs run job-201807301630 \
    --gcs-location=gs://bucketname/dataprep/dataprep_template \
    --parameters inputLocations={"location1":"project..."},outputLocations={"location1":"gs://bucketname/output.csv"}

But I'm getting an error:

ERROR: (gcloud.dataflow.jobs.run) unrecognized arguments:
inputLocations=location1:project:bq_dataset.bq_table1,outputLocations=location2:project:bq_dataset.bq_output1
inputLocations=location10:project:bq_dataset.bq_table10,outputLocations=location1:gs://bucketname/output.csv

From the error message, it looks like the inputs and outputs are being merged: because I have two outputs, the inputs are paired off with them in turn:

input1:output1
input2:output2
input3:output1
input4:output2
input5:output1
input6:output2
...

I've tried quoting the input/output objects (single and double quotes, plus removing the quotes inside the object), wrapping them in [], and using tildes, but no joy. Has anyone managed to execute a Dataflow job with multiple inputs?

1 Answer

I finally found a solution for this via a huge process of trial and error. There are several steps involved.

Format of --parameters

The --parameters argument is a dictionary-type argument. These are described in a document you can read by typing gcloud topic escaping in the CLI, but in short it means you need an = between --parameters and its arguments, and the arguments are key=value pairs with each value enclosed in double quotes ("):

--parameters=inputLocations="object",outputLocations="object"

Escape the objects

Then, the double quotes inside each object need escaping with a backslash so that they don't end the value prematurely, so

{"location1":"gs://bucket/whatever"...

becomes

{\"location1\":\"gs://bucket/whatever\"...
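To see what the escaping achieves, here is a minimal sketch that needs no gcloud at all. It assumes a POSIX-style shell such as bash; the variable names are made up for illustration. The shell removes the outer quotes and the backslashes, so the program receives plain JSON, and a single-quoted string is an equivalent way to write the same value.

```shell
# Double-quoted value with escaped inner quotes, as described above.
# The shell strips the outer quotes and the backslashes before passing it on.
escaped="{\"location1\":\"gs://bucket/whatever\"}"

# Equivalent single-quoted form: nothing inside single quotes needs escaping.
single='{"location1":"gs://bucket/whatever"}'

# Both print the same plain JSON text.
printf '%s\n' "$escaped"
printf '%s\n' "$single"
```

Single quotes can be easier to read for long objects, but they can't contain shell variables, so which form is more convenient depends on how the objects are built.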

Choose a different separator

Next, the CLI gets confused because the key=value pairs are separated by commas, but the objects in the values also contain commas. So you can define a different separator by putting it between carets (^) at the start of the argument and then using it between the key=value pairs:

--parameters=^*^inputLocations="{\"location1\":\"...\"}"*outputLocations="{\"location1\":\"...\"}"

I used * because ; didn't work, most likely because an unquoted ; is the shell's command separator, so it ends the gcloud command early.
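The comma problem can be illustrated with a rough sketch that simply counts how many pieces a string falls into for a given separator. This is not gcloud's actual parser, just a naive split; count_fields and the two variables are hypothetical names, and the table names are the ones from the question.

```shell
# Count the fields a naive split on one separator character would produce.
# $1 = separator character, $2 = string to split
count_fields() {
    seps=$(printf '%s' "$2" | tr -cd "$1")   # keep only the separator characters
    echo $(( ${#seps} + 1 ))
}

# Two key=value pairs joined by a comma; the first object contains a comma too.
comma_params='inputLocations={"location1":"project:bq_dataset.bq_table1","location10":"project:bq_dataset.bq_table10"},outputLocations={"location1":"gs://bucketname/output.csv"}'

# The same two pairs joined by * instead.
star_params='inputLocations={"location1":"project:bq_dataset.bq_table1","location10":"project:bq_dataset.bq_table10"}*outputLocations={"location1":"gs://bucketname/output.csv"}'

count_fields ',' "$comma_params"   # prints 3: the object itself gets split
count_fields '*' "$star_params"    # prints 2: one field per key=value pair
```

Any character that never appears in the values should work as the separator, as long as the shell doesn't treat it specially.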

Note also that the gcloud topic escaping info says:

In cmd.exe and PowerShell on Windows, ^ is a special character and you must escape it by repeating it. In the following examples, every time you see ^, replace it with ^^^^.

Don't forget customGcsTempLocation

After all that, I'd forgotten that customGcsTempLocation needs adding to the key=value pairs in the --parameters argument. Don't forget to separate it from the others with a * and enclose it in quote marks again:

...}*customGcsTempLocation="gs://bucket/whatever"
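Putting the steps together, the whole call might be assembled like this. It is a sketch rather than a tested recipe: it assumes bash on Linux or macOS, reuses the job name, bucket and table names from the question, uses single quotes instead of backslash escaping, and shortens the inputs to a single entry. INPUTS, OUTPUTS, TEMP_LOCATION, PARAMS and run_job are names invented here.

```shell
# JSON objects in single quotes, so the inner double quotes need no escaping.
INPUTS='{"location1":"project:bq_dataset.bq_table1"}'
OUTPUTS='{"location1":"gs://bucketname/output.csv"}'
TEMP_LOCATION='gs://bucket/whatever'

# ^*^ declares * as the separator between the key=value pairs.
PARAMS="^*^inputLocations=${INPUTS}*outputLocations=${OUTPUTS}*customGcsTempLocation=${TEMP_LOCATION}"

# Wrapped in a function so nothing is submitted until run_job is called.
run_job() {
    gcloud dataflow jobs run job-201807301630 \
        --gcs-location=gs://bucketname/dataprep/dataprep_template \
        --parameters="$PARAMS"
}

# Inspect the argument first, then call run_job once it looks right.
printf '%s\n' "--parameters=$PARAMS"
```

Printing the argument before submitting makes it easier to spot a missing separator or quote than reading it back from a gcloud error message.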

Pretty much none of this is explained in the online documentation, so that's several days of my life I won't get back - hopefully I've helped someone else with this.

2 Comments

Hello Adam, thanks for this post. It helped me deploy my pipeline. On a different note, are you doing all the ETL operations via JavaScript? Is there a reference article I can refer to that would help with developing complex ETL operations on Dataflow? Any pointers will help a lot.
@akashsharma no, we're using Cloud Dataprep, which generates a Dataflow template
