
May I know how to execute HDFS copy commands on a Dataproc cluster using Airflow? After the cluster is created using Airflow, I have to copy a few jar files from Google Storage to a folder on the HDFS master node.

2 Answers


You can execute HDFS commands on a Dataproc cluster using something like this:

gcloud dataproc jobs submit hdfs 'ls /hdfs/path/' --cluster=my-cluster --region=europe-west1

The easiest way is via Pig's fs command [1]:

gcloud dataproc jobs submit pig --execute 'fs -ls /'

or otherwise via Pig's sh command [2] as a catch-all for other shell commands.
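
Since the question is about Airflow, the same Pig trick can be wrapped in a DAG task. Below is a minimal sketch using DataprocSubmitJobOperator from the Google provider package; the project, region, cluster name, and paths are placeholders, not values from the question:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Sketch only: project, region, cluster name, and paths are placeholders.
PROJECT_ID = "my-project"
REGION = "europe-west1"
CLUSTER_NAME = "my-cluster"

# Pig's `fs` command forwards its arguments to the HDFS shell, so this
# task copies a jar from GCS into HDFS on the cluster itself.
copy_jar_to_hdfs = DataprocSubmitJobOperator(
    task_id="copy_jar_to_hdfs",
    project_id=PROJECT_ID,
    region=REGION,
    job={
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER_NAME},
        "pig_job": {
            "query_list": {"queries": ["fs -cp gs://my-bucket/my.jar /jars/my.jar"]},
        },
    },
)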

For a single small file

You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:

hdfs dfs -cp gs://<bucket>/<object> <hdfs path>

This works because

hdfs://<master node> 

is the default filesystem. You can explicitly specify the scheme and NameNode if desired:

hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>

For a large file or large directory of files

When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:

hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
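
If you want to trigger the DistCp step from Airflow rather than a shell, it can be submitted as a regular Dataproc Hadoop job, since DistCp's driver class is org.apache.hadoop.tools.DistCp. A hedged sketch (project, region, cluster, and paths are placeholders):

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Sketch only: runs DistCp on the cluster as a Hadoop job, so the copy
# happens in parallel on the workers rather than through any one machine.
distcp_gcs_to_hdfs = DataprocSubmitJobOperator(
    task_id="distcp_gcs_to_hdfs",
    project_id="my-project",
    region="europe-west1",
    job={
        "reference": {"project_id": "my-project"},
        "placement": {"cluster_name": "my-cluster"},
        "hadoop_job": {
            "main_class": "org.apache.hadoop.tools.DistCp",
            "args": ["gs://my-bucket/jars/", "hdfs:///jars/"],
        },
    },
)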

See [3] for details.

[1] https://pig.apache.org/docs/latest/cmds.html#fs

[2] https://pig.apache.org/docs/latest/cmds.html#sh

[3] https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html


3 Comments

Hi Pooja, thanks for your answer.
How to execute it using Airflow?
After executing hdfs commands on Dataproc as mentioned in the answer above, you need to use the Dataproc operators to run them from Airflow. For example, DataProcHadoopOperator starts a Hadoop job on a Cloud Dataproc cluster.
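
A minimal sketch of that suggestion with the old Airflow 1.x contrib operator (argument names follow the contrib API and should be checked against your Airflow version; paths are placeholders):

from airflow.contrib.operators.dataproc_operator import DataProcHadoopOperator

copy_with_distcp = DataProcHadoopOperator(
    task_id="copy_with_distcp",
    main_class="org.apache.hadoop.tools.DistCp",  # run DistCp as the Hadoop job
    arguments=["gs://my-bucket/jars/", "hdfs:///jars/"],
    cluster_name="my-cluster",
    region="europe-west1",
)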

I am not sure this needs to go through Airflow: if it is a one-time setup, you can run the commands directly on the Dataproc cluster. But I found some links which might be of some help. As I understand it, you can use the BashOperator to run such commands; a minimal sketch follows the links below.

https://big-data-demystified.ninja/2019/11/04/how-to-ssh-to-a-remote-gcp-machine-and-run-a-command-via-airflow/

Airflow Dataproc operator to run shell scripts
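
A hedged sketch of the BashOperator approach. It assumes the Airflow worker has an authenticated gcloud installed; the cluster, region, and paths are placeholders:

from airflow.operators.bash import BashOperator  # airflow.operators.bash_operator in Airflow 1.x

# Shell out to gcloud; the Pig `fs` trick from the answer above runs
# the HDFS copy on the cluster itself.
copy_jars = BashOperator(
    task_id="copy_jars_to_hdfs",
    bash_command=(
        "gcloud dataproc jobs submit pig "
        "--cluster=my-cluster --region=europe-west1 "
        "--execute 'fs -cp gs://my-bucket/my.jar /jars/my.jar'"
    ),
)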

