
I would like to run multiple Hive queries, preferably in parallel rather than sequentially, and store the output of each query in a CSV file. For example, query1 output in csv1, query2 output in csv2, etc. I would be running these queries after leaving work, with the goal of having output to analyze during the next business day. I am interested in using a bash shell script because I could then set up a cron task to run it at a specific time of day.

I know how to store the results of a HiveQL query in a CSV file, one query at a time. I do that with something like the following:

hive -e "SELECT * FROM db.table;" | tr "\t" "," > example.csv

The problem with the above is that I have to monitor when the process finishes and manually start the next query. I also know how to run multiple queries, in sequence, like so:

hive -f hivequeries.hql

Is there a way to combine these two methods? Is there a smarter way to achieve my goals?

Code answers are preferred since I do not know bash well enough to write it from scratch.

This question is a variant of another question: How do I output the results of a HiveQL query to CSV?

2 Answers

You can run and monitor parallel jobs in a shell script:

#!/bin/bash

#Run parallel processes and wait for their completion

#Add loop here or add more calls
hive -e "SELECT * FROM db.table1;" | tr "\t" "," > example1.csv &
hive -e "SELECT * FROM db.table2;" | tr "\t" "," > example2.csv &
hive -e "SELECT * FROM db.table3;" | tr "\t" "," > example3.csv &

#The ampersand at the end of each command runs it as a background process
#You can wrap the hive call in a function and do some logging in it, etc,
#and call that function as a background process in the same way
#Modify this script to fit your needs

#Now wait for all processes to complete

#Failed processes count
FAILED=0

for job in $(jobs -p)
do
   echo "job=$job"
   wait "$job" || let "FAILED+=1"
done

#Final status check
if [ "$FAILED" != "0" ]; then
    echo "Execution FAILED!  ($FAILED)"
    #Do something here, log or send a message, etc
    exit 1
fi

#Normal exit
#Do something else here
exit 0

There are other ways (xargs, GNU parallel) to run parallel processes in the shell, and plenty of resources on the topic. See also https://www.slashroot.in/how-run-multiple-commands-parallel-linux and https://thoughtsimproved.wordpress.com/2015/05/18/parellel-processing-in-bash/
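For example, here is a minimal xargs-based sketch. The `run_query` function is a hypothetical stand-in for the real `hive -e "..." | tr "\t" ","` pipeline, used only so the example is self-contained; replace its body with the actual hive call.

```shell
#!/bin/bash
# Hypothetical stand-in for `hive -e "..."` that prints tab-separated rows.
run_query() { printf 'id%s\tval%s\n' "$1" "$1"; }
export -f run_query

# -P 3 runs up to 3 queries at a time; each one writes its own CSV.
printf '%s\n' 1 2 3 |
  xargs -P 3 -I{} bash -c 'run_query "$1" | tr "\t" "," > "example$1.csv"' _ {}
```

The `export -f` is what makes the function visible inside the `bash -c` subshells that xargs spawns.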


2 Comments

Hi @leftjoin this looks promising and I will test it. Normally, I will have, say, 1 screen session with 3 windows, each window running a HiveQL query. The queries in each window finish at different rates. So, there are essentially 3 different pipelines. When running the above script, it will handle the 1st set of 3 queries. How would the script look if I wanted the 2nd set of queries to run? Does the first query in the 2nd set begin only after all of the 1st set queries are completed?
nevermind, it looks like your first reference answers these questions. Thank you.
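To make the batching raised in the comments concrete: a second set of queries can be made to start only after every job in the first set completes by putting a plain `wait` between the sets. A minimal sketch, again with a hypothetical `run_query` stand-in for the real hive pipeline:

```shell
#!/bin/bash
# Hypothetical stand-in for `hive -e "..." | tr "\t" ","`.
run_query() { echo "result$1"; }

# First set: three queries in parallel
run_query 1 > example1.csv &
run_query 2 > example2.csv &
run_query 3 > example3.csv &
wait   # blocks until every job in the first set has finished

# Second set starts only once the first set is done
run_query 4 > example4.csv &
run_query 5 > example5.csv &
run_query 6 > example6.csv &
wait
```

A bare `wait` with no arguments waits for all background jobs of the current shell, so no per-job bookkeeping is needed when you only care that the whole batch has finished.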

With GNU Parallel it looks like this:

doit() {
  id="$1"
  hive -e "SELECT * FROM db.table$id;" | tr "\t" "," > example"$id".csv
}
export -f doit
parallel --bar doit ::: 1 2 3 4

If your queries do not share the same template you can do:

queries.txt:
SELECT * FROM db.table1;
SELECT id,name FROM db.person;
... other queries ...

cat queries.txt | parallel --bar 'hive -e {} | tr "\t" "," > example{#}.csv'

Spend 15 minutes reading chapters 1+2 of https://doi.org/10.5281/zenodo.1146014 to learn the basics, and chapter 7 to learn more about how to run more jobs in parallel.
