
I would like to run multiple Hive queries, preferably in parallel rather than sequentially, and store the output of each query in a CSV file. For example, query1 output in csv1, query2 output in csv2, etc. I would be running these queries after leaving work, with the goal of having output to analyze during the next business day. I am interested in using a bash shell script because I could then set up a cron task to run it at a specific time of day.

I know how to store the results of a HiveQL query in a CSV file, one query at a time. I do that with something like the following:

hive -e "SELECT * FROM db.table;" | tr "\t" "," > example.csv

The problem with the above is that I have to monitor when the process finishes and manually start the next query. I also know how to run multiple queries, in sequence, like so:

hive -f hivequeries.hql

Is there a way to combine these two methods? Is there a smarter way to achieve my goals?

Code answers are preferred since I do not know bash well enough to write it from scratch.

This question is a variant of another question: How do I output the results of a HiveQL query to CSV?

2 Answers

You can run and monitor parallel jobs in a shell script:

#!/bin/bash

#Run parallel processes and wait for their completion

#Add loop here or add more calls
hive -e "SELECT * FROM db.table1;" | tr "\t" "," > example1.csv &
hive -e "SELECT * FROM db.table2;" | tr "\t" "," > example2.csv &
hive -e "SELECT * FROM db.table3;" | tr "\t" "," > example3.csv &

#The ampersand at the end of each command runs it as a background process
#You can wrap the hive call in a function and do some logging in it, etc,
#and call that function as a background process in the same way
#Modify this script to fit your needs

#Now wait for all processes to complete

#Failed processes count
FAILED=0

for job in $(jobs -p)
do
   echo "job=$job"
   wait "$job" || let "FAILED+=1"
done

#Final status check
if [ "$FAILED" != "0" ]; then
    echo "Execution FAILED!  ($FAILED)"
    #Do something here, log or send a message, etc
    exit 1
fi

#Normal exit
#Do something else here
exit 0

There are other ways (xargs, GNU parallel) to run parallel processes in the shell, and plenty of resources on the topic. See also https://www.slashroot.in/how-run-multiple-commands-parallel-linux and https://thoughtsimproved.wordpress.com/2015/05/18/parellel-processing-in-bash/
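For example, here is a minimal xargs-based sketch. The `run_query` function is a hypothetical stand-in for the real `hive -e "..." | tr "\t" ","` pipeline, used only so the example is self-contained; replace its body with the actual hive call.

```shell
#!/bin/bash
# Hypothetical stand-in for `hive -e "..."` that prints tab-separated rows.
run_query() { printf 'id%s\tval%s\n' "$1" "$1"; }
export -f run_query

# -P 3 runs up to 3 queries at a time; each one writes its own CSV.
printf '%s\n' 1 2 3 |
  xargs -P 3 -I{} bash -c 'run_query "$1" | tr "\t" "," > "example$1.csv"' _ {}
```

The `export -f` is what makes the function visible inside the `bash -c` subshells that xargs spawns.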


2 Comments

Hi @leftjoin this looks promising and I will test it. Normally, I will have, say, 1 screen session with 3 windows, each window running a HiveQL query. The queries in each window finish at different rates. So, there are essentially 3 different pipelines. When running the above script, it will handle the 1st set of 3 queries. How would the script look if I wanted the 2nd set of queries to run? Does the first query in the 2nd set begin only after all of the 1st set queries are completed?
nevermind, it looks like your first reference answers these questions. Thank you.
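To make the batching raised in the comments concrete: a second set of queries can be made to start only after every job in the first set completes by putting a plain `wait` between the sets. A minimal sketch, again with a hypothetical `run_query` stand-in for the real hive pipeline:

```shell
#!/bin/bash
# Hypothetical stand-in for `hive -e "..." | tr "\t" ","`.
run_query() { echo "result$1"; }

# First set: three queries in parallel
run_query 1 > example1.csv &
run_query 2 > example2.csv &
run_query 3 > example3.csv &
wait   # blocks until every job in the first set has finished

# Second set starts only once the first set is done
run_query 4 > example4.csv &
run_query 5 > example5.csv &
run_query 6 > example6.csv &
wait
```

A bare `wait` with no arguments waits for all background jobs of the current shell, so no per-job bookkeeping is needed when you only care that the whole batch has finished.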

With GNU Parallel it looks like this:

doit() {
  id="$1"
  hive -e "SELECT * FROM db.table$id;" | tr "\t" "," > example"$id".csv
}
export -f doit
parallel --bar doit ::: 1 2 3 4

If your queries do not share the same template you can do:

queries.txt:
SELECT * FROM db.table1;
SELECT id,name FROM db.person;
... other queries ...

cat queries.txt | parallel --bar 'hive -e {} | tr "\t" "," > example{#}.csv'

Spend 15 minutes reading chapters 1+2 of https://doi.org/10.5281/zenodo.1146014 to learn the basics, and chapter 7 to learn more about how to run more jobs in parallel.
