
I am trying to get the compression to work.

The original table is defined as:

create external table orig_table (col1 String ...... coln String) 
.
.
.
partitioned by (pdate string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( "separatorChar" = "|")
STORED AS TEXTFILE location '/user/path/to/table/';

The table orig_table has about 10 partitions with 100 rows each.

To compress it, I created a similar table, the only change being TEXTFILE to ORCFILE:

create external table orig_table_orc (col1 String ...... coln String) 
.
.
.
partitioned by (pdate string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( "separatorChar" = "|")
STORED AS ORCFILE location '/user/path/to/table/';

I am trying to copy the records across with:

set hive.exec.dynamic.partition.mode=nonstrict;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
-- I have tried other codecs as well, with the same error
set mapred.output.compression.type=RECORD;
insert overwrite table zip_test.orig_table_orc partition(pdate) select * from default.orig_table;

The error I get is:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col1":value ... "coln":value}
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:503)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176)
        ... 8 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
        at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:689)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:493)
        ... 9 more

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

The same thing works if I make the Hive table a SEQUENCEFILE, but not with ORC. Is there any workaround? I have seen a couple of questions with the same error, but in a Java program rather than in Hive QL.

1 Answer


Gaah! ORC is nothing like CSV!!!

Explaining what you did wrong would take a couple of hours and a good many book excerpts about Hadoop and about DB technology in general, so the short answer is: ROW FORMAT and SERDE do not make sense for a columnar format. And since you are populating that table from within Hive, it should be a "managed" table rather than an EXTERNAL one, IMHO.

create table orig_table_orc (col1 String ...... coln String)
partitioned by (pdate string)
stored as ORC
location '/where/ever/you/want'
tblproperties ("orc.compress"="ZLIB");
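
To repopulate it, a dynamic-partition insert like the one in the question should then work as-is; the mapred.output.compress settings are not needed, since ORC compresses its own files according to orc.compress. A minimal sketch, reusing the database and table names from the question:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- pdate is the partition column, so it is the last column produced by select *
insert overwrite table zip_test.orig_table_orc partition(pdate)
select * from default.orig_table;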

4 Comments

Thanks @Samson, I figured that out yesterday. I have now created an ORC table without the SerDe properties and it works fine. Could you maybe give some links or the books you refer to? I wouldn't mind spending a couple of days understanding.
Start with slideshare.net/oom65/orc-andvectorizationhadoopsummit (a bit old, does not cover recent features e.g. "streaming" inserts) then delve into cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
If you have demanding perf requirements, look into streever.atlassian.net/wiki/display/HADOOP/… (tuning ORC table for hot/cold data) and thinkbig.teradata.com/… (setting bytes.per.reducer to match your compression ratio)
For some background on the concepts behind the ORC design, and specifically the "stripes", read up on Infobright "data packs" and Netezza "zone maps" (and, incidentally, on the way Oracle Exadata does "smart scan").
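
For reference, the tuning hooks mentioned in these comments are ordinary table properties and session settings. A minimal sketch with purely illustrative values (SNAPPY and 256 MB here are assumptions, not recommendations):

-- switch the codec used for ORC files written from now on
alter table orig_table_orc set tblproperties ("orc.compress"="SNAPPY");

-- size reducers against the post-compression data volume (bytes)
set hive.exec.reducers.bytes.per.reducer=268435456;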
