
I am new to working with JSON data in Hive. I am working on a Spark application that receives JSON data and stores it in Hive tables. I have a JSON document like this:

[screenshot: JSON of JSONs, collapsed]

which looks like this when expanded:

[screenshot: expanded field hierarchy]

I am able to read the JSON into a DataFrame and save it to a location on HDFS. But getting Hive to read that data is the tough part.

After searching online for examples, I tried the following approach:

using a STRUCT for each of the JSON fields and then accessing the elements with column.element.

For example:

web_app_security will be the name of a column (of type STRUCT) inside the table, and the nested JSON objects in it, like config_web_cms_authentication and web_threat_intel_alert_external, will also be STRUCTs (with rating and rating_numeric as their fields).
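For reference, a minimal sketch of what one record presumably looks like, with the field names taken from the table definition below (the values are made up):

{
  "web_app_security": {
    "config_web_cms_authentication": { "rating": "good", "rating_numeric": 8.5 },
    "web_threat_intel_alert_external": { "rating": "fair", "rating_numeric": 6.0 },
    "web_http_security_headers": { "rating": "good", "rating_numeric": 7.2 },
    "rating": "good",
    "rating_numeric": 7.9
  }
}

with the other top-level sections (dns_security, email_security, etc.) following the same shape.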

I tried creating the table with the JSON SerDe. Here is my table definition:

CREATE EXTERNAL TABLE jsons (
web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, web_threat_intel_alert_external: struct<rating: string, rating_numeric: float>, web_http_security_headers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
dns_security struct<domain_hijacking_protection: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, dns_hosting_providers: struct<rating:string, rating_numeric: float>>,
email_security struct<rating: string, email_encryption_enabled: struct<rating: string, rating_numeric: float>, rating_numeric: float, email_hosting_providers: struct<rating: string, rating_numeric: float>, email_authentication: struct<rating: string, rating_numeric: float>>,
threat_intell struct<rating: string, threat_intel_alert_internal_3: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_1: struct<rating: string, rating_numeric: float>, rating_numeric: float,  threat_intel_alert_internal_12: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_6: struct<rating: string, rating_numeric: float>>,
data_loss struct<data_loss_6: struct<rating: string, rating_numeric: float>, rating: string, data_loss_36plus: struct<rating: string, rating_numeric: float>, rating_numeric: float,  data_loss_36: struct<rating: string, rating_numeric: float>, data_loss_12: struct<rating: string, rating_numeric: float>, data_loss_24: struct<rating: string, rating_numeric: float>>,
system_hosting struct<host_hosting_providers: struct<rating: string, rating_numeric: float>,  hosting_countries: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
defensibility struct<attack_surface_web_ip: struct<rating: string, rating_numeric: float>, shared_hosting: struct<rating: string, rating_numeric: float>, defensibility_hosting_providers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, attack_surface_web_hostname: struct<rating: string, rating_numeric: float>>,
software_patching struct<patching_web_cms: struct<rating: string, rating_numeric: float>, rating: string, patching_web_server: struct<rating: string, rating_numeric: float>, patching_vuln_open_ssl: struct<rating: string, rating_numeric: float>, patching_app_server: struct<rating: string, rating_numeric: float>, rating_numeric: float>,
governance struct<governance_customer_base: struct<rating: string, rating_numeric: float>, governance_security_certifications: struct<rating: string, rating_numeric: float>, governance_regulatory_requirements: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>
)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS orc
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis'

I've tried to parse the rows with the JSON SerDe. After saving some data to the table, I get the following error when I try to query it:

Error: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text (state=,code=0)

I am not sure if I am doing it the right way.

I am open to any other ways of storing the data into the table as well. Any help would be appreciated. Thank you.

1 Answer

That's because you're mixing ORC as the storage format (STORED AS orc) with JSON as the SerDe (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). The SerDe clause overrides ORC's default OrcSerde, but the input format (OrcInputFormat) and output format (OrcOutputFormat) are still ORC, so the JSON SerDe receives OrcStruct objects where it expects Text — hence the ClassCastException.

You either need to use ORC storage without overriding its default SerDe. In this case, make sure that your Spark application writes ORC files, not JSON.
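A minimal sketch of that first option, reusing the table name and location from the question (the column list is elided):

CREATE EXTERNAL TABLE jsons (
  ... -- same column definitions as in the question
)
STORED AS ORC
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';

and on the Spark side write ORC to that location, e.g. df.write.mode("overwrite").orc("hdfs://nameservice1/data/gis/final/rr_current_analysis").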

Or, if you want the data to be stored as JSON, use the JsonSerDe together with plain text files as the storage (STORED AS TEXTFILE).
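A sketch of that second option, again with the column list elided:

CREATE EXTERNAL TABLE jsons (
  ... -- same column definitions as in the question
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';

This expects one JSON document per line in the underlying files, which is exactly what df.write.json produces.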


The Hive Developer Guide explains how SerDes and storage formats work together: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe


5 Comments

Thanks for the answer. I tried to save my dataframe as a text file using df.rdd.saveAsTextFile("Path") but I get this error: `org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory already exists`. I am not sure why it is trying to create a new directory for every dataframe instead of creating a new file in the given path. Is there a better way of saving the dataframe as a text file? Or is there a way I can save the dataframe as CSV, give a proper table definition to read the CSV files, and use the JSON SerDe? @Sergey Khudyakov
@HemanthAnnavarapu take a look at df.write instead, and especially df.write.mode(SaveMode); there is a sketch after these comments. I don't know why you are referring to CSV files now, but I strongly recommend you read the Hive Developer Guide (link in the answer) first, and then the Spark DataFrame API documentation. The "better way" really depends on what you want to achieve, what kind of tables you want to have in Hive, etc.
I am trying CSV now because, when I searched for saving a dataframe as text, most of the examples refer to df.write.format("com.databricks.spark.csv").save("path"). Since df.rdd.saveAsTextFile doesn't work, I'm trying CSV. Is there a way I can still use the SerDe on CSV files? Or is it possible for me to override the input format by using the ORC format? @Sergey Khudyakov
I was able to get it working by saving the dataframe as JSON and not specifying STORED AS in the table definition.
@HemanthAnnavarapu good that you solved it. Hive uses TEXTFILE storage by default.
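For reference, a minimal sketch of the df.write approach discussed in these comments, following the asker's final solution (write JSON, let Hive default to TEXTFILE storage). The input path and application name are hypothetical:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("json-to-hive") // hypothetical application name
  .enableHiveSupport()
  .getOrCreate()

// Read the incoming JSON into a DataFrame (hypothetical input path).
val df = spark.read.json("hdfs://nameservice1/data/gis/incoming/")

// SaveMode.Overwrite replaces the output directory on each run, avoiding the
// FileAlreadyExistsException; use SaveMode.Append to add files to it instead.
df.write
  .mode(SaveMode.Overwrite)
  .json("hdfs://nameservice1/data/gis/final/rr_current_analysis")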
