Can you please help me understand the difference between Spark SQL and Hive?

1 Answer

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.

Built on top of Apache Hadoop, Hive provides the following features:

  • Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
  • Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
  • Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
  • A mechanism to impose structure on a variety of data formats.

Apache Spark, by contrast, is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing.

Spark SQL is a Spark module for structured data processing, with in-memory processing at its core. With Spark SQL, you can read data from many structured sources, such as JSON, CSV, Parquet, Avro, SequenceFiles, JDBC, and Hive.

Spark SQL can also read data from an existing Hive installation. Spark SQL is therefore the more general module: it can process structured data from many sources, Hive included.
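To make that concrete, here is a minimal PySpark sketch of one Spark SQL query that joins a JSON file with a table from an existing Hive installation. The function name, the file path, the table name, and the `customer_id` column are all hypothetical, chosen only for illustration, and the session passed in is assumed to have been created with Hive support enabled.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def join_json_with_hive(
    spark: SparkSession, json_path: str, hive_table: str
) -> DataFrame:
    """Join a JSON file with an existing Hive table in one SQL query.

    Assumes `spark` was built with Hive support, for example:
        SparkSession.builder.appName("demo").enableHiveSupport().getOrCreate()
    and that both sources have a `customer_id` column (hypothetical).
    """
    # Read the JSON file into a DataFrame; Spark infers the schema.
    events = spark.read.json(json_path)

    # Register the DataFrame under a name that SQL queries can refer to.
    events.createOrReplaceTempView("events")

    # The Hive table is referenced by name, resolved through the Hive metastore.
    query = (
        "SELECT e.*, h.* "
        "FROM events e "
        f"JOIN {hive_table} h ON e.customer_id = h.customer_id"
    )
    return spark.sql(query)
```

Calling something like `join_json_with_hive(spark, "/data/events.json", "sales.customers").show()` would then run the join on the Spark engine, with Hive serving only as the source of the table's data and metadata.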
