Can you please help me understand the difference between Spark SQL and Hive?
1 Answer
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage, which can be queried using SQL syntax.
Built on top of Apache Hadoop, Hive provides the following features:
- Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
- Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
- Sub-second query retrieval via Hive LLAP, Apache YARN, and Apache Slider.
- A mechanism to impose structure on a variety of data formats (see the sketch after this list).
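To make that last point concrete, here is a minimal sketch of a Hive table definition that imposes a schema on delimited files already sitting in HDFS. The table name, columns, and path are all hypothetical, and the DDL is issued through a Hive-enabled SparkSession only so every example in this answer stays in one language; in practice you would run the same HiveQL from the hive CLI or beeline.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: impose a schema on tab-delimited files in HDFS.
object HiveTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveTableSketch")
      .enableHiveSupport() // connect to the Hive metastore
      .getOrCreate()

    // The files themselves are untouched; Hive just records the structure.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        |  ip  STRING,
        |  ts  STRING,
        |  url STRING
        |)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        |LOCATION 'hdfs:///data/web_logs'""".stripMargin)

    spark.stop()
  }
}
```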
Apache Spark, on the other hand, is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing.
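As a taste of that high-level API, here is a minimal word-count sketch in Scala (the input path is hypothetical); the engine turns these few transformations into a distributed execution graph behind the scenes.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of Spark's high-level Dataset API.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountSketch").getOrCreate()
    import spark.implicits._ // encoders for String, tuples, etc.

    val counts = spark.read.textFile("hdfs:///data/sample.txt") // Dataset[String]
      .flatMap(_.split("\\s+"))   // split lines into words
      .groupByKey(identity)       // group identical words together
      .count()                    // Dataset[(String, Long)]

    counts.show()
    spark.stop()
  }
}
```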
Spark SQL is a Spark module for structured data processing, with in-memory processing at its core. Using Spark SQL, you can read data from any structured source, such as JSON, CSV, Parquet, Avro, sequence files, JDBC, Hive, etc.
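A minimal sketch of reading several such sources follows; the paths, JDBC URL, and table name are hypothetical. The point is that every source comes back as the same DataFrame abstraction.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: one read API across different structured sources.
object ReadSourcesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadSourcesSketch").getOrCreate()

    val fromJson    = spark.read.json("hdfs:///data/events.json")
    val fromCsv     = spark.read.option("header", "true").csv("hdfs:///data/users.csv")
    val fromParquet = spark.read.parquet("hdfs:///data/metrics.parquet")
    val fromJdbc    = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/shop")
      .option("dbtable", "orders")
      .load()

    // All four are DataFrames and support the same operations.
    fromJson.printSchema()
    spark.stop()
  }
}
```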
Spark SQL can also be used to read data from an existing Hive installation. Thus, Spark SQL is a generalized module that can process almost any structured data source.
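Here is a minimal sketch of that, reusing the hypothetical web_logs table from the Hive sketch above. Note that although the table lives in the Hive metastore, the query is planned and executed by Spark's engine, not by Hive.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: query an existing Hive table from Spark SQL.
object QueryHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("QueryHiveSketch")
      .enableHiveSupport() // reuse Hive's metastore as the catalog
      .getOrCreate()

    // Plain SQL against a Hive-managed table, executed by Spark.
    spark.sql("SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url")
      .show()

    spark.stop()
  }
}
```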