Spark SQL using Python: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Question

I want to test some basics with Spark SQL. I want to load a CSV file saved on my laptop and run a few SQL queries on it, but I cannot load the data using sqlContext. I get the error:

Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.

I am not using Hive, however.

I am using Windows 10 and installed Python using Anaconda. I installed Spark 2.0.2 prebuilt for Hadoop 2.6. I use IPython Notebook as the user interface.

My code is as follows:

file = "C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv"
df = sqlContext\
    .read \
    .format("com.databricks.spark.csv")\
    .option("header", "true")\
    .option("inferschema", "true")\
    .option("mode", "DROPMALFORMED")\
    .load(file)

The problem seems to lie in Spark SQL, since I can load the same file using

textFile=sc.textFile("C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv")

If I run an example from the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), I get the same error.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
df = spark.read.json("C:/Andra/spark-2.0.2-bin-hadoop2.6/examples/src/main/resources/people.json")

I was under the impression that I can use Spark SQL without Hive, since the data I am using is saved locally on my laptop. Furthermore, the same documentation as above implies just that:

"One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section."

There are also separate examples of creating a Spark session with Hive, so the example above would be useless if Hive were mandatory.

However, I wanted to configure Hive to see if this solves the problem. The documentation guide (https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables) states

"Configuration of Hive is done by placing your hive-site.xml , core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/."

I could not find those files, though.

So my questions are these:

Do I need Hive to use Spark SQL?
If not, what can I do to get Spark SQL working?
If yes, how can I configure it correctly, and where can I find the files needed?

Any help is appreciated! Thank you!

Here is the complete error statement:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-e50d7a8fb32b> in <module>()
      1 file = "C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv"
----> 2 df = sqlContext    .read     .format("com.databricks.spark.csv")    .option("header", "true")    .option("inferschema", "true")    .option("mode", "DROPMALFORMED")    .load(file)

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\pyspark\sql\readwriter.pyc in load(self, path, format, schema, **options)
    145         self.options(**options)
    146         if isinstance(path, basestring):
--> 147             return self._df(self._jreader.load(path))
    148         elif path is not None:
    149             if type(path) != list:

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\pyspark\sql\utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o110.load.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
    at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
    at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
    at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
    at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
    at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
    at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
    at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
    ... 33 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
    ... 39 more
Caused by: java.lang.NullPointerException
    at org.apache.thrift.transport.TSocket.open(TSocket.java:170)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
    ... 44 more

I hit the same problem. In my case, I was following the official quick guide and running both spark-shell and pyspark at the same time. After I exited spark-shell and restarted pyspark, it worked.
– Dagang Wei, Commented Oct 13, 2018 at 4:48

AEDWIP · Accepted Answer · 2018-04-02 18:48:54Z

8

I recently ran into the same problem. In my case I was running two Python Jupyter notebooks on my local computer at the same time. The first notebook worked fine. The second one kept throwing the dreaded

Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I am not sure how the permissions work. It seems that the first notebook to run somehow locks the local metastore. It makes sense that the metastore cannot be shared between two different sessions.

Maybe someone knows how to enable multiple notebooks?

Andy
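
One possible way around the locked metastore, sketched here rather than tested on the setup above: if the notebooks do not need Hive tables, each session can be asked to keep its catalog in memory, so it never opens the shared local metastore. The helper names session_conf and build_session are made up for this sketch. The setting spark.sql.catalogImplementation is a static Spark option, so it presumably has to be in place before the session is created (for a shell-launched notebook, that likely means passing it at launch with --conf).

```python
def session_conf(use_hive=False):
    """Return Spark config pairs for one notebook session.

    With use_hive=False the catalog is kept in memory, so the session
    should not need to open the local Hive metastore at all.
    """
    implementation = "hive" if use_hive else "in-memory"
    return {"spark.sql.catalogImplementation": implementation}


def build_session(app_name, conf):
    """Create (or reuse) a SparkSession with the given config pairs."""
    # Imported here so session_conf() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    # Assumption: no session exists yet in this kernel; an existing one
    # would be returned as-is and keep its original catalog setting.
    return builder.getOrCreate()
```

In a fresh kernel this would be used as spark = build_session("notebook-2", session_conf()). The trade-off is that tables saved through the Hive metastore are not visible to such a session.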


2 Comments

Asif Ali Over a year ago

After reading this, I assumed there's a lock on that file, restarted my system and it worked like a charm.

FlameStorm Over a year ago

I can confirm that a single notebook works fine. Magic! Opening multiple notebooks with running kernels still raises the error (02.2019, pyspark 2.3.2, Win7 x64, Anaconda, etc.).
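
To check whether a leftover lock might be involved, a small sketch along these lines could help. It assumes the default embedded Derby metastore, which Spark creates as a metastore_db directory in the working directory when no hive-site.xml is configured. The lock file names db.lck and dbex.lck and the helper name find_derby_locks are assumptions for illustration, not something confirmed in this thread.

```python
import os

# Assumption: lock files an embedded Derby database keeps in its directory.
DERBY_LOCK_FILES = ("db.lck", "dbex.lck")


def find_derby_locks(workdir=".", db_name="metastore_db"):
    """Return paths of Derby lock files found under workdir/db_name."""
    db_dir = os.path.join(workdir, db_name)
    candidates = [os.path.join(db_dir, name) for name in DERBY_LOCK_FILES]
    # Only report files that actually exist; a missing directory gives [].
    return [path for path in candidates if os.path.isfile(path)]
```

Calling find_derby_locks() from the directory where the notebook was started lists any lock files present. A lock file can be stale after a crash, so finding one does not prove that another session is still running.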

Ramakant · Accepted Answer · 2019-01-03 17:33:30Z

0

You should change the permissions of the /tmp/hive directory. On Linux, run chmod 777 /tmp/hive. Then restart the pyspark/hive shell.

That worked in my case.
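
The same permission change can be sketched from Python, for example in a notebook cell. This is only an illustration for Linux or another POSIX system; the helper name open_up_scratch_dir is made up here. On Windows, os.chmod only toggles the read-only flag, so this sketch would not be equivalent there.

```python
import os
import stat


def open_up_scratch_dir(path="/tmp/hive"):
    """Give everyone read/write/execute on path, like `chmod 777 path`.

    Returns the resulting permission bits so the change can be checked.
    """
    os.chmod(path, 0o777)
    return stat.S_IMODE(os.stat(path).st_mode)
```

On a POSIX system, oct(open_up_scratch_dir()) should then report 0o777 for /tmp/hive, provided the directory exists and the current user is allowed to change it.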


Kayf · Accepted Answer · 2019-05-03 20:43:22Z

0

I had the same "bug" today.

To use the same SparkSession from different notebooks, all of them need to use the same kernel (in JupyterLab, "kernel" > "change kernel", then select the same kernel for all notebooks).
