13

I am very new to Apache Spark.
I have already configured Spark 2.0.2 on my local Windows machine and have run the "word count" example. Now I have a problem executing SQL queries. I have searched for this, but am not getting proper guidance.

5
  • So, what's your problem? You're getting some error? Commented Nov 28, 2016 at 10:32
  • error: not found: value sqlContext Commented Nov 29, 2016 at 7:07
  • I am getting the above exception while running the below command Commented Nov 29, 2016 at 7:08
  • val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/mydb").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "mydb").option("user", "root").option("password", "").load() Commented Nov 29, 2016 at 7:08
  • 1
    not sure why it's down voted. I find this question helpful! Commented Nov 5, 2017 at 20:11

5 Answers

16

So you need to do the following to get this done.

In Spark 2.0.2 we have SparkSession, which contains the SparkContext instance as well as the sqlContext instance.

Hence the steps would be:

Step 1: Create SparkSession

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MyApp").master("local[*]").getOrCreate()
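
(A side note: the SparkSession created above also exposes the older entry points, which can help when following pre-2.0 examples; this is just a small sketch using the spark value from Step 1.)

val sc = spark.sparkContext           // the underlying SparkContext
val sqlContext = spark.sqlContext     // the legacy SQLContext, kept for backwards compatibility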

Step 2: Load the data from the database, in your case MySQL.

val loadedData = spark
      .read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydatabase")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "mytable")
      .option("user", "root")
      .option("password", "toor")
      .load()

loadedData.createOrReplaceTempView("mytable")

Step 3: Now you can run your SQL query just as you would against a SQL database.

val dataFrame = spark.sql("Select * from mytable")
dataFrame.show()

P.S.: It would be better to use the DataFrame API, or even better the Dataset API, but for those you need to go through the documentation.

Link to Documentation: https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.sql.Dataset
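
As a quick illustration, here is a rough sketch of the same query written with the DataFrame API instead of SQL; it assumes the loadedData DataFrame from Step 2, and the column name "id" is only a placeholder for whatever columns your table actually has:

import org.apache.spark.sql.functions.col

// Same result as spark.sql("Select * from mytable"), plus a filter;
// both forms compile down to the same plan.
val viaApi = loadedData
  .select("*")
  .filter(col("id") > 100)

viaApi.show()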


2 Comments

Could you add some arguments behind your suggestion to use the Dataset API? Otherwise the statement is just an opinion and should be ignored. Keep in mind that the SQL API is usually ahead (e.g. many higher order functions were introduced in Spark SQL 2.4, but they were not available to the Dataset API).
Assuming this way, the query will be fired from the driver, right? Also, the results will be loaded into the driver's memory, correct?
9

In Spark 2.x you no longer reference sqlContext, but rather spark, so you need to do:

spark
  .read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "mydb")
  .option("user", "root")
  .option("password", "")
  .load()

2 Comments

Assuming this way, the query will be fired from the driver, right? Also, the results will be loaded into the driver's memory, correct?
No, not correct. Data will be loaded by workers into worker memory as always. But I believe there are also settings for how many concurrent connections it should use.
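For reference, those settings are the standard JDBC partitioning options. A rough sketch, where the partition column "id" and the bounds are only placeholders for your own table:

// Spark opens up to numPartitions connections and splits the
// [lowerBound, upperBound] range of partitionColumn across them,
// so the read happens in parallel on the workers.
val partitionedRead = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "mydb")
  .option("user", "root")
  .option("password", "")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "4")
  .load()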
2

You should already have your Spark DataFrame.
Create a temp view from the DataFrame:

df.createOrReplaceTempView("dftable")
dfsql = spark.sql("select * from dftable")

You can use long queries in statement format:

sql_statement = """
select sensorid, objecttemp_c,
year(DateTime) as year_value,
month(DateTime) as month_value,
day(DateTime) as day_value,
hour(DateTime) as hour_value
from dftable
order by 1 desc
"""

dfsql = spark.sql(sql_statement)

Comments

0

It's rather simple now in Spark to run SQL queries. You can run SQL on DataFrames as others have pointed out, but the question is really how to run SQL.

spark.sql("SHOW TABLES;")

That's it.

6 Comments

With spark in ASA (Azure Synapse Analytics), there is "lazy loading" ... so spark.sql creates the dataframe, but does not yet execute. Execution waits until there's an output action. Does spark.sql immediately execute, in other implementations of spark?
spark.sql executes SQL. I believe the intention is to use it instead of transformations in Python. If you do the transformations in Python on DataFrames, it's going to use lazy evaluation. If you use SQL to join a few tables, do some calls, and write to a table, that is also built lazily, but the write is an action, so it gets executed. SQL is declarative; Python is not.
Thanks for the reply. I guess you meant that doing it in SQL, it would have eager eval? I got the point in any case.
SQL will still be evaluated lazily; Spark always is. What you won't have is the chance to notice the lazy evaluation, since you are including an action. Full SQL is not procedural: if I write "create this table", then it does it.
Ok, thanks. By "full sql", are you including what runs in an azure synapse analytics notebook cell, that starts with %%sql? Because my experience is that those %%sql cells execute immediately -- eager eval -- even if the notebook as a whole, is set to pyspark.
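To illustrate the point, a small sketch (the view name mytable and column id are only placeholders): spark.sql builds a plan lazily, and nothing runs until an action is called.

val lazyDf = spark.sql("SELECT * FROM mytable WHERE id > 100")  // only builds a logical plan
lazyDf.show()                                                   // action: the query actually executes here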
0

Executing SQL queries using spark.sql() or the Dataset API compiles to exactly the same code via the Catalyst optimiser at compile time and AQE at runtime. You can choose whichever you are comfortable with.
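
One way to convince yourself is to compare the plans of the two formulations; a minimal sketch assuming a registered view mytable with a column id:

import org.apache.spark.sql.functions.col

// Both should print the same physical plan after Catalyst/AQE.
spark.sql("SELECT id FROM mytable WHERE id > 100").explain()
spark.table("mytable").select("id").filter(col("id") > 100).explain()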

Comments
