
From apache-spark-makes-slow-mysql-queries-10x-faster

For long-running (i.e., reporting or BI) queries, it can be much faster, as Spark is a massively parallel system. MySQL can use only one CPU core per query, whereas Spark can use all cores on all cluster nodes. In my examples below, MySQL queries are executed inside Spark and run 5-10 times faster (on top of the same MySQL data).

It looks great, but I am not able to think of a practical example of a query that can be divided into subqueries so that multiple cores make it faster, instead of running it on one core.

  • It's all about the size of the data. If it's just selecting the top 10 rows, perhaps plain SQL would be faster. But when you are talking about fetching massive amounts of data in the form of a big table and then performing operations on it like joins, then plain old SQL would die of exhaustion. SparkSQL converts those operations into map-reduce jobs utilizing multiple cores and ends up performing faster. You wouldn't need a cannon to kill a fly, but you cannot drown a ship using a spatula! Commented Jun 9, 2017 at 13:49
  • @satnam That is exactly my question: how will operations like joins perform better in Spark than in a relational DB like MySQL, where we can also use an index, whereas in Spark we first have to perform extra work like sorting and searching (just an example)? Commented Jun 9, 2017 at 14:10

1 Answer


Let's say we have two tables, Customers and Orders, each with 100 million records.

Now suppose we have to join these two tables on the customer_id column to generate a report. This is close to impossible to do in MySQL, because a single machine has to perform the join over a huge volume of data.

On a Spark cluster we can repartition both tables on the join column. The data of both DataFrames is then distributed by hashing customer_id, which means that all the rows for a given customer, from both the Orders and Customers tables, end up on the same Spark worker node, and that node can perform a local join, as shown in the snippet below.

val customerDf = // load the Customers table
val orderDf = // load the Orders table
val df1 = customerDf.repartition($"customer_id")
val df2 = orderDf.repartition($"customer_id")
val result = df1.join(df2, df1("customer_id") === df2("customer_id"))

So this 100-million-record join is now performed in parallel across tens or hundreds of worker nodes, instead of being done on a single node as in the case of MySQL.
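The co-location idea can be illustrated without a cluster. Below is a minimal sketch in plain Scala (no Spark; the table contents, the `numPartitions` value, and all names are made up for illustration) showing that routing rows of both tables by the same hash of customer_id always puts matching rows in the same partition, so each partition can be joined locally:

```scala
// Sketch: hash partitioning co-locates rows of two tables.
// Both tables route each row by hash(customer_id) % numPartitions,
// so an order always lands in the same partition as its customer.
object HashPartitionDemo {
  val numPartitions = 4 // illustrative value

  def partitionFor(customerId: Int): Int =
    math.floorMod(customerId.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val customers = Seq(100 -> "Alice", 101 -> "Bob", 102 -> "Carol")
    val orders    = Seq(1 -> 100, 2 -> 102, 3 -> 100) // (order_id, customer_id)

    // Route rows of each table to partitions independently.
    val customerParts = customers.groupBy { case (id, _) => partitionFor(id) }

    // Every order ends up in the partition holding its customer,
    // so the join can be done locally per partition.
    for ((_, custId) <- orders)
      assert(customerParts(partitionFor(custId)).exists(_._1 == custId))
    println("all orders co-located with their customers")
  }
}
```

This is exactly what the repartition-then-join pattern above relies on: after repartitioning, no cross-node data movement is needed to match rows.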


5 Comments

Agreed that repartitioning by hashing the customer_id will help, but even then we won't be able to use a column index during df1.join(df2, df1("customer_id") === df2("customer_id")) the way we can in a DB, isn't it? I think that to make the join efficient, Spark has to do sorting (like an inverted index) and then join, which may be costly. Right?
Column indexes are helpful mostly for lookups. They are rarely used for joins, especially when you have to join all or most of the data in the table, so indexes will play little role here.
I believe that even when you have to join all or most of the data in the table, indexes will be used. Say you have picked the first order, which belongs to the customer with id 100. Now if customer_id is not indexed, the DBMS has to search for that customer iteratively. So I think an index will be helpful even in a full join.
Yeah, an index will be used here because you are doing a lookup (cust_id 100). But if there are no lookups and only pure joins, then no index will be used. Moreover, Spark supports the ORC and Parquet data formats, which support columnar data with indexes.
Thanks rogue-one. I think I was not clear in my previous comment. What I meant to say is: I want to fetch the names of all customers who placed orders. My query would be like select * from order, customer where order.cust_id = customer.cust_id. Now, assuming there is an index on the cust_id column of the customer table: say the DBMS has picked the first order from the order table, whose cust_id = 100. Then to execute the join (order.cust_id = customer.cust_id), i.e., to fetch cust_id 100 from the customer table, it will use the index created on the cust_id column of the customer table. Right?
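On the index discussion above: join engines (including Spark's shuffle hash join) typically build an in-memory hash table over one side once and then probe it per row of the other side, rather than doing a per-row index lookup. A minimal plain-Scala sketch of that idea (table contents and names are made up for illustration):

```scala
// Sketch of a hash join: build a hash table over the smaller side
// (customers), then probe it once per row of the larger side (orders).
// No B-tree index is needed; each probe is an O(1) hash lookup.
object HashJoinDemo {
  def main(args: Array[String]): Unit = {
    val customers = Map(100 -> "Alice", 101 -> "Bob", 102 -> "Carol") // build side
    val orders    = Seq((1, 100), (2, 102), (3, 100))                 // probe side

    val joined = for {
      (orderId, custId) <- orders
      name <- customers.get(custId) // hash probe, not an index scan
    } yield (orderId, custId, name)

    joined.foreach(println)
    assert(joined.size == 3)
    assert(joined.head == (1, 100, "Alice"))
  }
}
```

The build cost is paid once per table rather than once per row, which is why indexes matter little when all or most rows participate in the join.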
