
I have two huge tables in Hive, 'table 1' and 'table 2'. Both tables have a common column, 'key'.

I have queried 'table 1' with the desired conditions and created a DataFrame, 'df1'. Now I want to query 'table 2' and use a column from 'df1' in the where clause.

Here is the code sample:

val df1 = hiveContext.sql("select * from table1 limit 100")

Can I do something like

val df2 = hiveContext.sql("select * from table2 where key = df1.key")

Note: I don't want to make a single query that joins both tables.
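In case it clarifies what I'm after: since 'df1' is small (100 rows here), I was hoping for a lookup-style filter, something like the rough sketch below (the isin approach is just my own guess at an alternative, not tested code):

// Rough sketch of a lookup-style filter: collect the (small) key set
// to the driver, then filter table2 with isin instead of a join.
import org.apache.spark.sql.functions.col

val keys: Array[Any] = df1.select("key").distinct().collect().map(_.get(0))
val df2 = hiveContext.table("table2").where(col("key").isin(keys: _*))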

Any help will be appreciated.

Comments

• what you're asking for is a join :)

2 Answers


Since you have explicitly written that you do NOT want to join the tables, the short answer is: no, you cannot do such a query. The SQL string is just text, so it has no access to your Scala variable df1 (you would have to register it as a temp table first, and then you are effectively back to a join anyway).

I'm not sure why you don't want to do the join, but it is definitely needed if you want this query. If you are worried about joining two "huge tables", then don't be. Spark was built for this kind of thing :)
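For completeness, a minimal sketch of the join I mean (reusing the hiveContext and key column from your question):

// Minimal join sketch: keep only table2 rows whose key appears in df1.
val df1 = hiveContext.sql("select * from table1 limit 100")
val df2 = hiveContext.table("table2").join(df1.select("key").distinct(), "key")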


Comments

Thanks for the answer. I have been using Spark for more than a year now. I tried joining, but it results in 2 million+ tasks and the job takes ages to complete. That's why I wanted a join-free, lookup-style approach: the row count of the first DataFrame can be controlled.
Moreover, I tried repartitioning, but I am on 1.5 and you cannot repartition on a column; that was introduced in 1.6. I guess I have to use 1.6, repartition on the joining key, and then check how long the join takes.
I had partitioned both DataFrames (1000 partitions), so the new job has 1 million tasks... I left the code to run overnight; I just checked, and 34 tasks have completed in 7 hours, with 84 TB of data processed. That's why I did not want to read the entire data set and then join.
I'm thinking that 1000 partitions is not that much if you have 84 TB of data. Also, how big is your cluster and what kind of machine instances are you running on? Are you ONLY joining, or are you trying to do something else as well (like grouping or similar)? What does your launch command look like?
If the reference table is small, then it should be expressed as a broadcast join without a shuffle.

The solution that I found is the following.

Let me first give the dataset sizes.

Dataset1 - pretty small (10 GB)
Dataset2 - big (500 GB+)

There are two solutions for this kind of DataFrame join:

Solution 1: If you are using Spark 1.6+, repartition both DataFrames by the column on which the join has to be done. When I did this, the join completed in less than 2 minutes.

df.repartition(df("key"))
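For illustration, here is a slightly fuller sketch of Solution 1 applied to both sides (the df_small/df_big names match Solution 2 below; treat them as placeholders):

// Co-partition both sides on the join key (Spark 1.6+), so rows with
// the same key land in the same partitions before the join runs.
val left  = df_small.repartition(df_small("key"))
val right = df_big.repartition(df_big("key"))
val joined = right.join(left, "key")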

Solution 2: If you are not on Spark 1.6+ (though this works there too) and one dataset is small, cache it and broadcast it in the join:

import org.apache.spark.sql.functions.broadcast  // needed for the broadcast hint

df_small.cache()
df_big.join(broadcast(df_small), "key")

This was done in less than a minute.
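As an aside (my own addition, not part of the timing above): Spark can also pick a broadcast join automatically when the small side's estimated size is below spark.sql.autoBroadcastJoinThreshold. A sketch of raising that threshold:

// Assumption: the threshold is in bytes; tables estimated below it are
// broadcast automatically, so no explicit broadcast() hint is needed.
hiveContext.setConf("spark.sql.autoBroadcastJoinThreshold", (1024L * 1024 * 1024).toString) // ~1 GB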

