
I have two huge tables in Hive, 'table 1' and 'table 2'. Both tables have a common column, 'key'.

I have queried 'table 1' with the desired conditions and created a DataFrame, 'df1'. Now I want to query 'table 2' and use a column from 'df1' in the where clause.

Here is the code sample:

val df1 = hiveContext.sql("select * from table1 limit 100")

Can I do something like

val df2 = hiveContext.sql("select * from table2 where key = df1.key")

Note: I don't want to make a single query that joins both tables.
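In case it clarifies what I'm after: since 'df1' is small (100 rows here), I was hoping for a lookup-style filter, something like the rough sketch below (the isin approach is just my own guess at an alternative, not tested code):

// Rough sketch of a lookup-style filter: collect the (small) key set
// to the driver, then filter table2 with isin instead of a join.
import org.apache.spark.sql.functions.col

val keys: Array[Any] = df1.select("key").distinct().collect().map(_.get(0))
val df2 = hiveContext.table("table2").where(col("key").isin(keys: _*))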

Any help will be appreciated.

Comments

• what you're asking for is a join :)

2 Answers


Since you have explicitly written that you do NOT want to join the tables, the short answer is: no, you cannot do such a query. The SQL string is just text, so it has no access to your Scala variable df1 (you would have to register it as a temp table first, and then you are effectively back to a join anyway).

I'm not sure why you don't want to do the join, but it is definitely needed if you want this query. If you are worried about joining two "huge tables", then don't be. Spark was built for this kind of thing :)
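For completeness, a minimal sketch of the join I mean (reusing the hiveContext and key column from your question):

// Minimal join sketch: keep only table2 rows whose key appears in df1.
val df1 = hiveContext.sql("select * from table1 limit 100")
val df2 = hiveContext.table("table2").join(df1.select("key").distinct(), "key")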


Comments

Thanks for the answer. I have been using Spark for more than a year now. I tried joining, but it results in 2 million+ tasks and the job takes ages to complete. That's why I wanted a join-free, lookup-style approach: the row count of the first DataFrame can be controlled.
Moreover, I tried repartitioning, but I am on 1.5 and you cannot repartition on a column; that was introduced in 1.6. I guess I have to use 1.6, repartition on the joining key, and then check how long the join takes.
I had partitioned both DataFrames (1000 partitions), so the new job has 1 million tasks... I left the code to run overnight; I just checked, and 34 tasks have completed in 7 hours, with 84 TB of data processed. That's why I did not want to read the entire data set and then join.
I'm thinking that 1000 partitions is not that much if you have 84 TB of data. Also, how big is your cluster and what kind of machine instances are you running on? Are you ONLY joining, or are you trying to do something else as well (like grouping or similar)? What does your launch command look like?
If the reference table is small, then it should be expressed as a broadcast join without a shuffle.

The solution that I found is the following.

Let me first give the dataset sizes.

Dataset1 - pretty small (10 GB)
Dataset2 - big (500 GB+)

There are two solutions for this kind of DataFrame join:

Solution 1: If you are using Spark 1.6+, repartition both DataFrames by the column on which the join has to be done. When I did this, the join completed in less than 2 minutes.

df.repartition(df("key"))
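For illustration, here is a slightly fuller sketch of Solution 1 applied to both sides (the df_small/df_big names match Solution 2 below; treat them as placeholders):

// Co-partition both sides on the join key (Spark 1.6+), so rows with
// the same key land in the same partitions before the join runs.
val left  = df_small.repartition(df_small("key"))
val right = df_big.repartition(df_big("key"))
val joined = right.join(left, "key")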

Solution 2: If you are not on Spark 1.6+ (though this works there too) and one dataset is small, cache it and broadcast it in the join:

import org.apache.spark.sql.functions.broadcast  // needed for the broadcast hint

df_small.cache()
df_big.join(broadcast(df_small), "key")

This was done in less than a minute.
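As an aside (my own addition, not part of the timing above): Spark can also pick a broadcast join automatically when the small side's estimated size is below spark.sql.autoBroadcastJoinThreshold. A sketch of raising that threshold:

// Assumption: the threshold is in bytes; tables estimated below it are
// broadcast automatically, so no explicit broadcast() hint is needed.
hiveContext.setConf("spark.sql.autoBroadcastJoinThreshold", (1024L * 1024 * 1024).toString) // ~1 GB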

