2

I am implementing a left join using MapReduce. The left side has around 600 million records and the right side has around 23 million. In the mapper, I build the key from the columns used in the left join condition and pass the key-value output from the mapper to the reducer. I am hitting a performance issue because, for a few mapper keys, the number of values in both tables is high (e.g. 456789 and 78960 respectively). Even after the other reducers have finished their work, these reducers keep running for much longer. Is there any way for multiple reducers to work on the same key's mapper output in parallel to improve performance?

This is the Hive query that I want to optimize:

select distinct 
        a.sequence, 
        a.fr_nbr, 
        b.to_nbr, 
        a.fr_radius,
        a.fr_zip, 
        a.latitude as fr_latitude, 
        a.longitude as fr_longitude, 
        a.to_zip, 
        b.latitude as to_latitude, 
        b.longitude as to_longitude,
        ((2 * asin( sqrt( cos(radians(a.latitude)) * cos(radians(b.latitude)) * pow(sin(radians((a.longitude - b.longitude)/2)), 2) + pow(sin(radians((a.latitude - b.latitude)/2)), 2) ) )) * 6371 * 0.621371) as distance,
        a.load_year, 
        a.load_month
from common.sb_p1 a LEFT JOIN common.sb__temp0u b    
        on a.to_zip=b.zip
            and a.load_year=b.load_year
            and a.load_month=b.load_month
where   b.correction = 0 
        and a.fr_nbr <> b.to_nbr 
        and ((2 * asin( sqrt( cos(radians(a.latitude)) * cos(radians(b.latitude)) * pow(sin(radians((a.longitude - b.longitude)/2)), 2) + pow(sin(radians((a.latitude - b.latitude)/2)), 2) ) )) * 6371 * 0.621371 <= a.fr_radius)

Any other solution will also be appreciated.

3
  • What type of join are you doing? Map-side (replicated) or reduce-side (repartition)? Commented Oct 18, 2016 at 5:55
  • If you know your keys, you can write a custom partitioner for better performance. E.g.: if key.value<78960 .... else .... tutorialspoint.com/map_reduce/map_reduce_partitioner.htm Commented Oct 18, 2016 at 6:34
  • @Nicomak I am using a reduce-side join. Commented Oct 18, 2016 at 14:43
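
The custom-partitioner idea from the comments can be sketched as below. This is only a plain-Python simulation of the routing logic, not Hadoop code; in a real job the same decision would live in a subclass of Hadoop's `Partitioner` that overrides `getPartition`. The names `partition`, `stable_hash` and `hot_keys`, and the `zip|year|month` key layout, are assumptions made for illustration. Known hot keys each get a dedicated reducer, and all other keys are hashed over the remaining reducers.

```python
import zlib
from typing import Sequence


def stable_hash(key: str) -> int:
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8"))


def partition(key: str, num_reducers: int, hot_keys: Sequence[str]) -> int:
    """Return the reducer index for a join key (hypothetical routing rule)."""
    if num_reducers <= len(hot_keys):
        raise ValueError("need more reducers than hot keys")
    if key in hot_keys:
        # Each known hot key gets its own dedicated reducer: 0, 1, ...
        return list(hot_keys).index(key)
    # Every other key is hashed over the reducers that are left.
    regular = num_reducers - len(hot_keys)
    return len(hot_keys) + stable_hash(key) % regular


# Assumed key layout: to_zip|load_year|load_month
hot = ["75001|2016|10", "10001|2016|10"]
print(partition("75001|2016|10", 8, hot))  # dedicated reducer for the first hot key
print(partition("94105|2016|9", 8, hot))   # one of the shared reducers 2..7
```

Note that a partitioner alone only keeps a hot key from sharing a reducer with other keys. All values for one key still arrive at a single reducer, so it does not by itself let several reducers work on the same key.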

2 Answers

1

Split the skewed keys using UNION ALL:

select * from table1 a left join table2 b on a.key=b.key
where a.key not in (456789,78960)
union all
select * from table1 a left join table2 b on a.key=b.key
where a.key = 456789
union all
select * from table1 a left join table2 b on a.key=b.key
where a.key = 78960
;

These subqueries will run in parallel, so the skewed keys will not all be sent to a single reducer.
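
For the custom MapReduce route the question asks about, a common way to let several reducers share one hot key is key salting. The following is a minimal in-memory sketch of that technique in plain Python, not the asker's job and not Hadoop code; `salted_left_join`, `hot_keys` and `num_salts` are hypothetical names. Left-side rows for a hot key are spread over `num_salts` sub-keys, and right-side rows for that key are copied to every sub-key, so each `(key, salt)` group can be reduced independently while still producing the same left-join output.

```python
from collections import defaultdict
from itertools import count


def salted_left_join(left, right, hot_keys, num_salts):
    """Simulate a reduce-side left join with salted hot keys.

    left, right: iterables of (key, value) pairs.
    Returns a list of (key, left_value, right_value_or_None).
    """
    # (key, salt) -> ([left values], [right values]); one entry per reduce group.
    groups = defaultdict(lambda: ([], []))
    round_robin = count()

    # Map phase, left side: a hot-key row goes to exactly one salt bucket.
    for key, lval in left:
        salt = next(round_robin) % num_salts if key in hot_keys else 0
        groups[(key, salt)][0].append(lval)

    # Map phase, right side: hot-key rows are replicated to every salt bucket.
    for key, rval in right:
        salts = range(num_salts) if key in hot_keys else (0,)
        for salt in salts:
            groups[(key, salt)][1].append(rval)

    # Reduce phase: each (key, salt) group is joined on its own.
    out = []
    for (key, _salt), (lvals, rvals) in groups.items():
        for lval in lvals:
            if rvals:
                out.extend((key, lval, rval) for rval in rvals)
            else:
                out.append((key, lval, None))  # left join keeps unmatched rows
    return out


left_rows = [("z1", i) for i in range(6)] + [("z2", 100), ("z3", 200)]
right_rows = [("z1", "a"), ("z1", "b"), ("z2", "c")]
joined = salted_left_join(left_rows, right_rows, hot_keys={"z1"}, num_salts=3)
print(len(joined))
```

The trade-off is that right-side rows for each hot key are shuffled `num_salts` times, so the salt count is best kept small and applied only to keys known to be skewed.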

0

You can also consider using HiveQL for this. It's pretty much meant for situations like the one you have described, and it takes care of the complexity of the MapReduce implementation.

1 Comment

Currently I am using HiveQL, and it's taking around 48 to 50 hours to finish. That was the reason I wanted to try a custom MapReduce program.
