Run multiple reducers on single output from mapper

Question

I am implementing a left join using MapReduce. The left side has around 600 million records and the right side has around 23 million. In the mapper I build the keys from the columns used in the left join condition and pass the key-value output to the reducer. I am running into a performance problem because, for a few mapper keys, the number of values in both tables is high (e.g. 456789 and 78960 respectively). Even after the other reducers have finished, these reducers keep running for much longer. Is there any way for multiple reducers to work on the same key-value output from the mapper in parallel to improve performance?

This is the Hive query that I want to optimize:

select distinct 
        a.sequence, 
        a.fr_nbr, 
        b.to_nbr, 
        a.fr_radius,
        a.fr_zip, 
        a.latitude as fr_latitude, 
        a.longitude as fr_longitude, 
        a.to_zip, 
        b.latitude as to_latitude, 
        b.longitude as to_longitude,
        ((2 * asin( sqrt( cos(radians(a.latitude)) * cos(radians(b.latitude)) * pow(sin(radians((a.longitude - b.longitude)/2)), 2) + pow(sin(radians((a.latitude - b.latitude)/2)), 2) ) )) * 6371 * 0.621371) as distance,
        a.load_year, 
        a.load_month
from common.sb_p1 a LEFT JOIN common.sb__temp0u b    
        on a.to_zip=b.zip
            and a.load_year=b.load_year
            and a.load_month=b.load_month
where   b.correction = 0 
        and a.fr_nbr <> b.to_nbr 
        and ((2 * asin( sqrt( cos(radians(a.latitude)) * cos(radians(b.latitude)) * pow(sin(radians((a.longitude - b.longitude)/2)), 2) + pow(sin(radians((a.latitude - b.latitude)/2)), 2) ) )) * 6371 * 0.621371 <= a.fr_radius)

Any other solution would also be appreciated.

What type of join are you doing? Map-side (replicated) or reduce-side (repartition)?
– Nicomak, Commented Oct 18, 2016 at 5:55
If you know your keys, you can write a custom partitioner for better performance. For example: if key.value < 78960 .... else .... tutorialspoint.com/map_reduce/map_reduce_partitioner.htm
– pckmn, Commented Oct 18, 2016 at 6:34

leftjoin · Accepted Answer · 2016-10-18 09:03:31Z

1

Split the skewed keys using UNION ALL:

select * from table1 a left join table2 b on a.key=b.key
where a.key not in (456789,78960)
union all
select * from table1 a left join table2 b on a.key=b.key
where a.key = 456789
union all
select * from table1 a left join table2 b on a.key=b.key
where a.key = 78960
;

These subqueries run in parallel, so the skewed keys are not all sent to a single reducer.

answered Oct 18, 2016 at 9:03

leftjoin


Pushkin · Accepted Answer · 2016-10-18 05:45:43Z

0

You can also consider using HiveQL for this. It's pretty much meant for situations like the one you have described, and it takes care of the complexity of a MapReduce implementation.

answered Oct 18, 2016 at 5:45

Pushkin


Sumit Bharati Over a year ago

Currently I am using HiveQL and it's taking around 48 to 50 hours to finish. That was the reason I wanted to try it in a custom MapReduce program.
