Difference between map side join and normal join

 



This figure represents the normal join operation. When we apply normal join operation, the job will be compiled to a MapReduce task which involves a map stage and a reduce stage. A mapper will read from join tables and emits the join key and join value pair into an intermediate file. Hadoop sorts and merges these pairs in the shuffle stage. The reducer takes the sorted results as input and does the actual join work.

The shuffle stage is really expensive since it needs to sort and merge. So there is a necessity to save the shuffle and reduce stages which will also improve the task performance. Hence map side joins were introduced. They save the shuffle and reduce stages by doing the join work only in the map stage. By doing so, when one of the join tables is small enough to fit into the memory, all the mappers can hold the data in memory and do the join work there itself. In this way all the join operations are finished in the map stage itself which even reduced the time to complete the job.

Note that map side joins will be suitable when one of the tables for which you perform join operation should be small.  

 

 

Simple Example for Map Reduce Joins:

 

create table emp ( name string, id bigint, deptid bigint) row format delimited fields terminated by',';

 

create table dept( deptname string, deptid bigint) row format delimited fields terminated by',';

 

load data local inpath '/home/cloudera/employee.txt' overwrite into table emp;

 

 

load data local inpath '/home/cloudera/dept.txt' overwrite into table dept;

 

select /*+ MAPJOIN(dept) */ emp.name, dept.deptname

from emp join dept on emp.deptid=dept.deptid;

 

 

select emp.name, dept.deptname

from emp join dept on emp.deptid=dept.deptid;

 

The content of file employee.txt is

 

pranav,1,1

akshay,2,1

ananth,3,2

akash,4,2

anand,6,2

 

 

The content of file dept.txt is

 
support,1
sales,2

 

In the above example we have two tables one is ‘emp’ which stores the name, id and department id of the employee and the other is ‘dept’ which stores the name and id of the department. So I want to find which employee is present in which department. To show the difference I have performed the map join and join on the tables. While executing both joins you can find the difference that the job is completed in a less amount of time and also  without reducer in map side join whereas the normal join executes this job with the help of one reducer at least.