Buckets(or Clusters):


Buckets should be based upon as Queries to be used in Hive. Buckets are used to reduce the number of input files for processing your query. 

 

Tables/Partitions can be further subdivided into Clusters or Buckets.


Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.

For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table.


Bucketing is helpful for 2 reasons :


•Enables more efficient queries

•Makes sampling more efficient

•Hash(column) MOD(number of buckets) –evenly distributed


Code:

CREATE TABLE students_bucket(name STRING,id INT,college STRING) PARTITIONED BY(country STRING) CLUSTERED BY (college) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ ;

set hive.enforce.bucketing=true;

INSERTING

INSERT OVERWRITE TABLE employees_bucketPARTITION(country=‘USA’) SELECT name,id,college from college.students_external;

SAMPLING

SELECT * FROM students_bucket TABLESAMPLE(BUCKET 2 OUT OF 4);


Let discuss one more example :


When you do bucketing on your data based on particular column(userid), your data is stored in each bucket file is based on hashing of key modulus no of buckets.


For example, if you have 10 records in data in the following format.(original )


1,abhishek

2,kumar

3,siva

4,senthil2

5,abhi

6,vignesh

7,peter

8,stefen

9,doug

10,alan


if you do buckets as 4 on the above data, your (partition) dir in hdfs will have four files wrt each bucket

bucket-0 will contain data

4,senthil2

8,stefen

buckte-1 will contain data

1,abhishek

5,abhi

9,doug


It goes based on remainder of your userid


Coming to Query. "Select * from table1 where userid = 4


How many files will be processed?


 Only 1 ie. bucket-0 file


It turn we reduce the number of files for MR using Hive.


We can do bucketing on more number of columns based on frequency of the columns in where clause of your queries.


Note: used 10 records just for explanation onlyBuckets can be used even without partition.


We hope this will help you to understand the concept.