What is a SerDe.


The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

while selecting the data from Apache Hive SerDe.deserialize() method is called and while inserting the data SerDe.serialize()method is called.Initialize() method is called only once and gathers some commonly-used pieces of information from the table properties, such as the column names and types.

In a SerDe, the serialize() method takes a Java object representing a row of data, and converts that object into a serialized representation of the row. The serialized class is determined by the return type of getSerializedClass(). In the JSONSerDe, the serialize() method converts the object into a JSON string represented by a Text object. 

 

Also have a look on the attached document for your reference.

 

We also have a sample project using Hive Serde.

 

Sample Project :  https://edureka.wistia.com/medias/gxot80fo6o

 

Solution:

 

Step:1

 

Replace semicolon with tab and remove double quotes in all the three datasets (BX-Books.csv,BX-Book-Ratings.csv,BX-Users.csv) and move the files to edureka vm

 

Step:2

 

Create a database as "Project"

create database project;

 

Step:3

 

Add serde to remove semicolon and double quote from dataset

add jar /home/edureka/csv-serde-0.9.1.jar;

 

Step:4

 

Create tables "books","book_ratings","users" to store the data from BX-Books.csv,BX-Book-Ratings.csv,BX-Users.csv respectively.

create table books(

isbn string,

book_title string,

book_author string,

year_of_publication string,

publisher string,

image_url_s string,

image_url_m string,

image_url_l string) 

row format serde 'com.bizo.hive.serde.csv.CSVSerde'

with serdeproperties(

"separatorChar" = "\;",

"quoteChar" = "\"")

stored as textfile;

create table book_ratings(

user_id string,

isbn string,

book_rating string)

row format serde 'com.bizo.hive.serde.csv.CSVSerde'

with serdeproperties(

"separatorChar" = "\;",

"quoteChar" = "\"")

stored as textfile;

create table users(

user_id string,

location string,

age string)

row format serde 'com.bizo.hive.serde.csv.CSVSerde'

with serdeproperties(

"separatorChar" = "\;",

"quoteChar" = "\"")

stored as textfile;

 

Step:5

 

Load all the three datasets (BX-Books.csv,BX-Book-Ratings.csv,BX-Users.csv) to hdfs

load data local inpath '/home/edureka/BX-Books.csv' into table books;

load data local inpath '/home/edureka/BX-Book-Ratings.csv' into table book_ratings;

load data local inpath '/home/edureka/BX-Users.csv' into table users;

Step:6

 

Finding out the frequency of books published each year

select year_of_publication,count(1)

from books

group by year_of_publication;

 

Step:7

 

Finding out in which year maximum number of books has been sold

select b.year_of_publication,count(1) Number_of_books_sold

from books b

join book_ratings br on b.isbn = br.isbn

group by b.year_of_publication

order by Number_of_books_sold desc limit 1;

 

Step:8

 

Finding out how many books were published based on ranking in the year 2002

select b.year_of_publication, br.book_rating, count(1)

from books b

join book_ratings br on b.isbn = br.isbn

group by b.year_of_publication,br.book_rating

having b.year_of_publication = 2002;


Please try and let us know if you have any other issue so we can help you out here.
180372