Infosys Hadoop interview questions


What is Distribute By in Apache Hive?

Cluster By and Distribute By are used mainly with Transform/Map-Reduce scripts. They are also sometimes useful in SELECT statements when the output of a query needs to be partitioned and sorted for subsequent queries.
Cluster By is a short-cut for both Distribute By and Sort By.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same values in the Distribute By columns go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
For example, we are Distributing By x on the following 5 rows to 2 reducers:
x1
x2
x4
x3
x1
Reducer 1 got
x1
x2
x1
Reducer 2 got
x4
x3
Note that all rows with the same key x1 are guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.
In contrast, if we use Cluster By x, the two reducers will further sort rows on x:
Reducer 1 got
x1
x1
x2
Reducer 2 got
x3
x4
Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.
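The reducer behaviour described above can be mimicked with a small Python sketch. It is only an illustration, not how Hive is implemented: the `partition` function is a hypothetical stand-in chosen to reproduce the assignment in the example (Hive picks the reducer from a hash of the Distribute By columns), and the helper names `distribute_by`, `sort_by` and `cluster_by` are invented here.

```python
from collections import defaultdict

def partition(key, num_reducers):
    # Hypothetical partitioner chosen to reproduce the example above:
    # x1 and x2 go to reducer 0, x3 and x4 go to reducer 1.
    # Hive itself derives the reducer from a hash of the Distribute By columns.
    return (int(key[1:]) - 1) // 2 % num_reducers

def distribute_by(rows, key, num_reducers):
    """Send every row to the reducer chosen by its key, keeping arrival order."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[partition(key(row), num_reducers)].append(row)
    return dict(reducers)

def sort_by(reducers, key):
    """Sort rows inside each reducer; nothing is ordered across reducers."""
    return {r: sorted(rows, key=key) for r, rows in reducers.items()}

def cluster_by(rows, key, num_reducers):
    """Cluster By behaves like Distribute By plus Sort By on the same columns."""
    return sort_by(distribute_by(rows, key, num_reducers), key)

rows = ["x1", "x2", "x4", "x3", "x1"]
identity = lambda row: row

distributed = distribute_by(rows, identity, 2)
clustered = cluster_by(rows, identity, 2)
print(distributed)  # same keys share a reducer, but are not adjacent
print(clustered)    # same keys share a reducer and are sorted within it

# Distribute By one column, Sort By a longer list of columns (x, then y).
pairs = [("x1", 3), ("x2", 1), ("x1", 2)]
by_x_then_y = sort_by(distribute_by(pairs, lambda r: r[0], 2), lambda r: (r[0], r[1]))
print(by_x_then_y)
```

The last call shows the case where the partition column (x) is a prefix of the sort columns (x, y), which is the usual combination of Distribute By and Sort By.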

What is partitioning in Hive?

On a table without partitions, even a simple Hive query reads the entire dataset, even when it has a WHERE clause filter. This becomes a bottleneck when running MapReduce jobs over a large table. We can overcome this issue by implementing partitions in Hive. Hive makes it very easy to implement partitions by using the automatic partition scheme when the table is created.
In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried, thereby reducing the I/O and time required by the query.
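The directory layout and the pruning step can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the table location `/warehouse/sales`, the partition columns `country` and `year`, and the helper names are all hypothetical, and real pruning is done by Hive's query planner rather than by code like this. The `column=value` sub-directory naming follows Hive's convention.

```python
from pathlib import PurePosixPath

def partition_path(table_dir, partition_spec):
    """Build the sub-directory for one partition, one column=value level per column."""
    path = PurePosixPath(table_dir)
    for column, value in partition_spec.items():
        path = path / f"{column}={value}"
    return str(path)

def prune(partitions, predicate):
    """Keep only the partitions whose column values satisfy the equality filter."""
    return [
        p for p in partitions
        if all(p.get(column) == value for column, value in predicate.items())
    ]

table_dir = "/warehouse/sales"  # hypothetical table location on HDFS
partitions = [
    {"country": "US", "year": 2019},
    {"country": "US", "year": 2020},
    {"country": "IN", "year": 2020},
]

# Equivalent in spirit to a filter such as: WHERE country = 'US'
to_scan = [partition_path(table_dir, p) for p in prune(partitions, {"country": "US"})]
print(to_scan)
```

Only two of the three sub-directories are selected for reading, which is where the I/O saving comes from.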

What is Oozie, and how is it configured?


Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. ... Oozie is a scalable, reliable and extensible system.
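To make the "DAG of actions" idea concrete, here is a minimal Python sketch of how a set of dependent actions can be put into a valid execution order. It does not use Oozie or its XML workflow format: the action names and the `execution_order` helper are hypothetical, and the ordering logic is a plain topological sort (Kahn's algorithm), shown only to illustrate why the graph must be acyclic.

```python
def execution_order(workflow):
    """Return actions in an order that respects every dependency (Kahn's algorithm)."""
    pending = {action: set(deps) for action, deps in workflow.items()}
    order = []
    while pending:
        # Actions with no unfinished dependencies can run now.
        ready = sorted(action for action, deps in pending.items() if not deps)
        if not ready:
            raise ValueError("cycle or unknown dependency: not a valid DAG")
        for action in ready:
            order.append(action)
            del pending[action]
        for deps in pending.values():
            deps.difference_update(ready)
    return order

# Hypothetical workflow: each action maps to the actions it must wait for.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"clean"},
    "publish": {"aggregate", "report"},
}

order = execution_order(workflow)
print(order)
```

If two actions depended on each other, no action would ever become ready and the function would raise an error, which is the reason a workflow has to be acyclic.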
