Infosys Hadoop interview questions

Infosys Hadoop interview questions


What is what is distributed by in apache hive?

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.
Cluster By is a short-cut for both Distribute By and Sort By.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
For example, we are Distributing By x on the following 5 rows to 2 reducers:
x1
x2
x4
x3
x1
Reducer 1 got
x1
x2
x1
Reducer 2 got
x4
x3
Note that all rows with the same key x1 is guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.
In contrast, if we use Cluster By x, the two reducers will further sort rows on x:
Reducer 1 got
x1
x1
x2
Reducer 2 got
x3
x4
Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

What is Hive's Partitioning

A simple query in Hive reads the entire dataset even if we have where clause filter. This becomes a bottleneck for running MapReduce jobs over a large table. We can overcome this issue by implementing partitions in Hive. Hive makes it very easy to implement partitions by using the automatic partition scheme when the table is created.
In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried, thereby reducing the I/O and time required by the query.

What is oozie ? and how to configure in apache oozie?


Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. OozieCoordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. ...Oozie is a scalable, reliable and extensible system.