What is Distribute By in Apache Hive?
Cluster By and Distribute By are used mainly with Transform/Map-Reduce
scripts. However, they are sometimes useful in SELECT statements when the
output of a query needs to be partitioned and sorted for subsequent queries.
Cluster By is a short-cut for
both Distribute By and Sort By.
Hive uses the columns in Distribute
By to distribute the rows among reducers. All rows with the same Distribute
By columns will go to the same reducer. However, Distribute By does
not guarantee clustering or sorting properties on the distributed keys.
For example, suppose we Distribute By x on the following 5 rows across 2 reducers:
x1
x2
x4
x3
x1

Reducer 1 got:
x1
x2
x1

Reducer 2 got:
x4
x3

Note that all rows with the same key x1 are guaranteed to be distributed to
the same reducer (reducer 1 in this case), but they are not guaranteed to be
clustered in adjacent positions.
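
The routing behaviour described above can be sketched as a toy simulation in Python. This is only an illustration under stated assumptions: the function name distribute_by is hypothetical, and zlib.crc32 stands in for whatever hash Hive actually uses, so the reducer that each key lands on will not necessarily match the example above.

```python
import zlib
from collections import defaultdict

def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer chosen by a hash of its key.

    Rows keep their arrival order inside each reducer; nothing is sorted.
    """
    reducers = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in hash for illustration, not Hive's actual hash function.
        reducer_id = zlib.crc32(str(key(row)).encode("utf-8")) % num_reducers
        reducers[reducer_id].append(row)
    return dict(reducers)

rows = ["x1", "x2", "x4", "x3", "x1"]
buckets = distribute_by(rows, key=lambda r: r, num_reducers=2)
for reducer_id, bucket in sorted(buckets.items()):
    print("reducer", reducer_id, "got", bucket)
```

Both x1 rows always end up in the same bucket, but another key that hashes to that bucket may sit between them, which is the missing clustering guarantee.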
In contrast, if we use Cluster By x, the two reducers will further sort rows on x:

Reducer 1 got:
x1
x1
x2

Reducer 2 got:
x3
x4

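
As a rough sketch of the same idea, Cluster By can be modelled as the routing step followed by a sort inside each reducer. The names reducer_for and cluster_by are hypothetical, and crc32 is again only a stand-in hash, so the assignment of keys to reducers may differ from the example above.

```python
import zlib

def reducer_for(key_value, num_reducers):
    # crc32 is a stand-in hash for illustration, not Hive's actual hash function.
    return zlib.crc32(str(key_value).encode("utf-8")) % num_reducers

def cluster_by(rows, key, num_reducers):
    """Distribute rows by key, then sort each reducer's rows by that same key."""
    buckets = {}
    for row in rows:
        buckets.setdefault(reducer_for(key(row), num_reducers), []).append(row)
    # Sorting happens per reducer only; there is no global order across reducers.
    return {rid: sorted(bucket, key=key) for rid, bucket in buckets.items()}

rows = ["x1", "x2", "x4", "x3", "x1"]
clustered = cluster_by(rows, key=lambda r: r, num_reducers=2)
for reducer_id, bucket in sorted(clustered.items()):
    print("reducer", reducer_id, "got", bucket)
```

After the per-reducer sort, rows with equal keys are adjacent within each reducer's output.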
Instead of specifying Cluster By,
the user can specify Distribute By and Sort By, so the partition
columns and sort columns can be different. The usual case is that the partition
columns are a prefix of sort columns, but that is not required.
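
To illustrate partition columns that differ from sort columns, here is a minimal sketch assuming hypothetical (user, timestamp) rows: rows are distributed by user only, but sorted by (user, timestamp), so the partition column is a prefix of the sort columns. The function name and the data are made up for the example, and crc32 is a stand-in for Hive's hash.

```python
import zlib

def distribute_and_sort(rows, distribute_key, sort_key, num_reducers):
    """Route rows by distribute_key, then sort each reducer's rows by sort_key."""
    buckets = {}
    for row in rows:
        # Stand-in hash for illustration, not Hive's actual hash function.
        rid = zlib.crc32(str(distribute_key(row)).encode("utf-8")) % num_reducers
        buckets.setdefault(rid, []).append(row)
    return {rid: sorted(bucket, key=sort_key) for rid, bucket in buckets.items()}

# Hypothetical (user, timestamp) events.
events = [("alice", 30), ("bob", 10), ("alice", 10), ("bob", 20), ("alice", 20)]
result = distribute_and_sort(
    events,
    distribute_key=lambda r: r[0],       # partition column: user
    sort_key=lambda r: (r[0], r[1]),     # sort columns: user, then timestamp
    num_reducers=2,
)
for reducer_id, bucket in sorted(result.items()):
    print("reducer", reducer_id, "got", bucket)
```

Each user's events land on one reducer and come out in timestamp order there, which a single Cluster By on user alone would not guarantee.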
What is partitioning in Hive?
Without partitions, a simple query in Hive reads the entire dataset even if
it has a WHERE clause filter. This becomes a bottleneck when running MapReduce
jobs over a large table. We can overcome this issue by implementing partitions
in Hive. Hive makes it very easy to implement partitions by using the
automatic partition scheme when the table is created.
In Hive’s implementation of partitioning, data within a table is split across
multiple partitions. Each partition corresponds to particular value(s) of the
partition column(s) and is stored as a sub-directory within the table’s
directory on HDFS. When the table is queried, only the required partitions of
the table are read where applicable, thereby reducing the I/O and time
required by the query.
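
The sub-directory layout and the resulting I/O saving can be sketched with a small in-memory model. This is not how Hive is implemented; the table directory /warehouse/sales, the country column, and the function names are all hypothetical, and a dict stands in for HDFS. The column=value naming of sub-directories follows Hive's usual convention.

```python
from collections import defaultdict

def write_partitioned(rows, partition_col, table_dir):
    """Group rows into one sub-directory per partition value (col=value)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{table_dir}/{partition_col}={row[partition_col]}"].append(row)
    return dict(partitions)

def scan(partitions, table_dir, partition_col, value):
    """Return (matching rows, rows read), reading only the matching sub-directory."""
    wanted = f"{table_dir}/{partition_col}={value}"
    rows = partitions.get(wanted, [])
    return rows, len(rows)

# Hypothetical table contents.
sales = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
    {"id": 4, "country": "DE"},
]
partitions = write_partitioned(sales, "country", "/warehouse/sales")
matched, rows_read = scan(partitions, "/warehouse/sales", "country", "US")
print(sorted(partitions))
print(matched, rows_read)
```

A filter on the partition column touches only one sub-directory, so the number of rows read is smaller than the table size.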
What is Apache Oozie, and how do you configure it?
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie
Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator
jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data
availability. ...Oozie is a scalable, reliable and extensible system.