Infosys Hadoop interview questions


What is Distribute By in Apache Hive?

Cluster By and Distribute By are used mainly with Transform/MapReduce scripts, but they are also useful in SELECT statements when there is a need to partition and sort the output of a query for subsequent queries.
Cluster By is a shortcut for both Distribute By and Sort By.
Hive uses the columns in Distribute By to distribute the rows among reducers: all rows with the same Distribute By columns go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
For example, suppose we Distribute By x the following 5 rows across 2 reducers:
x1
x2
x4
x3
x1
Reducer 1 got
x1
x2
x1
Reducer 2 got
x4
x3
Note that all rows with the same key x1 are guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.
In contrast, if we use Cluster By x, the two reducers will further sort rows on x:
Reducer 1 got
x1
x1
x2
Reducer 2 got
x3
x4
Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.
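As a minimal HiveQL sketch (the table t and columns x, y here are hypothetical), the following queries illustrate the equivalence and the more flexible form:

-- Distribute rows on x and sort within each reducer on x:
SELECT x, y FROM t DISTRIBUTE BY x SORT BY x;

-- Equivalent shortcut:
SELECT x, y FROM t CLUSTER BY x;

-- Partition and sort columns may differ (partition columns as a prefix of sort columns):
SELECT x, y FROM t DISTRIBUTE BY x SORT BY x ASC, y DESC;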

What is Hive partitioning?

A simple query in Hive reads the entire dataset, even if there is a WHERE clause filter. This becomes a bottleneck for running MapReduce jobs over a large table. We can overcome this issue by implementing partitions in Hive. Hive makes it easy to implement partitions through the partitioning scheme declared when the table is created.
In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried, thereby reducing the I/O and time required by the query.
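A minimal sketch (the table, columns, and partition values are hypothetical):

-- Each distinct dt value becomes a sub-directory under the table's HDFS directory.
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (dt STRING);

-- Write into one partition.
INSERT OVERWRITE TABLE sales PARTITION (dt='2016-01-01')
SELECT id, amount FROM staging_sales WHERE order_date = '2016-01-01';

-- Only the dt='2016-01-01' sub-directory is read (partition pruning), not the whole table.
SELECT SUM(amount) FROM sales WHERE dt = '2016-01-01';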

What is Oozie, and how do you configure a job in Apache Oozie?


Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is a scalable, reliable and extensible system.
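To configure a job, you typically place the workflow application (workflow.xml plus any libraries) on HDFS, point a job.properties file at it, and submit it with the oozie command-line tool. A minimal sketch, assuming hypothetical hostnames and paths:

# job.properties (hosts and paths are placeholders)
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.wf.application.path=${nameNode}/user/hadoop/apps/my-workflow

# Submit and start the workflow from the shell
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run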

Accenture Hadoop interview questions

Nowadays interviewers rarely ask direct questions; most follow real-time, issue-based scenarios.

1. I have a 4 MB Hive jar file that is scheduled via crontab to run on a daily basis. After 6 months we got a disk space issue while running the job, even though disk space is available on the server. Where is the disk space issue coming from, and how do we resolve it?

Ans:
This is most likely a heap size error (the expected reason).
It can be avoided by:
1. Providing a larger heap size to the tasks:
   set mapreduce.map.memory.mb=5120;
   set mapreduce.map.java.opts=-Xmx4608m;
   set mapreduce.reduce.memory.mb=5120;
   set mapreduce.reduce.java.opts=-Xmx4608m;

2. Clearing the executor logs on a periodic basis along with the job execution (with the help of a shell script and crontab); see the sketch after this answer.

OR

Alternatively, if it is HDFS, one possible answer could be that the space/quota allocated to a folder/user is exhausted while space is still available in the cluster/server, or that some tmp/staging space is not being cleaned up.
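A minimal sketch of the periodic log cleanup from point 2 above (the log path, retention window, and schedule are all hypothetical):

#!/bin/bash
# cleanup_logs.sh: remove task logs older than 7 days (path is a placeholder)
find /var/log/hadoop/userlogs -type f -mtime +7 -delete

# crontab entry: run the cleanup daily at 01:00
0 1 * * * /home/hadoop/cleanup_logs.sh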

2. What are the complex data types in Pig?


Complex Types: Pig supports three complex data types. They are listed below:

Tuple: An ordered set of fields. A tuple is written with parentheses. Example: (1,2)
Bag: A collection of tuples. A bag is written with curly braces. Example: {(1,2),(3,4)}
Map: A set of key-value pairs. A map is written with square brackets, with # separating key and value. Example: [key#value]

3. How do you pass data at run time to a Hive query in crontab?

Like Pig and other scripting languages, Hive gives you the ability to create parameterized scripts, greatly increasing their re-usability. To take advantage of this, write your Hive scripts like this:

select yearid, sum(HR)
from   batting_stats
where  teamid = '${hiveconf:TEAMID}'
group  by yearid
order  by yearid desc;
Note that the restriction on teamid is '${hiveconf:TEAMID}' rather than an actual value. This is an instruction to read this variable's value from the hiveconf namespace. When you execute the script, you'll run it as shown below:

hive -f batting.hive -hiveconf TEAMID='LAA'
If you define the parameter in the script but fail to specify a value at run time, you won't get an error as you would with Pig. Instead, the restriction effectively becomes where teamid = ''. If you have blank values you might get a result back; if not, you'll go through all the mechanics of executing the script without any results.
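To schedule the parameterized script from crontab, the same invocation goes directly into a cron entry. A minimal sketch (the paths and schedule are hypothetical):

# crontab entry: run the parameterized Hive script daily at 02:00
0 2 * * * /usr/bin/hive -f /home/hadoop/scripts/batting.hive -hiveconf TEAMID='LAA' >> /home/hadoop/logs/batting.log 2>&1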