To connect to databases through Unix




Unix Oracle Command

SURESH BABU KUNKU@SURESH ~
$ sqlplus system/suresh@xe

SQL*Plus: Release 11.2.0.2.0 Production on Sun Jan 29 09:35:32 2017

Copyright (c) 1982, 2010, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - Production

SQL>


To see the records inside a table

SQL> select * from table_name;

To come out of the SQL prompt

SQL> exit
Disconnected from Oracle Database 11g Express Edition Release 11.2.0.2.0 - Production


An example of spooling SQL output to a file and sending it by mail through a Unix script:

cat myscript.sh
#!/bin/bash
PATH=/u03/appl/home/suresh:${PATH}

sqlplus arbor/arbor123@f1pognl <<EOS
    SPOOL /u03/appl/farbat1/suresh/spool.out
set linesize 2000
set wrap off
select sum(3+7) from dual;
    SPOOL OFF
exit
EOS
mail -s "sql file" sureshbabuk782@gmail.com < /u03/appl/farbat1/suresh/spool.out


Unix MySQL command

SURESH BABU KUNKU@SURESH ~
$ mysql -u username -p

mysql> SHOW DATABASES;
...to list the databases which exist, or:
mysql> USE dbname;
...to begin using the database named dbname, or:
mysql> SELECT * FROM table_name;
...to display all records from table_name.
To end your mysql session and return to the shell prompt, use the command:
mysql> QUIT
Running MySQL Commands From A Batch Script
Instead of using mysql interactively, you can execute MySQL statements from a script file. For instance, if you have a text file named mysqlscript.txt containing MySQL commands, one per line, you could use this command:
mysql -u username -p db_name < mysqlscript.txt > output.txt
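A hypothetical mysqlscript.txt (the database and table names below are placeholders) might contain:

-- mysqlscript.txt: each statement runs in order, non-interactively
USE db_name;
SHOW TABLES;
SELECT * FROM table_name;

The output of every statement is then collected in output.txt.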



Mailx command


Once installed, the mailx command can be directly referenced with the name mail, so you can just type mail at the command line.

1. Simple mail

Run the following command, and then mailx would wait for you to enter the message of the email. You can hit enter for new lines. When done typing the message, press Ctrl+D and mailx would display EOT.
After that, mailx automatically delivers the email to the destination.
$ mail -s "This is the subject" someone@example.com
Hi someone
How are you
I am fine
Bye
EOT

2. Take message from a file

The message body of the email can be taken from a file as well.
$ mail -s "This is Subject" someone@example.com < /path/to/file
The message can also be piped in using the echo command:
$ echo "This is message body" | mail -s "This is Subject" someone@example.com

3. Multiple recipients

To send the mail to multiple recipients, specify all the addresses separated by commas:
$ echo "This is message body" | mail -s "This is Subject" someone@example.com,someone2@example.com

4. CC and BCC

The "-c" and "-b" options can be used to add CC and BCC addresses respectively.
$ echo "This is message body" | mail -s "This is Subject" -c ccuser@example.com someone@example.com

5. Specify From name and address

To specify a "FROM" name and address, use the "-r" option. The name should be followed by the address wrapped in "<>".
$ echo "This is message body" | mail -s "This is Subject" -r "Harry<harry@gmail.com>" someone@example.com

6. Specify "Reply-To" address

The "Reply-To" address is set with the internal option variable "replyto" using the "-S" option.
# replyto email
$ echo "This is message" | mail -s "Testing replyto" -S replyto="mark@gmail.com" someone@example.com

# replyto email with a name
$ echo "This is message" | mail -s "Testing replyto" -S replyto="Mark<mark@gmail.com>" someone@example.com

7. Attachments

Attachments can be added with the "-a" option.
$ echo "This is message body" | mail -s "This is Subject" -r "Harry<harry@gmail.com>" -a /path/to/file someone@example.com
Finger

In Unix, finger is a program you can use to find information about computer users. It usually lists the login name, the full name, and possibly other details about the user you are fingering. These details may include the office location and phone number (if known), login time, idle time, time mail was last read, and the user's plan and project files. The information listed varies, and you may not be able to get any information from some sites.
[/u03/appl/farbat1/suresh]$ finger -p <ch>
Login name: ankloe                      In real life: Sysadmin CH
Directory: /home_ldap/ankloe            Shell: /bin/bash
Never logged in.
No unread mail

Login name: bjholt                      In real life: Sysadmin CH
Directory: /home_ldap/bjholt            Shell: /bin/bash
Never logged in.
No unread mail

Login name: roleem                      In real life: SysAdmin CH
Directory: /home/roleem                 Shell: /bin/bash
Never logged in.
No unread mail
[/u03/appl/farbat1/suresh]$ finger -p suresh
Login name: skunku                      In real life: Suresh Kunku
Directory: /home_ldap/skunku            Shell: /bin/bash
On since Sep 28 12:00:30 on pts/2 from nlamsp6xen152.upcit.ds.upc.biz
No unread mail

[/u03/appl/farbat1/suresh]$
SQL queries



Tables
This is a query to get all Oracle tables that can be viewed by the current user.

select TABLE_NAME, OWNER from SYS.ALL_TABLES order by OWNER, TABLE_NAME 

The query can be filtered to return tables for a given schema by adding a where OWNER = 'some_schema' clause to the query.
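For example, to list only the tables owned by a hypothetical HR schema (substitute the schema name you are interested in):

select TABLE_NAME, OWNER from SYS.ALL_TABLES where OWNER = 'HR' order by TABLE_NAME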

Schemas
This is a query to get all Oracle schemas in an Oracle database instance.

select USERNAME from SYS.ALL_USERS order by USERNAME

Views
This is a query to get all Oracle views that can be viewed by the current user.

select VIEW_NAME, OWNER from SYS.ALL_VIEWS order by OWNER, VIEW_NAME 

The query can be filtered to return views for a specific schema by adding a where OWNER = 'some_schema' clause to the query.

Packages
This is a query to get all Oracle packages that can be viewed by the current user.

select OBJECT_NAME, OWNER from SYS.ALL_OBJECTS where UPPER(OBJECT_TYPE) = 'PACKAGE' order by OWNER, OBJECT_NAME 

To query for package bodies, substitute PACKAGE BODY for PACKAGE.

The query can be filtered to return packages for a specific schema by adding a where OWNER = 'some_schema' clause to the query.

Procedures
This is a query to get all Oracle procedures that can be viewed by the current user.

select OBJECT_NAME, OWNER from SYS.ALL_OBJECTS where upper(OBJECT_TYPE) = upper('PROCEDURE') order by OWNER, OBJECT_NAME 

The query can be filtered to return procedures for a specific schema by adding a where OWNER = 'some_schema' clause to the query.

Procedure Columns
This is a query to get the columns (arguments) of an Oracle procedure, in declaration order.

select OWNER, OBJECT_NAME, ARGUMENT_NAME, DATA_TYPE, IN_OUT from SYS.ALL_ARGUMENTS order by OWNER, OBJECT_NAME, SEQUENCE

Functions
This is a query to get all Oracle functions for the current user.

select OBJECT_NAME, OWNER from SYS.ALL_OBJECTS where upper(OBJECT_TYPE) = upper('FUNCTION') order by OWNER, OBJECT_NAME 

The query can be filtered to return functions for a specific schema by adding a where OWNER = 'some_schema' clause to the query.

Triggers
This is a query to get all Oracle triggers that can be viewed by the current user.

select TRIGGER_NAME, OWNER from SYS.ALL_TRIGGERS order by OWNER, TRIGGER_NAME

The query can be filtered to return triggers for a specific schema by adding a where OWNER = 'some_schema' clause to the query.

Indexes
This is a query to get all Oracle indexes.

select INDEX_NAME, TABLE_NAME, TABLE_OWNER from SYS.ALL_INDEXES order by TABLE_OWNER, TABLE_NAME, INDEX_NAME


Hadoop admin interview questions and answers



Which operating system(s) are supported for production Hadoop deployment?

The main supported operating system is Linux. However, with some additional software Hadoop can be
deployed on Windows.

What is the role of the namenode?

The namenode is the "brain" of the Hadoop cluster and is responsible for managing the distribution of blocks on the system based on the replication policy. The namenode also supplies the specific addresses for the data based on client requests.

What happens on the namenode when a client tries to read a data file?

The namenode will look up the information about the file in the edit log and then retrieve the remaining information from the in-memory filesystem snapshot. Since the namenode needs to support a large number of clients, the primary namenode will only send back the data locations. The datanode itself is responsible for the retrieval.

What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and
datanodes)?

There are no special requirements for datanodes. However, the namenodes require a specified amount of RAM to store the filesystem image in memory. Based on the design of the primary namenode and secondary namenode, the entire filesystem information will be stored in memory. Therefore, both namenodes need to have enough memory to contain the entire filesystem image.

What mode(s) can Hadoop code be run in?

Hadoop can be deployed in standalone mode, pseudo-distributed mode, or fully-distributed mode.
Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine and as a single process for testing purposes.
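A quick way to see which daemons a machine is actually running is the JDK's jps tool; on a pseudo-distributed node all five daemons appear on the single host (the process IDs below are illustrative):

$ jps
4821 NameNode
4903 DataNode
4987 SecondaryNameNode
5100 JobTracker
5177 TaskTracker
5243 Jps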

How would a Hadoop administrator deploy the various components of Hadoop in production?

Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes. There is a need for only one namenode and jobtracker on the system. The number of datanodes depends on the available hardware.

What is the best practice to deploy the secondary namenode?

Deploy the secondary namenode on a separate standalone machine. This way it will not interfere with primary namenode operations. The secondary namenode must have the same memory requirements as the main namenode.

Is there a standard procedure to deploy Hadoop?

No, there are some differences between the various distributions. However, they all require that the Hadoop jars be installed on the machine. There are some common requirements for all Hadoop distributions, but the specific procedures will be different for different vendors since they all have some degree of proprietary software.

What is the role of the secondary namenode?

The secondary namenode performs the CPU-intensive operation of combining the edit logs with the current filesystem snapshot. It was separated out as its own process due to these CPU-intensive operations and the additional requirement of metadata backup.

What are the side effects of not running a secondary name node?

The cluster performance will degrade over time since the edit log will grow bigger and bigger. If the secondary namenode is not running at all, the edit log will grow significantly and it will slow the system down. Also, the system will go into safemode for an extended time at the next restart, since the namenode needs to combine the edit log and the current filesystem checkpoint image.

What happens if a datanode loses network connection for a few minutes?

The namenode will detect that the datanode is not responsive and will start replication of the data from the remaining replicas. The replication factor is actively maintained by the namenode: it monitors the status of all datanodes and keeps track of which blocks are located on each node. The moment a datanode becomes unavailable, it triggers replication of the data from the existing replicas. When the datanode comes back online, the over-replicated data will be deleted. Note: the data might be deleted from the original datanode.

What happens if one of the datanodes has a much slower CPU?

The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact. Hadoop was specifically designed to work with commodity hardware, and speculative execution helps to offset the slow workers. Multiple instances of the same task are created, the jobtracker takes the first result into consideration, and the second instance of the task is killed.

What is speculative execution?

If speculative execution is enabled, the job tracker will issue multiple instances of the same
task on multiple nodes and it will take the result of the task that finished first. The other
instances of the task will be killed.
Speculative execution is used to offset the impact of slow workers in the cluster. The jobtracker creates multiple instances of the same task and takes the result of the first successful one; the rest of the tasks are discarded.
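Speculative execution is controlled per job through the classic Hadoop 1.x MapReduce properties shown below; the jar and class names are placeholders, and the -D overrides only take effect when the job's driver parses generic options via ToolRunner:

$ hadoop jar myjob.jar MyJobClass \
    -D mapred.map.tasks.speculative.execution=false \
    -D mapred.reduce.tasks.speculative.execution=false \
    input_dir output_dir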

How many racks do you need to create an Hadoop cluster in order to make sure that the cluster
operates reliably?

In order to ensure reliable operation it is recommended to have at least 2 racks with rack placement configured. Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.

Are there any special requirements for namenode?

Yes, the namenode holds information about all files in the system and needs to be extra reliable. The namenode is a single point of failure. It needs to be extra reliable, and the metadata needs to be replicated in multiple places. Note that the community is working on solving the single point of failure issue with the namenode.

If you have a file of 128M in size and the replication factor is set to 3, how many blocks can you find on the cluster that will correspond to that file (assuming the default Apache and Cloudera configuration)?

6
Based on the configuration settings the file will be divided into multiple blocks according to the default
block size of 64M. 128M / 64M = 2 . Each block will be replicated according to replication factor settings
(default 3). 2 * 3 = 6 .

What is distributed copy (distcp)?

Distcp is a Hadoop utility for launching MapReduce jobs to copy data. The primary usage is for copying a large amount of data. One of the major challenges in the Hadoop environment is copying data across multiple clusters, and distcp will allow multiple datanodes to be leveraged for parallel copying of the data.
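A typical invocation copies a directory tree between two clusters (the namenode addresses and paths below are placeholders):

$ hadoop distcp hdfs://namenode1:8020/path/src hdfs://namenode2:8020/path/dest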

What is replication factor?

Replication factor controls how many times each individual block is replicated.
Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of failure.
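The replication factor of existing files can be changed from the Hadoop shell; for example (the path is a placeholder):

$ hadoop fs -setrep -w 3 /user/suresh/data

The -w flag makes the command wait until the requested replication level is actually reached.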

What daemons run on Master nodes?

NameNode, Secondary NameNode and JobTracker.
Hadoop is comprised of five separate daemons, and each of these daemons runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on master nodes. DataNode and TaskTracker run on each slave node.

What is rack awareness?

Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions. Hadoop will try to minimize the network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this due to rack awareness.
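Rack awareness is typically enabled by pointing the topology.script.file.name property at a script that maps an IP address or hostname to a rack path. A minimal sketch of such a script (the rack names and the address pattern are made up for illustration):

#!/bin/bash
# Print one rack path per argument; fall back to a default rack.
for host in "$@"; do
    case $host in
        10.1.1.*) echo "/dc1/rack1" ;;
        10.1.2.*) echo "/dc1/rack2" ;;
        *)        echo "/default-rack" ;;
    esac
done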

What is the role of the jobtracker in a Hadoop cluster?

The jobtracker is responsible for scheduling tasks on slave nodes, collecting results, and retrying failed tasks. The jobtracker is the main component of the MapReduce execution. It controls the division of the job into smaller tasks, submits tasks to the individual tasktrackers, tracks the progress of the jobs, and reports results back to the calling code.

How does the Hadoop cluster tolerate datanode failures?

Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
The namenode actively tracks the status of all datanodes and acts immediately if a datanode becomes non-responsive. The namenode is the central "brain" of HDFS and starts replication of the data the moment a disconnect is detected.

What is the procedure for namenode recovery?

A namenode can be recovered in two ways: starting a new namenode from backup metadata, or promoting the secondary namenode to primary namenode.
The namenode recovery procedure is very important to ensure the reliability of the data. It can be accomplished by starting a new namenode using backup data or by promoting the secondary namenode to primary.

The WebUI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?

This means that the namenode is trying to retrieve data from those datanodes by moving the replicas to the remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes before decommissioning has finished.
Due to the replication strategy it is possible to lose some data when datanodes are removed en masse prior to completing the decommissioning process. Decommissioning refers to the namenode retrieving data from the decommissioning datanodes by moving the replicas to the remaining datanodes.
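Decommissioning is normally driven by the excludes file referenced by the dfs.hosts.exclude property in hdfs-site.xml; a sketch of the usual steps (the file path and hostname are placeholders):

$ echo "datanode5.example.com" >> /etc/hadoop/conf/dfs.exclude
$ hadoop dfsadmin -refreshNodes

The node is then reported as decommissioning in the WebUI until all of its blocks have been re-replicated elsewhere.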

What does the Hadoop administrator have to do after adding new datanodes to the Hadoop
cluster?

Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.
The Hadoop cluster will detect new datanodes automatically. However, in order to optimize the cluster performance it is recommended to run the balancer to redistribute the data between the datanodes evenly.
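The balancer ships with Hadoop and can be started from the command line; the optional threshold is the allowed percentage deviation in disk usage between nodes:

$ start-balancer.sh -threshold 10

It runs in the background and can be stopped at any time with stop-balancer.sh.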

If the Hadoop administrator needs to make a change, which configuration file does he need to
change?

Each node in the Hadoop cluster has its own configuration files, and the changes need to be made in every file. One of the reasons for this is that the configuration can be different for every node.

Map Reduce jobs are failing on a cluster that was just restarted. They worked before the restart. What could be wrong?

The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.
This is a very common mistake by Hadoop administrators when there is no secondary namenode on the cluster and the cluster has not been restarted in a long time. The namenode will go into safemode and combine the edit log with the current filesystem checkpoint.
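Safe mode can be inspected and, if necessary, left manually with dfsadmin:

$ hadoop dfsadmin -safemode get
$ hadoop dfsadmin -safemode wait
$ hadoop dfsadmin -safemode leave

Forcing the namenode out of safe mode with leave should be a last resort; wait simply blocks until the namenode leaves safe mode on its own.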

Map Reduce jobs take too long. What can be done to improve the performance of the cluster?

One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of the tasks. The number of tasks has to match the number of available slots on the cluster.
Hadoop is not a hardware-aware system. It is the responsibility of the developers and the administrators to make sure that the resource supply and demand match.

How often do you need to reformat the namenode?

Never. The namenode needs to be formatted only once, in the beginning. Reformatting the namenode will lead to loss of the data on the entire cluster.
The namenode is the only system that needs to be formatted, and only once. Formatting creates the directory structure for the filesystem metadata and creates the namespaceID for the entire file system.
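The one-time format is performed before HDFS is started for the first time:

$ hadoop namenode -format

Running this against an existing cluster destroys the filesystem metadata, which is why it is never repeated.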

After increasing the replication level, I still see that data is under replicated. What could be wrong?

Data replication takes time due to the large quantities of data. The Hadoop administrator should allow sufficient time for data replication.
Depending on the data size, the data replication will take some time. The Hadoop cluster still needs to copy data around, and if the data size is big enough it is not uncommon for replication to take from a few minutes to a few hours.
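Progress can be watched with fsck, which reports under-replicated blocks for a path:

$ hadoop fsck / -blocks

Re-running it periodically shows the number of under-replicated blocks shrinking as the cluster catches up.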