awk - Read a file and split the contents
awk is one of the most powerful utilities in the Unix
world. When it comes to text parsing, sed and awk do some unbelievable
things. In this first article on awk, we will see the basic usage of awk.
The syntax of awk is:
awk 'pattern{action}' file
where the pattern indicates the condition on which the action is to be executed for every matching line. If the pattern is not present, the action will be executed for every line of the file. If the action part is not present, the default action of printing the line will be done. Let us see some examples:
Assume a file, say file1, with the following content:
$ cat file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
This file has 2 fields in it. The first field
indicates the name of a person, the second field denotes their domain,
and the first line is the header record.
1. To print only the names present in the file:
$ awk '{print $1}' file1
Name
Deepak
Neha
Vijay
Guru
The above awk command does not have any pattern
or condition. Hence, the action will be executed on every line of the file. The
action statement reads "print $1". awk, while reading a file, splits
the different columns into $1, $2, $3 and so on. Hence the first column is
accessible using $1, the second using $2, etc. The above command therefore prints
all the names, which happen to be in the first column of the file.
2. Similarly, to print the second column of the file:
$ awk '{print $2}' file1
Domain
Banking
Telecom
Finance
Migration
3. In the first example, the list of
names got printed along with the header record. How to omit the header
record and get only the names printed?
$ awk 'NR!=1{print $1}' file1
Deepak
Neha
Vijay
Guru
The above awk command uses a special variable
NR. NR denotes the line number, ranging from 1 to the actual line count. The
condition 'NR!=1' indicates not to execute the action part for the first line of
the file, and hence the header record gets skipped.
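To see what NR holds, print it alongside every line (a quick illustration on the same file):
$ awk '{print NR, $0}' file1
1 Name Domain
2 Deepak Banking
3 Neha Telecom
4 Vijay Finance
5 Guru Migration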
4. How do we print the entire file contents?
$ awk '{print $0}' file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
$0 stands for the entire line. And hence when we do
"print $0", the whole line gets printed.
5. How do we get the entire file content printed in another way?
$ awk '1' file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
The above awk command has only the pattern or
condition part, no action part. The '1' in the pattern indicates "true",
which means true for every line. As said above, when no action statement is
given, the default action is just to print the line, and hence the
entire file contents get printed.
Let us now consider a file with a delimiter. The delimiter used here is a comma. A comma-separated file is called a CSV file. Assuming the file contents to be:
$ cat file1
Name,Domain,Expertise
Deepak,Banking,MQ Series
Neha,Telecom,Power Builder
Vijay,Finance,CRM Expert
Guru,Migration,Unix
This file contains 3 fields. The new field is the
expertise of the respective person.
6. Let us try to print the first column of this csv file using the same method as mentioned in Point 1.
$ awk '{print $1}' file1
Name,Domain,Expertise
Deepak,Banking,MQ
Neha,Telecom,Power
Vijay,Finance,CRM
Guru,Migration,Unix
The output looks weird, doesn't it? We expected
only the first column to get printed, but it printed a little more, and
not consistently either. If you notice carefully, it printed every line till the
first space is encountered. awk, by default, uses whitespace as the
delimiter, which could be a single space, a tab or a series of spaces.
Hence our original file was split into fields based on spaces.
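To see the default whitespace splitting in action, here is a quick illustration: however many spaces or tabs separate two words, awk treats the whole run as a single delimiter:
$ echo "Deepak   Banking" | awk '{print $2}'
Banking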
Since our requirement now involves dealing with a file which is comma separated, we need to specify the delimiter.
$ awk -F"," '{print
$1}' file1
Name
Deepak
Neha
Vijay
Guru
awk has a command line option "-F" with which we
can specify the delimiter. Once the delimiter is specified, awk splits the file
on the basis of the delimiter specified, and hence we got the names by printing
the first column $1.
7. awk has a special variable called "FS" which stands for field separator. In place of the command line option "-F", we can also use "FS".
$ awk '{print $1,$3}' FS="," file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix
8. Similarly, to print the second
column:
$ awk -F, '{print $2}' file1
Domain
Banking
Telecom
Finance
Migration
9. To print the first and third columns, ie., the name and the expertise:
$ awk -F"," '{print $1,
$3}' file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix
10. The output shown above is not
easily readable since the third column has more than one word. It would be
better if the displayed fields were separated by a delimiter. Say,
let's use a comma to separate the output. Also, let's discard the header
record.
$ awk -F"," 'NR!=1{print
$1,$3}' OFS="," file1
Deepak,MQ Series
Neha,Power Builder
Vijay,CRM Expert
Guru,Unix
OFS is another awk special
variable. Just as FS is used to separate the input fields, OFS (output
field separator) is used to separate the output fields.
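One subtlety worth noting: OFS takes effect only when the fields are separated by a comma in the print statement. Writing them side by side, as in print $1 $3, concatenates the fields and bypasses OFS altogether:
$ awk -F"," 'NR!=1{print $1 $3}' OFS="," file1
DeepakMQ Series
NehaPower Builder
VijayCRM Expert
GuruUnix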
awk - Passing arguments or shell variables to awk
In one of our earlier articles, we saw how to read a file in
awk. At times, we might have some requirements wherein we need to pass some
arguments to the awk program or to access a shell variable or an environment
variable inside awk. Let us see in this article how to pass and access
arguments in awk:
Let us take a sample file with contents, and a variable "x":
$ cat file1
24
12
34
45
$ echo $x
3
Now, say we want to add the shell variable x to every value.
1. awk provides a "-v" option to pass arguments. Using this, we can pass the shell variable to it.
$ awk -v val=$x '{print $0+val}' file1
27
15
37
48
As seen above, the shell variable $x is assigned to
the awk variable "val". This variable "val" can directly be
accessed in awk.
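Multiple values can be passed by repeating the -v option. A small illustration (the variable name "mul" is just for this example), which adds x to every value and then doubles the result:
$ awk -v val=$x -v mul=2 '{print ($0+val)*mul}' file1
54
30
74
96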
2. awk provides another way of passing arguments to awk without using -v. Just before specifying the file name to awk, provide the shell variable assignments to awk variables as shown below:
$ awk '{print $0,val}' OFS=, val=$x file1
24,3
12,3
34,3
45,3
3. How to access environment
variables in awk? Unlike shell variables, awk provides a way to access the
environment variables without passing them as above. awk has a special variable
ENVIRON which does the needful.
$ echo $x
3
$ export x
$ awk '{print $0,ENVIRON["x"]}' OFS=, file1
24,3
12,3
34,3
45,3
Quoting file content:
Sometimes we might have a requirement wherein we have to quote the file contents. Assume you have a file which contains the list of database tables, and for your requirement, you need to quote the file contents:
$ cat file
CUSTOMER
BILL
ACCOUNT
4. Pass a variable to awk which contains the single quote. Print the quote, the line, then the quote.
$ awk -v q="'" '{print q
$0 q}' file
'CUSTOMER'
'BILL'
'ACCOUNT'
5. Similarly, to double quote the
contents, pass the variable within single quotes:
$ awk '{print q $0 q}' q='"' file
"CUSTOMER"
"BILL"
"ACCOUNT"
awk - Match a pattern in a file in Linux
In one of our earlier articles on the awk series, we had seen the basic usage of awk or gawk. In this,
we will see mainly how to search for a pattern in a file using awk:
searching for the pattern in the entire line or in a specific column.
Let us consider a csv file with the following contents. The data in the csv file is a kind of expense report. Let us see how to use awk to filter data from the file.
$ cat file
Medicine,200
Grocery,500
Rent,900
Grocery,800
Medicine,600
1. To print only the records
containing Rent:
$ awk '$0 ~ /Rent/{print}' file
Rent,900
~ is the symbol used for pattern
matching. The / / symbols are used to specify the pattern. The above line
indicates: If the line($0) contains(~) the pattern Rent, print the line.
The 'print' statement by default prints the entire line. This is actually a
simulation of the grep command using awk.
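For comparison, the equivalent grep command:
$ grep Rent file
Rent,900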
2. awk, by default, does pattern matching on the entire line, and hence $0 can be left off as shown below:
$ awk '/Rent/{print}' file
Rent,900
3. Since awk prints the line by
default on a true condition, print statement can also be left off.
$ awk '/Rent/' file
Rent,900
In this example, whenever the line contains Rent,
the condition becomes true and the line gets printed.
4. In the above examples, the pattern matching is done on the entire line; however, the pattern we are looking for is only in the first column. This might lead to incorrect results if the file contains the word Rent in other places. To match a pattern only in the first column($1):
$ awk -F, '$1 ~ /Rent/' file
Rent,900
The -F option in awk is used to specify
the delimiter. It is needed here since we are going to work on the specific
columns which can be retrieved only when the delimiter is known.
5. The above pattern match will also match if the first column contains "Rents". To match exactly for the word "Rent" in the first column:
$ awk -F, '$1=="Rent"'
file
Rent,900
6. To print only the 2nd column
for all "Medicine" records:
$ awk -F, '$1 == "Medicine"{print $2}' file
200
600
7. To match for patterns
"Rent" or "Medicine" in the file:
$ awk '/Rent|Medicine/' file
Medicine,200
Rent,900
Medicine,600
8. Similarly, to match for
this above pattern only in the first column:
$ awk -F, '$1 ~ /Rent|Medicine/' file
Medicine,200
Rent,900
Medicine,600
9. What if the first column
contains the word "Medicines"? The above example will match it
as well. In order to match exactly only for Rent or Medicine:
$ awk -F, '$1 ~ /^Rent$|^Medicine$/' file
Medicine,200
Rent,900
Medicine,600
The ^ symbol anchors the pattern to the
beginning of the field being matched, and $ to its end. ^Rent$ hence matches exactly
the word Rent in the first column, and the same holds for the word Medicine
as well.
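As a quick illustration of anchoring only one end, the pattern /^M/ picks every record whose first column starts with M, whatever follows:
$ awk -F, '$1 ~ /^M/' file
Medicine,200
Medicine,600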
10. To print the lines which do not contain the pattern Medicine:
$ awk '!/Medicine/' file
Grocery,500
Rent,900
Grocery,800
The ! is used to negate the pattern search.
11. To negate the pattern only on the first column:
$ awk -F, '$1 !~ /Medicine/' file
Grocery,500
Rent,900
Grocery,800
12. To print all records whose
amount is greater than 500:
$ awk -F, '$2>500' file
Rent,900
Grocery,800
Medicine,600
13. To print the Medicine record
only if it is the 1st record:
$ awk 'NR==1 && /Medicine/'
file
Medicine,200
This is how the logical
AND(&&) condition is used in awk. The record is
retrieved only if it is the first record(NR==1) and it is a Medicine
record.
14. To print all those Medicine records whose amount is greater than 500:
$ awk -F, '/Medicine/ &&
$2>500' file
Medicine,600
15. To print all the Medicine records
and also those records whose amount is greater than 600:
$ awk -F, '/Medicine/ || $2>600' file
Medicine,200
Rent,900
Grocery,800
Medicine,600
This is how the
logical OR(||) condition is used in awk.
awk - split a file into multiple files
In this article of the awk series, we will see the
different scenarios in which we need to split a file into multiple files using awk. A file can be split into
multiple files either based on a condition, or based on a pattern, or because
the file is big and hence needs to be split into smaller files.
Sample File1:
Let us consider a sample file with the following contents:
$ cat file1
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. Split the file into 3 different files, one for each item.
i.e, All records pertaining to Item1 into a file, records of Item2 into
another, etc.
$ awk -F, '{print > $1}' file1
The files generated by the above
command are as below:
$ cat Item1
Item1,200
Item1,600
$ cat Item3
Item3,900
$ cat Item2
Item2,500
Item2,800
This looks so simple, right? print prints
the entire line, and the line is printed to a file whose name is $1, which is
the first field. This means, the first record will get written to a file named
'Item1', and the second record to 'Item2', third to 'Item3', 4th goes to
'Item2', and so on.
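One caveat worth noting: awk keeps every output file open until the program ends, and a file with many distinct values in the first column can hit the per-process open-file limit (traditional awk implementations are especially restrictive). A sketch of a safer pattern for such cases is to close each file after writing to it; the append operator >> is needed here, because a file reopened with > would get truncated:
$ awk -F, '{print >> $1; close($1)}' file1
Note that with >>, output also gets appended to any files left over from a previous run.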
2. Split the files by having an extension of .txt to the new file names.
$ awk -F, '{print > $1".txt"}' file1
The only change here from the above is
concatenating the string ".txt" to $1, which is the first field.
As a result, the file names get the .txt extension. The files created are
below:
$ ls *.txt
Item2.txt Item1.txt
Item3.txt
3. Split the files by having only the value(the second
field) in the individual files, i.e, only 2nd field in the new
files without the 1st field:
$ awk -F, '{print $2 > $1".txt"}' file1
The print command prints the
entire record. Since we want only the second field to go to the output files,
we do: print $2.
$ cat Item1.txt
200
600
4. Split the files so that all the items whose value is
greater than 500 are in the file "500G.txt",
and the rest in the file "500L.txt".
$ awk -F, '{if($2<=500)print >
"500L.txt";else print > "500G.txt"}' file1
The
output files created will be as below:
$ cat 500L.txt
Item1,200
Item2,500
$ cat 500G.txt
Item3,900
Item2,800
Item1,600
Check the
second field($2). If it is less than or equal to 500, the record goes to
"500L.txt", else to "500G.txt".
Another way to achieve the same thing is using the ternary operator in awk:
$ awk -F, '{x=($2<=500)?"500L.txt":"500G.txt"; print > x}' file1
The condition
for less than or equal to 500 is checked and the appropriate file name is
assigned to the variable x. The record is then written to the file named
in the variable x.
Sample File2:
Let us consider another file with a different set of contents. This file has a pattern 'START' at frequent intervals.
$ cat file2
START
Unix
Linux
START
Solaris
Aix
SCO
5. Split the file into multiple files at every occurrence of the pattern START.
$ awk '/START/{x="F"++i;}{print > x;}' file2
This command
contains 2 sets of curly braces: The control goes to the first set of braces
only on encountering a line containing the pattern START. The second set will
be encountered by every line since there is no condition and hence always true.
On encountering
the pattern START, a new file name is created and stored. When the first START
comes, x will contain "F1", the control goes to the next set of
braces, the record is written to F1, and the subsequent records go to the file
"F1" till the next START comes. On encountering the next START, x will
contain "F2" and the subsequent lines go to "F2" till the
next START, and so it continues.
$ cat F1
START
Unix
Linux
Solaris
$ cat F2
START
Aix
SCO
6. Split the file into multiple files at every occurrence of
the pattern START. But the line containing the pattern should
not be in the new files.
$ awk '/START/{x="F"++i;next}{print > x;}' file2
The only
difference in this from the above is the inclusion of the next command.
Due to the next command, a line containing START enters the first set of curly
braces, and awk then immediately starts reading the next line.
As a result, the START lines never reach the second set of curly braces, and hence
START does not appear in the split files.
$ cat F1
Unix
Linux
$ cat F2
Solaris
Aix
SCO
7. Split the file by inserting a header record in every new
file.
$ awk '/START/{x="F"++i;print "ANY HEADER" > x;next}{print >
x;}' file2
The change here from the earlier
one is this: Before the next command, we write the header record
into the file. This is the right place to write the header record since this is
where the file is created first.
$ cat F1
ANY HEADER
Unix
Linux
$ cat F2
ANY HEADER
Solaris
Aix
SCO
Sample File3:
Let us consider a file with the sample contents:
$ cat file3
Unix
Linux
Solaris
AIX
SCO
8. Split the file into multiple files at every 3rd line. i.e, the first 3 lines into F1, the next 3 lines into F2, and so on.
$ awk 'NR%3==1{x="F"++i;}{print > x}' file3
In other words,
this is nothing but splitting the file into equal parts. The condition does the
trick here: NR%3==1 : NR is the line number of the current record. NR%3 will be
equal to 1 for every 3rd line such as 1st, 4th, 7th and so on. And at every 3rd
line, the file name is changed in the variable x, and hence the records are
written to the appropriate files.
$ cat F1
Unix
Linux
Solaris
$ cat F2
AIX
SCO
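The same idea generalizes to chunks of any size by passing the size as a variable; a small sketch (the name "n" is just for this example):
$ awk -v n=2 'NR%n==1{x="F"++i;}{print > x}' file3
With n=2, F1 gets Unix and Linux, F2 gets Solaris and AIX, and F3 gets SCO.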
Sample File4:
Let us update the above file with a header and trailer:
$ cat file4
HEADER
Unix
Linux
Solaris
AIX
SCO
TRAILER
9. Split the file at every 3rd line without
the header and trailer in the new files.
$ sed '1d;$d;' file4 | awk 'NR%3==1{x="F"++i;}{print > x}'
The earlier command does the work for us; the only thing needed is to feed the
above command the file without the header and trailer, and sed does that for us. '1d' is
to delete the 1st line, '$d' to delete the last line.
$ cat F1
Unix
Linux
Solaris
$ cat F2
AIX
SCO
10. Split the file at every 3rd line, retaining the
header and trailer in every file.
$ awk 'BEGIN{getline f;}NR%3==2{x="F"++i;a[i]=x;print f>x;}{print > x}END{for(j=1;j<i;j++)print > a[j];}' file4
This one is a
little tricky. Before the file is processed, the first line is read using getline into
the variable f. NR%3 is checked against 2 instead of 1 as in the earlier case
because the first line is a header, so we need to split the file at the 2nd,
5th, 8th lines, and so on. All the file names are stored in the array
"a" for later processing.
Without the END label,
all the files will have the header record, but only the last file will have the
trailer record. So, the END label is to precisely write the trailer record to
all the files other than the last file.
$ cat F1
HEADER
Unix
Linux
Solaris
TRAILER
$ cat F2
HEADER
AIX
SCO
TRAILER
awk - Join or merge lines on finding a pattern
In one of our earlier articles, we had discussed
joining all lines in a file and also joining every 2 lines in a file. In this article, we will see
how we can join lines based on a pattern, or join lines on encountering a
pattern, using awk or gawk.
Let us assume a file with the following contents. There is a line with START in-between. We have to join all the lines following the pattern START.
$ cat file
START
Unix
Linux
START
Solaris
Aix
SCO
1. Join the lines following the
pattern START without any delimiter.
$ awk '/START/{if (NR!=1)print "";next}{printf $0}END{print "";}' file
UnixLinux
SolarisAixSCO
Basically, what we are
trying to do is: accumulate the lines following the START and print them
on encountering the next START statement. /START/ searches for lines
containing the pattern START. The commands within the first {} work only on
lines containing the START pattern. A blank line is printed if the line is not the
first line(NR!=1); without this condition, a blank line would appear at the very
beginning of the output since a START is encountered at the beginning.
The next command
prevents the remaining part of the command from getting executed for the START
lines. The second set of curly braces {} works only for the lines not containing
START. This part simply prints the line without a terminating newline
character(printf). As a result, we get all the lines after the
pattern START on the same line. The END label is put to print a newline at the
end, without which the prompt would appear at the end of the last line of output
itself.
2. Join the lines following the pattern START with space as delimiter.
$ awk '/START/{if (NR!=1)print "";next}{printf "%s ",$0}END{print "";}' file
Unix Linux
Solaris Aix SCO
This is the same as the earlier one, except that it
uses the format specifier %s in order to accommodate an additional space, which
is the delimiter in this case.
3. Join the lines following the pattern START with comma as delimiter.
$ awk '/START/{if (x)print x;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
Unix,Linux
Solaris,Aix,SCO
Here, we form a
complete line and store it in a variable x, and print the variable x
whenever a new pattern starts. The command x=(!x)?$0:x","$0 is
like the ternary operator in C or Perl. It means: if x is empty, assign the
current line($0) to x, else append a comma and the current line to x. As
a result, x will contain the lines following the START pattern joined with a
comma. And in the END label, x is printed, since for the last
group there will not be a following START pattern to trigger printing the earlier group.
4. Join the lines following the pattern START with comma as delimiter, including the pattern matching line.
$ awk '/START/{if (x)print x;x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
START,Unix,Linux
START,Solaris,Aix,SCO
The difference here is the missing next
statement. Because next is not there, the commands present in the second
set of curly braces are applicable for the START line as well, and hence it
also gets concatenated.
5. Join the lines following the pattern START with comma as delimiter, printing the pattern matching line as well. However, the pattern line itself should not be joined.
$ awk '/START/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
START
Unix,Linux
START
Solaris,Aix,SCO
In this, instead of forming START as part of the variable x,
the START line is printed. As a result, the START line comes out separately,
and the remaining lines get joined.
awk - 10 examples to group data in a CSV or text file
awk is very powerful when it comes
to file formatting. In this article, we will discuss some wonderful
grouping features of awk. awk can group data based on a column or field,
or on a set of columns. It uses the powerful associative array for grouping.
If you are new to awk, this article will be easier to understand if you go
over the article on how to parse a simple CSV file using awk.
Let us take a sample CSV file with
the below contents. The file is a kind of expense report containing items and
their prices. As seen, some expense items have multiple entries.
$ cat file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. To find the total of all numbers in the second column,
i.e., to find the sum of all the prices:
$ awk -F","
'{x+=$2}END{print x}' file
3000
The delimiter(-F) used
is comma since it's a comma-separated file. x+=$2 stands for x=x+$2.
When a line is parsed, the second column($2), which is the price, is added to
the variable x. At the end, the variable x contains the sum. This example is the
same as the earlier awk example of finding the sum of all numbers in a
file.
If your input file is a
plain text file, with the only difference being that the commas are not
present, all you need to make is one change: remove the -F"," part from
the above command. This is because the default delimiter in awk
is whitespace.
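As a small extension, the number of records read is always available in NR, so the average price can be printed the same way:
$ awk -F"," '{x+=$2}END{print x/NR}' file
600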
2. To find the total sum of a particular group entry alone, i.e., in this case, of "Item1":
$ awk -F, '$1=="Item1"{x+=$2;}END{print x}' file
800
This gives us the total sum
of all the items pertaining to "Item1". In the earlier example, no
condition was specified since we wanted awk to work on every line or record. In
this case, we want awk to work on only the records whose first column($1) is
equal to Item1.
3. If the data to be worked upon is present in a shell variable:
$ VAR="Item1"
$ awk -F, -v inp=$VAR '$1==inp{x+=$2;}END{print x}' file
800
-v is used to pass the shell variable to awk,
and the rest is same as the last one.
4. To find unique values of first column
$ awk -F, '{a[$1];}END{for (i in a)print i;}' file
Item1
Item2
Item3
Arrays in awk
are associative, and this is a very powerful feature. Associative arrays have an
index and a corresponding value. Example: a["Jan"]=30 means that
in the array a, "Jan" is an index with value 30. In our
case here, we use only the index without values. So, the command a[$1] works
like this: when the first record is processed, in the array named a, an
index value "Item1" is stored. During the second record, a new index
"Item2" is stored, during the third "Item3", and so on. During the 4th
record, since the "Item1" index is already there, no new index is
added, and the same continues.
Now, once the file is processed
completely, the control goes to the END label where we print all the index
items. The for loop in awk comes in 2 variants: the C-style for loop,
and the one used for associative arrays.
for i in a : This
means for every index in the array a . The variable "i"
holds the index value. In place of "i", it can be any variable name.
Since there are 3 elements in the array, the loop will run for 3 times, each
time holding the value of an index in the "i". And by printing
"i", we get the index values printed.
To understand the for loop better, look at this:
for (i in a)
{
print i;
}
Note: The order of the output in the above command may vary
from system to system. Associative arrays do not store the indexes in sequence,
and hence the order of the output need not be the same as the order in which the entries were added.
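If a fixed order is needed, the simplest approach is to pipe the output to sort:
$ awk -F, '{a[$1];}END{for (i in a)print i;}' file | sort
Item1
Item2
Item3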
5. To find the sum of individual group records. i.e, to sum all records pertaining to Item1 alone, Item2 alone, and so on.
$ awk -F, '{a[$1]+=$2;}END{for(i in a)print i", "a[i];}' file
Item1, 800
Item2, 1300
Item3, 900
a[$1]+=$2 can be
written as a[$1]=a[$1]+$2. This works like this: when the first record is
processed, a["Item1"] is assigned 200(a["Item1"]=200).
During the second "Item1" record, a["Item1"]=800 (200+600), and
so on. In this way, every index item in the array is stored with the
appropriate value associated to it, which is the sum of the group.
And in the END label,
we print both the index(i) and the value(a[i]) which is nothing but the sum.
6. To find the sum of all entries in the second column and add it as the last record.
$ awk -F","
'{x+=$2;print}END{print "Total,"x}' file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
Total,3000
This is the same as the first example except that
along with adding the value every time, every record is also printed, and at
the end, the "Total" record is also printed.
7. To print the maximum or the biggest record of every group:
$ awk -F, '{if (a[$1] <
$2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' OFS=, file
Item1,600
Item2,800
Item3,900
Before storing
the value($2) in the array, the current second column value is compared
with the existing value and stored only if the value in the current record is
bigger. And finally, the array will contain only the maximum values against
every group. Finding the smallest element of every group is not quite as simple
as changing the "lesser than(<)" symbol to greater than(>): an uninitialized
array element compares numerically as 0, so with positive prices the reversed
condition would never be true for a group's first record. A first-record check
is needed as well, as shown in the sketch below.
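A sketch of the minimum-per-group version with that first-record check in place (the 'in' operator tests whether an index exists without creating it):
$ awk -F, '{if (!($1 in a) || a[$1] > $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' OFS=, file
Item1,200
Item2,500
Item3,900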
The syntax for if in awk is similar to the C language syntax:
if (condition)
{
<code for true condition >
}else{
<code for false condition>
}
8. To find the count of entries against every group:
$ awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' file
Item1 2
Item2 2
Item3 1
a[$1]++ can be
put as a[$1]=a[$1]+1. When the first "Item1" record is parsed,
a["Item1"]=1, and every time an "Item1" record is encountered,
this count is incremented; the same follows for the other entries as well. This
code simply increments the count by 1 for the respective index on encountering
a record. And finally, on printing the array, we get the item entries and their
respective counts.
9. To print only the first record of every group:
$ awk -F, '!a[$1]++' file
Item1,200
Item2,500
Item3,900
This one is a little
tricky. In this awk command, there is only a condition, no action
statement. As a result, if the condition is true, the current record gets
printed by default.
!a[$1]++ : When the first
record of a group is encountered, a[$1] is still 0 since ++ is post-fix, and
not(!) of 0 is 1, which is true, and hence the first record gets printed. Now,
when the second record of "Item1" is parsed, a[$1] is 1 (it will
become 2 after the command since ++ is post-fix). Not(!) of 1 is 0, which is
false, and the record does not get printed. In this way, the first record of
every group gets printed.
Simply by removing the '!'
operator, the above command will print all records other than the first record
of each group.
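For instance, on the same file:
$ awk -F, 'a[$1]++' file
Item2,800
Item1,600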
10. To join or concatenate the values of all group items. Join the values of the second column with a colon separator:
$ awk -F, '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
This if
condition is pretty simple: If there is some value in a[$1], then append or
concatenate the current value using a colon delimiter, else just assign it to
a[$1] since this is the first value.
To make the above if block clear,
let me put it this way: "if (a[$1])" means "if
a[$1] has some value".
if(a[$1])
a[$1]=a[$1]":"$2;
else
a[$1]=$2
The same can be achieved using the awk ternary
operator as well, which is the same as in the C language.
$ awk -F, '{a[$1]=a[$1]?a[$1]":"$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
The ternary operator is a short form of
the if-else condition. An example of the ternary operator: x=x>10?"Yes":"No"
means if x is greater than 10, assign "Yes" to x, else assign
"No".
In the same way: a[$1]=a[$1]?a[$1]":"$2:$2
means if a[$1] has some value assign a[$1]":"$2 to a[$1] ,
else simply assign $2 to a[$1].
Concatenate variables in awk:
One more thing to notice is the way string concatenation is done in awk. To concatenate 2 variables in awk, use a space in-between.
Examples:
z=x y       # to concatenate x and y
z=x":"y     # to concatenate x and y with a colon separator
awk - Examples to read files with multiple delimiters
In this article of the awk series, we will see how to use awk to
read or parse text or CSV files containing multiple delimiters or repeating
delimiters. Also, we will discuss some peculiar delimiters and how to
handle them using awk.
Let us consider a sample file. This colon separated file contains item, purchase year and a set of prices separated by a semicolon.
$ cat file
Item1:2010:10;20;30
Item2:2012:12;29;19
Item3:2014:15;50;61
1. To print the 3rd column which contains the prices:
$ awk -F: '{print $3}' file
10;20;30
12;29;19
15;50;61
This is straightforward. By specifying colon(:) as
the delimiter with the -F option, the 3rd column can be retrieved using the $3 variable.
2. To print the 2nd component of the price column alone:
$ awk -F '[:;]' '{print $4}' file
20
29
50
What did we do here?
We specified multiple delimiters, one being : and the other ; . How does awk
parse the file? It's simple. First, it looks at the delimiters, which are
colon(:) and semi-colon(;). This means, while reading the line, as and when the
delimiter : or ; is encountered, the part read so far is stored in $1. awk
continues further, and again, on encountering one of the delimiters, it stores
the part read in $2. And this continues till the end of the line is reached. In
this way, $4 contains the second part of the price column above.
Note: Always keep in
mind. While specifying multiple delimiters, it has to be specified inside
square brackets( [;:] ).
3. To sum the individual components of the 3rd column and print it:
$ awk -F '[;:]' '{$3=$3+$4+$5;print $1,$2,$3}' OFS=: file
Item1:2010:60
Item2:2012:60
Item3:2014:126
The individual
components of the price($3) column are available in $3, $4 and $5. Simply, sum
them up and store in $3, and print all the variables. OFS (output field
separator) is used to specify the delimiter while printing the output.
Note: If we do not use the OFS, awk
will print the fields using the default output delimiter which is space.
4. Un-group or re-group every record depending on the price column:
$ awk -F '[;:]' '{for(i=3;i<=5;i++){print $1,$2,$i;}}' OFS=":" file
Item1:2010:10
Item1:2010:20
Item1:2010:30
Item2:2012:12
Item2:2012:29
Item2:2012:19
Item3:2014:15
Item3:2014:50
Item3:2014:61
The requirement here is: new
records have to be created for every component of the price column. Simply, a
loop is run over columns 3 to 5, and each time a record is framed using one
price component.
5-6. Read file in which the delimiter is square brackets:
$ cat file
123;abc[202];124
125;abc[203];124
127;abc[204];124
5. To print the value present within the
brackets:
$ awk -F '[][]' '{print $2}' file
202
203
204
At first
sight, the delimiter used in the above command might be confusing. It's simple:
2 delimiters are used in this case, one is [ and the other is ].
Since the delimiters themselves are square brackets, which have to be placed
within square brackets, it looks tricky at first.
Note: If square brackets are the
delimiters, they should be put in this way only, meaning first ] followed by [.
Using a delimiter like -F '[[]]' will give a different interpretation
altogether.
6. To print the first value, the value within brackets, and the last value:
$ awk -F '[][;]' '{print $1,$3,$5}' OFS=";" file
123;202;124
125;203;124
127;204;124
3 delimiters are used in this case with
semi-colon also included.
7-8. Read or parse a file containing a series of delimiters:
$ cat file
123;;;202;;;203
124;;;213;;;203
125;;;222;;;203
The above file contains a series of 3
semi-colons between every 2 values.
7. Using the multiple delimiter method:
$ awk -F'[;;;]' '{print $2}' file
Blank output !!! The
above delimiter, though specified as 3 semi-colons, is as good as one delimiter,
which is a semi-colon(;), since they are all the same. Due to this, $2 will be
the value between the first and the second semi-colon, which in our case is
blank, and hence there is no output.
8. Using the delimiter without square brackets:
$ awk -F';;;' '{print $2}' file
202
213
222
The expected output !!! No square brackets are used, and we
got the output we wanted.
Difference between using square
brackets and not using them: when a set of delimiters is
specified using square brackets, it means an OR condition of the delimiters.
For example, -F '[;:]' means to separate the contents on
encountering either ':' or ';'. However, when a set of delimiters is specified
without using square brackets, awk looks at them literally to separate the
contents. For example, -F ':;' means to separate the contents only on
encountering a colon followed by a semi-colon. Hence, in the last example, the
file contents are separated only when a set of 3 continuous semi-colons is
encountered.
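A quick way to see the difference, using echo to supply a small sample: with the literal form ';;', the pair of semi-colons acts as one separator and b becomes $2; with the bracketed form '[;]', each single semi-colon separates, the empty string between the two semi-colons becomes $2, and b lands in $3:
$ echo "a;;b" | awk -F';;' '{print $2}'
b
$ echo "a;;b" | awk -F'[;]' '{print $3}'
b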
9. Read or parse a file containing a series of delimiters of varying lengths:
In the below file, the 1st and 2nd columns are separated using 3 semi-colons, however the 2nd and 3rd are separated by 4 semi-colons.
$ cat file
123;;;202;;;;203
124;;;213;;;;203
125;;;222;;;;203
$ awk -F';'+ '{print $2,$3}' file
202 203
213 203
222 203
The '+' is a
regular expression operator. It indicates one or more of the previous
character. ';'+ indicates one or more semi-colons, and hence both the
3-semi-colon and the 4-semi-colon runs get matched.
10. Using a word as a delimiter:
$ cat file
123Unix203
124Unix203
125Unix203
Retrieve the numbers before and after
the word "Unix" :
$ awk -F'Unix' '{print $1, $2}' file
123 203
124 203
125 203
In this case, we
use the word "Unix" as the delimiter, and hence $1 and $2 contain
the appropriate values. Keep in mind, it is not just special characters
that can be used as delimiters; alphabets and whole words can also be used as
delimiters.
P.S.: We will discuss the awk split function and how to use it in these types of multiple-delimiter files.
awk - Passing awk variables to shell
In one of our earlier articles, we discussed how to access
or pass shell variables to awk. In this, we will see how to access awk
variables in the shell, or in other words, how to access awk variables as shell variables.
Let us see the different ways in which we can achieve this.
Let us consider a file with the sample contents as below:
$ cat file
Linux
20
Solaris
30
HPUX
40
1. Access the value of the entry "Solaris" in a
shell variable, say x:
$ x=`awk '/Solaris/{a=$2;print a}' file`
$ echo $x
30
This approach is fine as long as we want to access only one
value. What if we have to access multiple values in shell?
2. Access the value of "Solaris" in x, and "Linux" in y:
$ z=`awk '{if($1=="Solaris")print "x="$2;if($1=="Linux")print "y="$2}' file`
$ echo "$z"
y=20
x=30
$ eval $z
$ echo $x
30
$ echo $y
20
awk prints the assignments "x=30" and
"y=20", and this output is collected in the shell variable
"z". The eval command evaluates the variable, meaning it executes the
commands present in the variable. As a result, "x=30" and
"y=20" get executed, and they become shell variables x and y with the
appropriate values.
3. Same using the sourcing method:
$ awk '{if($1=="Solaris")print "x="$2;if($1=="Linux")print "y="$2}' file > f1
$ source f1
$ echo $x
30
$ echo $y
20
Here, instead of collecting the output of the awk command in a
variable, it is re-directed to a temporary file. The file is then sourced, or in
other words, executed in the same shell. As a result, "x" and
"y" become shell variables.
Note: Depending on the shell being used, the appropriate way of sourcing has to be done. The "source" command is used here since the default shell is bash.