awk command examples in unix

awk - Read a file and split the contents

awk is one of the most powerful utilities in the unix world. When it comes to text parsing, sed and awk do some unbelievable things. In this first article on awk, we will see its basic usage.

The syntax of awk is:

awk 'pattern{action}' file

   where pattern is the condition that decides whether the action is executed for a line. If the pattern is not present, the action is executed for every line of the file. If the action is not present, the default action of printing the line is performed. Let us see some examples:

Assume a file, say file1, with the following content:

$ cat file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
  This file has 2 fields: the first field is the name of a person, and the second is their domain of expertise; the first line is the header record.

1. To print only the names present in the file:

$ awk '{print $1}' file1
Name
Deepak
Neha
Vijay
Guru
   The above awk command does not have any pattern or condition, hence the action is executed on every line of the file. The action statement reads "print $1". awk, while reading a line, splits it into columns which are accessible as $1, $2, $3 and so on. Hence the above command prints all the names, which happen to be the first column of the file.

2.  Similarly, to print the second column of the file:

$ awk '{print $2}' file1
Domain
Banking
Telecom
Finance
Migration
3. In the first example, the list of names got printed along with the header record. How to omit the header record and get only the names printed?

$ awk 'NR!=1{print $1}' file1
Deepak
Neha
Vijay
Guru
    The above awk command uses the special variable NR, which holds the line number of the current record, starting at 1. The condition 'NR!=1' indicates not to execute the action for the first line of the file, and hence the header record gets skipped.
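
To get a feel for NR, print it along with every line:

$ awk '{print NR, $0}' file1
1 Name Domain
2 Deepak Banking
3 Neha Telecom
4 Vijay Finance
5 Guru Migration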

4. How do we print the entire file contents?

$ awk '{print $0}' file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
  $0 stands for the entire line. And hence when we do "print $0", the whole line gets printed.

5. How do we get the entire file content printed in another way?

$ awk '1' file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
   The above awk command has only the pattern or condition part and no action part. The '1' in the pattern evaluates to "true" for every line. As said above, a missing action part means the default action of printing the line, and hence the entire file contents get printed.

Let us now consider a file with a delimiter. The delimiter used here is a comma; a comma-separated file is called a CSV file. Assume the file contents to be:

$ cat file1
Name,Domain,Expertise
Deepak,Banking,MQ Series
Neha,Telecom,Power Builder
Vijay,Finance,CRM Expert
Guru,Migration,Unix
  This file contains 3 fields; the new third field is the expertise of the respective person.

6. Let us try to print the first column of this csv file using the same method as mentioned in Point 1.

$ awk '{print $1}' file1
Name,Domain,Expertise
Deepak,Banking,MQ
Neha,Telecom,Power
Vijay,Finance,CRM
Guru,Migration,Unix
   The output looks weird, doesn't it? We expected only the first column to get printed, but it printed a little more, and not consistently either. If you notice carefully, every line got printed only till the first space was encountered. awk, by default, uses whitespace as the delimiter, which could be a single space, a tab or a series of spaces. Hence our file got split into fields on spaces, not commas.

  Since our requirement now involves dealing with a file which is comma separated, we need to specify the delimiter.

$ awk -F","  '{print $1}' file1
Name
Deepak
Neha
Vijay
Guru
  awk has a command line option "-F" with which we can specify the delimiter. Once the delimiter is specified, awk splits the file on the basis of that delimiter, and hence we get the names by printing the first column $1.

7. awk has a special variable called "FS", which stands for field separator. In place of the command line option "-F", we can also use "FS".

$ awk  '{print $1,$3}' FS="," file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix
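
  FS can also be set in a BEGIN block, which awk executes before reading any input; this is the more common idiom for the same command:

$ awk 'BEGIN{FS=","}{print $1,$3}' file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix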
8. Similarly, to print the second column:

$ awk -F, '{print $2}' file1
Domain
Banking
Telecom
Finance
Migration

9. To print the first and third columns, i.e., the name and the expertise:

$ awk -F"," '{print $1, $3}' file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix
10. The output shown above is not easily readable since the third column has more than one word. It would be better if the displayed fields were separated by a delimiter. Let us use a comma to separate the output, and also discard the header record.

$ awk -F"," 'NR!=1{print $1,$3}' OFS="," file1
Deepak,MQ Series
Neha,Power Builder
Vijay,CRM Expert
Guru,Unix
  OFS is another awk special variable. Just as FS is used to separate the input fields, OFS (output field separator) is used to separate the output fields.
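
  Both separators can also be set together in a BEGIN block, giving the same output as above:

$ awk 'BEGIN{FS=OFS=","} NR!=1{print $1,$3}' file1
Deepak,MQ Series
Neha,Power Builder
Vijay,CRM Expert
Guru,Unix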

awk - Passing arguments or shell variables to awk

In one of our earlier articles, we saw how to read a file in awk. At times, we might have some requirements wherein we need to pass some arguments to the awk program or to access a shell variable or an environment variable inside awk. Let us see in this article how to pass and access arguments in awk:

Let us take a sample file with contents, and a variable "x":

$ cat file1
24
12
34
45
$ echo $x
3
Now, say we want to add the shell variable x to every value.


1. awk provides a "-v" option to pass arguments. Using this, we can pass the shell variable to it.

$ awk -v val=$x '{print $0+val}' file1
27
15
37
48
  As seen above, the shell variable $x is assigned to the awk variable "val". This variable "val" can directly be accessed in awk.
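
  Multiple variables can be passed by repeating -v. For illustration, assume a hypothetical second value n=10 that is also to be added:

$ awk -v val=$x -v n=10 '{print $0+val+n}' file1
37
25
47
58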

2. awk provides another way of passing arguments without using -v. Just before specifying the file name to awk, provide the shell variable assignments to awk variables as shown below:

$ awk  '{print $0,val}' OFS=, val=$x file1
24,3
12,3
34,3
45,3
3. How to access environment variables in awk? Unlike shell variables, environment variables need not be passed as above: awk has a special associative array, ENVIRON, for this. Note that the variable has to be exported to appear in ENVIRON.

$ echo $x
3
$ export x
$ awk  '{print $0,ENVIRON["x"]}' OFS=,  file1
24,3
12,3
34,3
45,3

Quoting file content:

  Sometimes we might have a requirement wherein we have to quote the file contents. Assume you have a file which contains a list of database tables, and for your requirement, you need to quote the file contents:

$ cat file
CUSTOMER
BILL
ACCOUNT

4. To single-quote the contents, pass a variable to awk which contains the single quote, and print the quote, the line, then the quote:

$ awk -v q="'" '{print q $0 q}' file
'CUSTOMER'
'BILL'
'ACCOUNT'
5. Similarly, to double quote the contents, pass the variable within single quotes:

$ awk '{print q $0 q}' q='"' file
"CUSTOMER"
"BILL"
"ACCOUNT"

awk - Match a pattern in a file in Linux

In one of our earlier articles of the awk series, we had seen the basic usage of awk or gawk. In this, we will see how to search for a pattern in a file using awk, either in the entire line or in a specific column.

  Let us consider a csv file with the following contents. The data in the csv file is a kind of expense report. Let us see how to use awk to filter data from the file.

$ cat file
Medicine,200
Grocery,500
Rent,900
Grocery,800
Medicine,600
1. To print only the records containing Rent:

$ awk '$0 ~ /Rent/{print}' file
Rent,900
     ~ is the symbol used for pattern matching. The / / symbols are used to specify the pattern. The above line indicates: if the line ($0) contains (~) the pattern Rent, print the line. The 'print' statement by default prints the entire line. This is effectively a simulation of the grep command using awk.

2. awk, while doing pattern matching, by default matches against the entire line, and hence $0 can be left off as shown below:

$ awk '/Rent/{print}' file
Rent,900
3. Since awk prints the line by default on a true condition, print statement can also be left off.

$ awk '/Rent/' file
Rent,900
 In this example, whenever the line contains Rent, the condition becomes true and the line gets printed.

4. In the above examples, the pattern matching is done on the entire line; however, the pattern we are looking for is only in the first column. This might lead to incorrect results if the file contains the word Rent in other places. To match a pattern only in the first column ($1):

$ awk -F, '$1 ~ /Rent/' file
Rent,900
      The -F option in awk is used to specify the delimiter. It is needed here since we are going to work on the specific columns which can be retrieved only when the delimiter is known.

5. The above pattern match will also match if the first column contains "Rents". To match exactly for the word "Rent" in the first column:

$ awk -F, '$1=="Rent"' file
Rent,900
6. To print only the 2nd column for all "Medicine" records:

$ awk -F, '$1 == "Medicine"{print $2}' file
200
600
7. To match for patterns "Rent" or "Medicine" in the file:

$ awk '/Rent|Medicine/' file
Medicine,200
Rent,900
Medicine,600
8. Similarly, to match for this above pattern only in the first column:

$ awk -F, '$1 ~ /Rent|Medicine/' file
Medicine,200
Rent,900
Medicine,600
9. What if the first column contains the word "Medicines"? The above example will match it as well. In order to match exactly Rent or Medicine only:

$ awk -F, '$1 ~ /^Rent$|^Medicine$/' file
Medicine,200
Rent,900
Medicine,600
    The ^ symbol anchors the match to the beginning and the $ to the end of the string being matched, which here is the first field. ^Rent$ hence matches exactly the word Rent in the first column, and the same goes for Medicine.
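
  If the case of the data is not fixed, one approach is to normalize the field with tolower() before matching; for example, to match Rent, RENT or rent in the first column:

$ awk -F, 'tolower($1) ~ /^rent$/' file
Rent,900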

10. To print the lines which do not contain the pattern Medicine:

$ awk '!/Medicine/' file
Grocery,500
Rent,900
Grocery,800
    The ! is used to negate the pattern search.

11. To negate the pattern on the first column alone:

$ awk -F, '$1 !~ /Medicine/' file
Grocery,500
Rent,900
Grocery,800
12. To print all records whose amount is greater than 500:

$ awk -F, '$2>500' file
Rent,900
Grocery,800
Medicine,600
13. To print the Medicine record only if it is the 1st record:

$ awk 'NR==1 && /Medicine/' file
Medicine,200
    This is how the logical AND (&&) condition is used in awk. The record is retrieved only if it is the first record (NR==1) and it is a Medicine record.

14. To print all those Medicine records whose amount is greater than 500:

$ awk -F, '/Medicine/ && $2>500' file
Medicine,600
15. To print all the Medicine records and also those records whose amount is greater than 600:

$ awk -F, '/Medicine/ || $2>600' file
Medicine,200
Rent,900
Grocery,800
Medicine,600
    This is how the logical OR(||) condition is used in awk.
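
Conditions can also be grouped with parentheses when combining && and ||. For example, to print the Medicine or Grocery records whose amount is greater than 500:

$ awk -F, '($1=="Medicine" || $1=="Grocery") && $2>500' file
Grocery,800
Medicine,600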

awk - Split a file into multiple files
 In this article of the awk series, we will see the different scenarios in which we need to split a file into multiple files using awk. A file can be split into multiple files based on a condition or a pattern, or simply because it is big and needs to be split into smaller files.

Sample File1:
Let us consider a sample file with the following contents:

$ cat file1
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. Split the file into 3 different files, one for each item, i.e., all records pertaining to Item1 into a file, records of Item2 into another, etc.

$ awk -F, '{print > $1}' file1
       The files generated by the above command are as  below:

$ cat Item1
Item1,200
Item1,600
$ cat Item3
Item3,900
$ cat Item2
Item2,500
Item2,800
   This looks so simple, right? print prints the entire line, and the line is printed to a file whose name is $1, which is the first field. This means, the first record will get written to a file named 'Item1', and the second record to 'Item2', third to 'Item3',  4th goes to 'Item2', and so on.
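
   One caveat worth knowing: awk keeps every file opened with "print >" open till the program ends, and some awk implementations limit the number of simultaneously open files. If the first column can take many distinct values, a safer sketch is to append with ">>" and close each file after writing, after removing any output files left over from a previous run:

$ rm -f Item1 Item2 Item3
$ awk -F, '{print >> $1; close($1)}' file1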

2. Split the file as before, giving the new file names a .txt extension:

$ awk -F, '{print > $1".txt"}' file1
    The only change here from the above is concatenating the string ".txt" to $1, the first field. As a result, the file names get the extension. The files created are as below:

$ ls *.txt
Item2.txt  Item1.txt  Item3.txt
3. Split the file so that only the value (the second field) goes into the individual files, i.e., only the 2nd field in the new files without the 1st field:

$ awk -F, '{print $2 > $1".txt"}' file1
     By default, print outputs the entire record. Since we want only the second field to go to the output files, we do: print $2.

$ cat Item1.txt
200
600
4. Split the files so that all the items whose value is greater than 500 are in the file "500G.txt", and the rest in the  file "500L.txt".

$ awk -F, '{if($2<=500)print > "500L.txt";else print > "500G.txt"}' file1
        The output files created will be as below:

$ cat 500L.txt
Item1,200
Item2,500
$ cat 500G.txt
Item3,900
Item2,800
Item1,600
     Check the second field ($2). If it is less than or equal to 500, the record goes to "500L.txt", else to "500G.txt".
  Another way to achieve the same thing is the ternary operator in awk:

$ awk -F, '{x=($2<=500)?"500L.txt":"500G.txt"; print > x}' file1
      The condition is checked and the appropriate file name is assigned to the variable x. The record is then written to the file named in x.

Sample File2:
Let us consider another file with a different set of contents. This file has a pattern 'START' at frequent intervals.

$ cat file2
START
Unix
Linux
START
Solaris
Aix
SCO
5. Split the file into multiple files at every occurrence of the pattern START.

$ awk '/START/{x="F"++i;}{print > x;}' file2
     This command contains 2 sets of curly braces: The control goes to the first set of braces only on encountering a line containing the pattern START. The second set will be encountered by every line since there is no condition and hence always true.
   On encountering the pattern START, a new file name is created and stored. When the first START comes, x will contain "F1", the control goes to the next set of braces, and the record is written to F1; the subsequent records also go to "F1" till the next START comes. On encountering the next START, x will contain "F2" and the subsequent lines go to "F2" till the next START, and so it continues.
$ cat F1
START
Unix
Linux
Solaris
$ cat F2
START
Aix
SCO
6. Split the file into multiple files at every occurrence of the pattern START. But the line containing the pattern should not be in the new files.

$ awk '/START/{x="F"++i;next}{print > x;}' file2
       The only difference from the above is the inclusion of the next command. Due to it, on the lines containing START, awk executes the first set of curly braces and then immediately starts reading the next line. As a result, the START lines never reach the second set of curly braces, and hence START does not appear in the split files.
$ cat F1
Unix
Linux
$ cat F2
Solaris
Aix
SCO
7. Split the file by inserting a header record in every new file.

$ awk '/START/{x="F"++i;print "ANY HEADER" > x;next}{print > x;}' file2
       The change here from the earlier one is this: Before the next command, we write the header record into the file. This is the right place to write the header record since this is where the file is created first.

$ cat F1
ANY HEADER
Unix
Linux
$ cat F2
ANY HEADER
Solaris
Aix
SCO
Sample File3:
Let us consider a file with the sample contents:

$ cat file3
Unix
Linux
Solaris
AIX
SCO
8. Split the file into multiple files at every 3rd line, i.e., the first 3 lines into F1, the next 3 lines into F2 and so on.

$ awk 'NR%3==1{x="F"++i;}{print > x}'  file3
      In other words, this is nothing but splitting the file into equal parts. The condition does the trick here: NR%3==1. NR is the line number of the current record; NR%3 will be equal to 1 for every 3rd line, i.e., the 1st, 4th, 7th and so on. At every such line, the file name stored in the variable x is changed, and hence the records are written to the appropriate files.
$ cat F1
Unix
Linux
Solaris
$ cat F2
AIX
SCO
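
  The chunk size can also be made a parameter by passing it with -v; here is the same split with a hypothetical chunk size of 2 lines per file:

$ awk -v n=2 'NR%n==1{x="F"++i;}{print > x}' file3
$ cat F3
SCO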
Sample File4:
Let us update the above file with a header and trailer:

$ cat file4
HEADER
Unix
Linux
Solaris
AIX
SCO
TRAILER
9. Split the file at every 3rd line without the header and trailer in the new files.

$ sed '1d;$d' file4 | awk 'NR%3==1{x="F"++i;}{print > x}'
         The earlier awk command does the work for us; the only thing needed is to feed it the file without the header and trailer, and sed does that for us: '1d' deletes the 1st line, '$d' deletes the last line.
$ cat F1
Unix
Linux
Solaris
$ cat F2
AIX
SCO
10.  Split the file at every 3rd line, retaining the header and trailer in every file.

$ awk 'BEGIN{getline f;}NR%3==2{x="F"++i;a[i]=x;print f>x;}{print > x}END{for(j=1;j<i;j++)print> a[j];}' file4
     This one is a little tricky. Before the file is processed, the first line is read using getline into the variable f. NR%3 is compared with 2 instead of 1 as in the earlier case because the first line is a header, so the files have to be split at the 2nd, 5th, 8th lines, and so on. All the file names are stored in the array "a" for later processing.
    Without the END label, all the files will have the header record, but only the last file will have the trailer record. So, the END label is to precisely write the trailer record to all the files other than the last file.
$ cat F1
HEADER
Unix
Linux
Solaris
TRAILER
$ cat F2
HEADER
AIX
SCO
TRAILER
awk - Join or merge lines on finding a pattern
 In one of our earlier articles, we discussed joining all lines in a file and also joining every 2 lines in a file. In this article, we will see how to join lines based on a pattern, or on encountering a pattern, using awk or gawk.

Let us assume a file with the following contents. There is a line with START in-between. We have to join all the lines following the pattern START.

$ cat file
START
Unix
Linux
START
Solaris
Aix
SCO
1. Join the lines following the pattern START without any delimiter.

$ awk '/START/{if (NR!=1)print "";next}{printf $0}END{print "";}' file
UnixLinux
SolarisAixSCO
    Basically, what we are trying to do is: accumulate the lines following a START and print them on encountering the next START. /START/ matches the lines containing the pattern START; the commands within the first {} work only on those lines. A blank line is printed if the line is not the first line (NR!=1), terminating the accumulated output. Without this condition, a blank line would come at the very beginning of the output since the file begins with a START.

   The next command prevents the remaining part of the script from being executed for the START lines. The second set of braces {} hence works only on the lines not containing START. This part simply prints the line without a terminating newline character (printf), and as a result, all the lines following a START end up on the same line. The END label prints a final newline, without which the prompt would appear at the end of the last line of output itself.

2. Join the lines following the pattern START with space as delimiter.

$ awk '/START/{if (NR!=1)print "";next}{printf "%s ",$0}END{print "";}' file
Unix Linux
Solaris Aix SCO
    This is the same as the earlier one, except that it uses the format specifier %s followed by a space, which becomes the delimiter in this case.

3. Join the lines following the pattern START with comma as delimiter.

$ awk '/START/{if (x)print x;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
Unix,Linux
Solaris,Aix,SCO
    Here, we form a complete line and store it in a variable x and print the variable x whenever a new pattern starts. The command: x=(!x)?$0:x","$0 is like the ternary operator in C or Perl. It means if x is empty, assign the current line($0) to x, else append a comma and the current line to x. As a result, x will contain the lines joined with a comma following the START pattern. And in the END label, x is printed since for the last group there will not be a START pattern to print the earlier group.

4. Join the lines following the pattern START with comma as the delimiter, this time including the pattern-matching line itself.

$ awk '/START/{if (x)print x;x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
START,Unix,Linux
START,Solaris,Aix,SCO
     The difference here is the missing next statement. Because next is not there, the commands present in the second set of curly braces are applicable for the START line as well, and hence it also gets concatenated.

5. Join the lines following the pattern START with comma as the delimiter, printing the pattern line as well; however, the pattern line itself should not be joined.

$ awk '/START/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
START
Unix,Linux
START
Solaris,Aix,SCO
In this, instead of making START part of the variable x, the START line is printed directly. As a result, the START line comes out separately, and the remaining lines get joined.



awk - 10 examples to group data in a CSV or text file

awk is very powerful when it comes to file formatting. In this article, we will discuss some wonderful grouping features of awk. awk can group data based on a column or field, or on a set of columns. It uses the powerful associative array for grouping. If you are new to awk, this article will be easier to understand if you first go over the earlier article on how to parse a simple CSV file using awk.

Let us take a sample CSV file with the below contents. The file is a kind of expense report containing items and their prices. As seen, some expense items have multiple entries.
$ cat file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. To find the total of all numbers in the second column, i.e., the sum of all the prices:

$ awk -F"," '{x+=$2}END{print x}' file
3000
    The delimiter (-F) used is a comma since it is a comma-separated file. x+=$2 stands for x=x+$2. As each line is parsed, the second column ($2), which is the price, is added to the variable x. At the end, the variable x contains the sum. This is the same as the earlier awk example of finding the sum of all numbers in a file.

   If your input file is a whitespace-separated text file instead, only one change is needed: remove the -F"," part from the above command, since the default delimiter in awk is whitespace.
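
   A small variation on the same command gives the average instead of the sum, by dividing by the record count NR in the END block:

$ awk -F"," '{x+=$2}END{print x/NR}' file
600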

2. To find the sum of one particular group alone, in this case, of "Item1":

$ awk -F, '$1=="Item1"{x+=$2;}END{print x}' file
800
  This gives us the total sum of all the items pertaining to "Item1". In the earlier example, no condition was specified since we wanted awk to work on every line or record. In this case, we want awk to work on only the records whose first column($1) is equal to Item1.

3. If the data to be worked upon is present in a shell variable:

$ VAR="Item1"
$ awk -F, -v inp=$VAR '$1==inp{x+=$2;}END{print x}' file
800
   -v is used to pass the shell variable to awk, and the rest is same as the last one.

4. To find the unique values of the first column:

$ awk -F, '{a[$1];}END{for (i in a)print i;}' file
Item1
Item2
Item3
    Arrays in awk are associative, which is a very powerful feature. Associative arrays have an index and a corresponding value. Example: a["Jan"]=30 means that in the array a, "Jan" is an index with value 30. In our case here, we use only the index, without values. So the command a[$1] works like this: when the first record is processed, an index "Item1" is stored in the array a. During the second record, a new index "Item2" is stored, during the third "Item3", and so on. During the 4th record, since the "Item1" index is already there, no new index is added, and the same continues.

  Now, once the file is processed completely, the control goes to the END label where we print all the index items. The for loop in awk comes in 2 variants: the C-language kind of for loop, and the one used for associative arrays.

  for (i in a): This means for every index in the array a; the variable "i" holds the index value. In place of "i", any variable name can be used. Since there are 3 elements in the array, the loop runs 3 times, each time holding the value of one index in "i". By printing "i", we get the index values printed.

 To understand the for loop better, look at this:

for (i in a)
{
  print i;
}
Note: The order of the output in the above command may vary from system to system. Associative arrays do not store the indexes in sequence and hence the order of the output need not be the same in which it is entered.

5. To find the sum of the individual group records, i.e., to sum all records pertaining to Item1 alone, Item2 alone, and so on:

$ awk -F, '{a[$1]+=$2;}END{for(i in a)print i", "a[i];}' file
Item1, 800
Item2, 1300
Item3, 900
   a[$1]+=$2. This can be written as a[$1]=a[$1]+$2. It works like this: when the first record is processed, a["Item1"] is assigned 200 (a["Item1"]=200). On the second "Item1" record, a["Item1"] becomes 800 (200+600), and so on. In this way, every index in the array holds the sum of its group.
   And in the END label, we print both the index(i) and the value(a[i]) which is nothing but the sum.

6. To find the sum of all entries in the second column and add it as the last record:

$ awk -F"," '{x+=$2;print}END{print "Total,"x}' file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
Total,3000
   This is the same as the first example except that, along with adding the value every time, every record is also printed, and at the end, the "Total" record is printed as well.

7. To print the maximum or the biggest record of every group:

$ awk -F, '{if (a[$1] < $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' OFS=, file
Item1,600
Item2,800
Item3,900
     Before storing the value ($2) in the array, the current second-column value is compared with the existing value, and stored only if it is bigger. In the end, the array contains only the maximum value of every group. The smallest element of every group can be found in a similar way with the comparison reversed, with one caveat shown in the sketch after the if syntax below.
The syntax for if in awk is similar to the C language syntax:

if (condition)
{
  <code for true condition>
}
else
{
  <code for false condition>
}
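
As noted above, finding the minimum needs one extra check: an array element that has never been assigned compares as 0, so a plain a[$1] > $2 would never store the first value of a group of positive numbers. A sketch that seeds every group with its first value using the "in" operator:

$ awk -F, '{if (!($1 in a) || a[$1] > $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' OFS=, file
Item1,200
Item2,500
Item3,900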


8. To find the count of entries against every group:

$ awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' file
Item1 2
Item2 2
Item3 1
    a[$1]++ : This can be put as a[$1]=a[$1]+1. When the first "Item1" record is parsed, a["Item1"]=1, and every time an "Item1" record is encountered, this count is incremented; the same follows for the other entries as well. This code simply increments the count by 1 for the respective index on encountering a record. Finally, on printing the array, we get the items and their respective counts.

9. To print only the first record of every group:

$ awk -F, '!a[$1]++' file
Item1,200
Item2,500
Item3,900
    A little tricky, this one. In this awk command, there is only a condition and no action statement. As a result, if the condition is true, the current record gets printed by default.
 !a[$1]++ : When the first record of a group is encountered, a[$1] is still 0 since ++ is a post-fix operator, and not (!) of 0 is 1, which is true, and hence the first record gets printed. When the second record of "Item1" is parsed, a[$1] is 1 (it becomes 2 after the expression, since ++ is post-fix); not (!) of 1 is 0, which is false, and the record does not get printed. In this way, the first record of every group gets printed.
   Simply by removing the '!' operator, the above command will print all records other than the first record of each group.
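
For instance, a quick check against the same file:

$ awk -F, 'a[$1]++' file
Item2,800
Item1,600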

10. To join or concatenate the values of all group items. Join the values of the second column with a colon separator:

$ awk -F, '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
     This if condition is pretty simple: If there is some value in a[$1], then append or concatenate the current value using a colon delimiter, else just assign it to a[$1] since this is the first value.
To make the above if block clear, let me put it this way:  "if (a[$1])"  means "if a[$1] has some value".
if (a[$1])
  a[$1]=a[$1]":"$2;
else
  a[$1]=$2;
  The same can be achieved using the awk ternary operator as well, which is the same as in the C language.

$ awk -F, '{a[$1]=a[$1]?a[$1]":"$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
The ternary operator is a short form of the if-else condition. An example: x = x>10 ? "Yes" : "No" means if x is greater than 10, assign "Yes" to x, else assign "No".
In the same way: a[$1]=a[$1]?a[$1]":"$2:$2  means if a[$1] has some value assign a[$1]":"$2 to a[$1] , else simply assign $2 to a[$1].


Concatenate variables in awk:
One more thing to notice is the way string concatenation is done in awk. To concatenate 2 variables in awk, use a space in-between.
Examples:

z=x y    #to concatenate x and y
z=x":"y  #to concatenate x and y with a colon separator.


awk - Examples to read files with multiple delimiters
In this article of the awk series, we will see how to use awk to read or parse text or CSV files containing multiple or repeating delimiters. We will also discuss some peculiar delimiters and how to handle them using awk.

Let us consider a sample file. This colon-separated file contains an item, a purchase year, and a set of prices separated by semicolons.

$ cat file
Item1:2010:10;20;30
Item2:2012:12;29;19
Item3:2014:15;50;61

1. To print the 3rd column which contains the prices:

$ awk -F: '{print $3}' file
10;20;30
12;29;19
15;50;61
  This is straightforward. By specifying the colon (:) with the -F option, the 3rd column can be retrieved using the $3 variable.

2. To print the 2nd component of the price list alone:

$ awk -F '[:;]' '{print $4}' file
20
29
50
    What did we do here? We specified multiple delimiters: one is : and the other is ;. How does awk parse the file? It's simple: while reading a line, whenever one of the delimiters : or ; is encountered, the part read so far is stored in the next field. The first part goes into $1, the next into $2, and so on till the end of the line. In this way, $4 contains the 2nd component of the price list above.
Note: Always keep in mind that multiple delimiters have to be specified inside square brackets ( [;:] ).

3. To sum the individual components of the 3rd column and print it:

$ awk -F '[;:]' '{$3=$3+$4+$5;print $1,$2,$3}' OFS=: file
Item1:2010:60
Item2:2012:60
Item3:2014:126
      The individual components of the price($3) column are available in $3, $4 and $5. Simply, sum them up and store in $3, and print all the variables. OFS (output field separator) is used to specify the delimiter while printing the output.
Note: If we do not use the OFS, awk will print the fields using the default output delimiter which is space.

4. Un-group or re-group every record depending on the price column:

$ awk -F '[;:]' '{for(i=3;i<=5;i++){print $1,$2,$i;}}' OFS=":" file
Item1:2010:10
Item1:2010:20
Item1:2010:30
Item2:2012:12
Item2:2012:29
Item2:2012:19
Item3:2014:15
Item3:2014:50
Item3:2014:61
     The requirement here is: new records have to be created for every component of the price column. Simply, a loop is run from columns 3 to 5, and each time a record is framed using one price component.

5-6. Read a file in which the delimiters are square brackets:

$ cat file
123;abc[202];124
125;abc[203];124
127;abc[204];124
  5.  To print the value present within the brackets:

$ awk -F '[][]' '{print $2}' file
202
203
204
     At first sight, the delimiter used in the above command might be confusing, but it is simple. 2 delimiters are used in this case: one is [ and the other is ]. Since the delimiters themselves are square brackets, which have to be placed within square brackets, it looks tricky at first.

Note: If square brackets are delimiters, they should be put in this order only, meaning the closing bracket ] first, followed by the opening bracket [. Using a delimiter like -F '[[]]' will give a different interpretation altogether.

  6.  To print the first value, the value within brackets, and the last value:

$ awk -F '[][;]' '{print $1,$3,$5}' OFS=";" file
123;202;124
125;203;124
127;204;124
     3 delimiters are used in this case, with the semi-colon also included.

7-8. Read or parse a file containing a series of delimiters:

$ cat file
123;;;202;;;203
124;;;213;;;203
125;;;222;;;203
      The above file contains a series of 3 semi-colons between every 2 values.

    7. Using the multiple delimiter method:

$ awk -F'[;;;]' '{print $2}' file



    Blank output!!! The above delimiter, though written with 3 semi-colons inside the brackets, is as good as a single semi-colon (;), since a bracket expression is a set of characters and they are all the same. Due to this, $2 is the value between the first and the second semi-colon, which in our case is blank, and hence no output.

     8. Using the delimiter without square brackets:

$ awk -F';;;' '{print $2}' file
202
213
222
     The expected output!!! No square brackets are used, and we got the output we wanted.
Difference between using square brackets and not using them: When a set of delimiters is specified using square brackets, it means an OR condition of the delimiters. For example, -F '[;:]' means to separate the contents on encountering either ';' or ':'. However, when a set of delimiters is specified without square brackets, awk treats the whole string as one delimiter. For example, -F ':;' means to separate the contents only on encountering a colon followed by a semi-colon. Hence, in the last example, the file contents are separated only when a set of 3 continuous semi-colons is encountered.

9. Read or parse a file containing a series of delimiters of varying lengths:
      In the below file, the 1st and 2nd columns are separated by 3 semi-colons, while the 2nd and 3rd are separated by 4 semi-colons:

$ cat file
123;;;202;;;;203
124;;;213;;;;203
125;;;222;;;;203
$ awk -F';+' '{print $2,$3}' file
202 203
213 203
222 203
      The '+' is a regular-expression operator indicating one or more of the previous character. ';+' matches one or more semi-colons, and hence both the 3 semi-colons and the 4 semi-colons get matched.

10.  Using a word as a delimiter:

$ cat file
123Unix203
124Unix203
125Unix203
     Retrieve the numbers before and after the word "Unix":

$ awk -F'Unix' '{print $1, $2}' file
123 203
124 203
125 203
     In this case, we use the word "Unix" as the delimiter, and hence $1 and $2 contain the appropriate values. Keep in mind, it is not just special characters that can be used as delimiters; letters and whole words can be used as well.

P.S.: We will discuss the awk split() function and how to use it on these types of multiple-delimiter files.
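
As a quick preview, split() breaks a single field into an array using its own delimiter. Using the first sample file of this article (the colon- and semicolon-separated one), the first and last price of every item can be printed like this:

$ awk -F: '{n=split($3,p,";"); print $1, p[1], p[n]}' file
Item1 10 30
Item2 12 19
Item3 15 61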

awk - Passing awk variables to shell
In one of our earlier articles, we discussed how to access or pass shell variables to awk. In this, we will see how to do the reverse: access awk variables as shell variables. Let us see the different ways in which we can achieve this.

 Let us consider a file with the sample contents as below:

$ cat file
Linux 20
Solaris 30
HPUX 40

1. Access the value of the entry "Solaris" in a shell variable, say x:

$ x=`awk '/Solaris/{a=$2;print a}' file`
$ echo $x
30
This approach is fine as long as we want to access only one value. What if we have to access multiple values in shell?

 2. Access the value of "Solaris" in x, and "Linux" in y:

$ z=`awk '{if($1=="Solaris")print "x="$2;if($1=="Linux")print "y="$2}' file`
$ echo "$z"
y=20
x=30
$ eval $z
$ echo $x
30
$ echo $y
20
   awk prints the assignments "x=30" and "y=20", which are collected in the shell variable "z". The eval command evaluates the variable, meaning it executes the commands present in it. As a result, "x=30" and "y=20" get executed, and x and y become shell variables with the appropriate values.

3. The same using the sourcing method:

$ awk '{if($1=="Solaris")print "x="$2;if($1=="Linux")print "y="$2}' file > f1
$ source f1
$ echo $x
30
$ echo $y
20
Here, instead of collecting the output of the awk command in a variable, it is redirected to a temporary file. The file is then sourced, in other words executed in the same shell. As a result, "x" and "y" become shell variables.
Note: Depending on the shell being used, the appropriate way of sourcing has to be done. The "source" command is used here since the default shell is bash.
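
4. If the shell is bash, a here-string with the read built-in is another option that avoids both eval and the temporary file; a sketch of the same example:

$ read x y <<< "$(awk '$1=="Solaris"{s=$2} $1=="Linux"{l=$2} END{print s,l}' file)"
$ echo $x
30
$ echo $y
20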


