Understanding How To Use the AWK Command

Posted on by Matthew Stevens

AWK stands for “Aho, Weinberger, Kernighan,” the last names of the people who invented it: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK scans its input for lines that match given patterns and performs actions on them. It is a full scripting language as well as a complete text manipulation toolkit. It is data-driven, meaning you define a set of actions to be performed on the provided text, and it sends the results to standard output.

With AWK, we can:

  • Scan a file line by line.
  • Split each input line into fields.
  • Compare input lines or fields to patterns.
  • Perform actions on matched lines.

Regular expression patterns are enclosed in slashes (//), actions are enclosed in braces ({}), and the entire AWK program is enclosed in single quotes ('). The default field delimiter for the awk command is any run of whitespace, such as spaces or tabs. If there is no pattern in the awk command, every line of the provided input is matched.
Let’s see the contents of the current folder with the ls -l command.

[mstevens@host public_html]$ ls -l
total 12
-rw-rw-r--. 1 mstevens mstevens 6426 Feb  9 08:00 access_log
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 config.php
-rw-r--r--. 1 mstevens mstevens 3661 Mar 19 04:31 dovecot.log
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 error_log
-rwxrwxrwx. 1 mstevens mstevens    0 Mar 19 04:49 everyone.txt
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 index.php
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:49 list.php
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:49 login.php
-rw-rw-r--. 1 mstevens mstevens    0 Mar 24 03:14 php.ini

The output of the ls command starts with the total number of blocks (12 in this case). Each file line that follows contains nine fields (from left to right):

  1. Permissions
  2. The number of hard links
  3. User
  4. Group
  5. Size
  6. Month
  7. Day
  8. Time of the last update
  9. Filename

If, for example, we only need to print out permissions and filenames, we can pipe the ls -l command into AWK and tell it to print the first and ninth fields.
The simple AWK program below has no pattern, only an action, so it matches every line of the provided text and prints only the first and ninth fields of each line.

[mstevens@host public_html]$ ls -l | awk '{print $1,$9}'
total
-rw-rw-r--. access_log
-rw-rw-r--. config.php
-rw-r--r--. dovecot.log
-rw-rw-r--. error_log
-rwxrwxrwx. everyone.txt
-rw-rw-r--. index.php
-rw-rw-r--. list.php
-rw-rw-r--. login.php
-rw-rw-r--. php.ini

As you can see, the ls command output has 10 lines of text, including the line with the word total. On that line, total is the first field and the number 12 is the second. Only total appears in the output because the awk command requested the first and ninth fields, and that line has no ninth field. To avoid matching lines that are not needed, we can provide a pattern, and only lines that match it will be output.
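
As a minimal sketch of that idea, the built-in record counter NR (covered later in this article) can serve as the pattern that skips the first line. The listing excerpt is stored in a shell variable only so the example is self-contained; with a live directory you would pipe ls -l into awk instead.

```shell
# Two-file excerpt of the ls -l output above, kept in a variable so the
# example does not depend on the current directory.
listing='total 12
-rw-rw-r--. 1 mstevens mstevens 6426 Feb  9 08:00 access_log
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 config.php'

# NR > 1 is the pattern: the action runs on every line except the first,
# so the "total" line is never printed.
result=$(printf '%s\n' "$listing" | awk 'NR > 1 {print $1, $9}')
printf '%s\n' "$result"
```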

Pattern Matching

Patterns in AWK restrict actions to the lines that match a given pattern. The same filtering can be done with the grep command, which finds certain information in the provided text or files. The difference is that we don’t need to combine multiple commands; one awk command both selects the lines and processes them.

AWK supports different types of patterns:

  • Regular expression patterns
  • Relational expression patterns
  • Range patterns
  • Special expressions

Regular Expression Patterns

The most basic example is string matching. If we want to get only lines with the word php, we can add a pattern in the awk command between slashes (//). As shown below, no matter where the word php is located in the line, the line is matched and its file is displayed in the output.

[mstevens@host public_html]$ ls -l | awk '/php/ {print $1,$9}'
-rw-rw-r--. config.php
-rw-rw-r--. index.php
-rw-rw-r--. list.php
-rw-rw-r--. login.php
-rw-rw-r--. php.ini

Regex Syntax Characters

A regular expression is a pattern that describes a set of strings. To avoid confusion with “regular expression pattern,” which is one of the AWK pattern types, I will use the term “regex,” which is also widely used in IT.

Certain characters have special meanings when used in regex.

Anchors

Anchors do not match any character. Instead, they match a position before or after characters.

Characters

You can match characters that follow specific rules.

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.
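
The three categories above can be sketched on a few made-up names (they are illustrative and not taken from the listing earlier in the article). Here ^ and $ are anchors, [ao] is a character class, and ? is a quantifier. This is only a small sample of regex syntax, not a complete reference.

```shell
# Hypothetical sample names, one per line.
names='config.php
php.ini
list.php
cache.txt
logs'

# Anchor: ^ matches the position at the start of the line.
starts_with_php=$(printf '%s\n' "$names" | awk '/^php/ { print }')

# Character class: c at the start of the line, followed by either a or o.
class_match=$(printf '%s\n' "$names" | awk '/^c[ao]/ { print }')

# Quantifier: s? means zero or one s, so this matches "log" or "logs"
# at the end of the line.
optional_s=$(printf '%s\n' "$names" | awk '/logs?$/ { print }')

printf '%s\n' "$starts_with_php" "$class_match" "$optional_s"
```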

With this information, we can now find all PHP files. Adding the $ anchor to the regex, as in /php$/, matches only text that ends with php. The command below applies it to the ninth field, the filename.

[mstevens@host public_html]$ ls -l | awk '$9 ~ /php$/ {print $1,$9}'
-rw-rw-r--. config.php
-rw-rw-r--. index.php
-rw-rw-r--. list.php
-rw-rw-r--. login.php

In the current folder, there are only four PHP files. The file php.ini was excluded because php is not at the end of the string. 

Relational Expression Patterns

By default, regular expression patterns are matched against the whole line. Relational expression patterns instead match the content of a specified field against the provided pattern.

To match a pattern against a field, we use the match operator (~) or its negation (!~):

  • Match lines: $n ~ /pattern/
  • Not match lines: $n !~ /pattern/

The placeholder $n refers to the nth field, which is compared against the provided pattern. Now let’s use our previous example.

ls -l | awk '$9 ~ /php/ {print $1,$9}'

The expression $9 ~ /php/ matches lines whose ninth field contains the word php.

[mstevens@host public_html]$ ls -l | awk '$9 ~ /php/ {print $1,$9}'
-rw-rw-r--. config.php
-rw-rw-r--. index.php
-rw-rw-r--. list.php
-rw-rw-r--. login.php
-rw-rw-r--. php.ini

If I tried using the first field (permissions), there wouldn’t be any results, since the first field only contains characters like -rwxr-xr--. (where r, w, and x stand for read, write, and execute).

[mstevens@host public_html]$ ls -l | awk '$1 ~ /php/ {print $1,$9}'
[mstevens@host public_html]$
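
Two variations that the examples above do not show, sketched on a three-line excerpt of the listing: the negated operator !~, and a numeric comparison on the size field ($5). The excerpt is stored in a variable only to keep the example self-contained, and the 4000-byte threshold is an arbitrary value chosen for illustration.

```shell
listing='-rw-rw-r--. 1 mstevens mstevens 6426 Feb  9 08:00 access_log
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 config.php
-rw-r--r--. 1 mstevens mstevens 3661 Mar 19 04:31 dovecot.log'

# !~ keeps the lines whose ninth field does NOT contain "php".
not_php=$(printf '%s\n' "$listing" | awk '$9 !~ /php/ {print $9}')

# Relational expressions can also compare numbers:
# keep files whose size (fifth field) is larger than 4000 bytes.
large=$(printf '%s\n' "$listing" | awk '$5 > 4000 {print $9, $5}')

printf '%s\n' "$not_php" "$large"
```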

Range Patterns

Range patterns consist of two patterns separated by a comma. This allows us to print all records from the line that matches the first pattern until the second pattern is matched.

/pattern1/, /pattern2/

In this example, I want to print every line from the one that matches config up to and including the one that matches index. The command is shown below.

[mstevens@host public_html]$ ls -l | awk '/config/,/index/ { print $0 }'
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 config.php
-rw-r--r--. 1 mstevens mstevens 3661 Mar 19 04:31 dovecot.log
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 error_log
-rwxrwxrwx. 1 mstevens mstevens    0 Mar 19 04:49 everyone.txt
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 index.php
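
The two sides of a range do not have to be regexes. As a small sketch on made-up input, the range below uses the record number (NR) to print the second through fourth lines.

```shell
# Five sample lines: one, two, three, four, five.
# The range starts when NR == 2 is true and ends when NR == 4 is true.
result=$(printf '%s\n' one two three four five | awk 'NR == 2, NR == 4 { print $0 }')
printf '%s\n' "$result"
```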

A regex can also match characters that follow defined rules. Let’s say you want to find all lines where the filename contains the letter l followed by the letter o or i. Run the command below.

[mstevens@host public_html]$ ls -l | awk '$9 ~ /l[oi]/ {print $1,$9}'
-rw-rw-r--. access_log
-rw-r--r--. dovecot.log
-rw-rw-r--. error_log
-rw-rw-r--. list.php
-rw-rw-r--. login.php

As shown above, log, list, and login are the strings that match the regex used in the awk command.

Quantifiers can be used if there is a certain character repeating in the provided text. I have created a file with the following content.

[mstevens@host public_html]$ cat test.txt
1. a b c d
2. d c b a
3. aa bb cc dd
4. dd cc bb aa
5. aaa bbb ccc ddd
6. ddd ccc bbb aaa

To find all lines that contain three a characters (aaa) and have at least one subsequent c character, I would use the following command.

awk '/a{3}.*c/ {print $0}' test.txt

The output indicates one line contains aaa with at least one character c following afterward.

[mstevens@host public_html]$ awk '/a{3}.*c/ {print $0}' test.txt
5. aaa bbb ccc ddd

Special Expressions

Variables within AWK can be set at any line in the program. AWK includes the following special patterns:

  • BEGIN - Carries out its corresponding action before the first record is read and is generally used to define variables for the entire program.
  • END - Performs its action after the last record is read from the input file.
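
A minimal sketch of both special patterns working together: BEGIN prints a header before any input is read, the main action adds up the size field, and END prints the sum. The two-line excerpt from the earlier listing is stored in a variable so the example is self-contained, and the variable name total is my own choice.

```shell
listing='-rw-rw-r--. 1 mstevens mstevens 6426 Feb  9 08:00 access_log
-rw-r--r--. 1 mstevens mstevens 3661 Mar 19 04:31 dovecot.log'

result=$(printf '%s\n' "$listing" | awk '
  BEGIN { print "Size Filename" }       # runs once, before the first record
        { print $5, $9; total += $5 }   # runs for every record
  END   { print "Total:", total }       # runs once, after the last record
')
printf '%s\n' "$result"
```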

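AWK has several built-in variables that allow you to control how the program is processed. Some of the most common built-in variables are:

  • FILENAME - The name of the current input file.
  • FS - The field separator used to split each input line (whitespace by default).
  • NF - The number of fields in the current line.
  • NR - The number of lines read so far.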

Now let’s use NR in our command to check the number of lines in test.txt. As we see below, there are six lines within the file.

[mstevens@host public_html]$ awk 'END { print FILENAME, "contains", NR, "lines." }' test.txt
test.txt contains 6 lines.
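
NR is also available while lines are being read, not only in END. The sketch below prints NR next to NF, the built-in variable that holds the number of fields on the current line. It recreates the first three lines of test.txt in a temporary file so it can run anywhere.

```shell
# Recreate the first three lines of test.txt in a temporary file.
sample=$(mktemp)
printf '%s\n' '1. a b c d' '2. d c b a' '3. aa bb cc dd' > "$sample"

# NR is the current line number; NF is the number of fields on that line.
result=$(awk '{ print "line", NR, "has", NF, "fields" }' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```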

Changing the Separator

The separator is any character that divides lines of text into fields. The default field separator is any number of whitespace characters like space or tab, but you can change the separator with the FS variable or -F flag in the awk command.

Using the FS Variable

First, we will show how to use the FS variable. Below we have the current lines in test.txt with the fields separated by white spaces.

[mstevens@host public_html]$ cat test.txt
1. a b c d
2. d c b a
3. aa bb cc dd
4. dd cc bb aa
5. aaa bbb ccc ddd
6. ddd ccc bbb aaa

For easier readability, the image below shows the information above, with the white spaces highlighted in green.

[Image: fs-variable-default-white-spaces]

Now, I will separate the fields by the c character and print the first field. This means the existing white spaces no longer separate the fields and are treated as regular characters. Everything before the first c on a line becomes the first field and is printed. The rest of the line falls into subsequent fields and is not included in the output.

[mstevens@host public_html]$ awk 'BEGIN { FS = "c" } { print $1 }' test.txt
1. a b
2. d
3. aa bb
4. dd
5. aaa bbb
6. ddd

The image below shows the output from above together with the separator (c).

[Image: fs-variable-c-separator]

Because the separator creates an additional field, the number of c’s on a line will increase the number of fields present. Two fields are present on lines 1 and 2, three fields are on lines 3 and 4, and four fields are present on lines 5 and 6. We can better see this in the image below. The area between each green separator represents an additional field.

[Image: fs-variable-c-separator-all-fields]
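
The field counts described above can be checked with the built-in NF variable, which holds the number of fields on the current line. This sketch recreates test.txt in a temporary file and prints each line number followed by its field count when c is the separator.

```shell
# Recreate test.txt in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1. a b c d
2. d c b a
3. aa bb cc dd
4. dd cc bb aa
5. aaa bbb ccc ddd
6. ddd ccc bbb aaa
EOF

# With FS set to "c", every c on a line starts a new field,
# so a line with n c characters has n + 1 fields.
result=$(awk 'BEGIN { FS = "c" } { print NR, NF }' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```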

Using the -F Flag

Now we will change the separator in an awk command using the -F flag and work through another example.

awk -F'c' '{ print $1 }' test.txt
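
As a quick check, the -F flag and the FS variable should give the same result. The sketch below runs both forms on one line from test.txt and stores each result in a variable so they can be compared.

```shell
line='3. aa bb cc dd'

# Separator set on the command line with -F.
with_flag=$(printf '%s\n' "$line" | awk -F'c' '{ print $1 }')

# Separator set inside the program with FS in a BEGIN block.
with_fs=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "c" } { print $1 }')

# Brackets make the trailing space in the first field visible.
printf '[%s]\n' "$with_flag" "$with_fs"
```

Both variables hold the text before the first c, including the space that precedes it.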

For the next example, here again are the folder contents from earlier in the article.

[mstevens@host public_html]$ ls -l
total 12
-rw-rw-r--. 1 mstevens mstevens 6426 Feb  9 08:00 access_log
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 config.php
-rw-r--r--. 1 mstevens mstevens 3661 Mar 19 04:31 dovecot.log
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 error_log
-rwxrwxrwx. 1 mstevens mstevens    0 Mar 19 04:49 everyone.txt
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:48 index.php
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:49 list.php
-rw-rw-r--. 1 mstevens mstevens    0 Mar 19 04:49 login.php
-rw-rw-r--. 1 mstevens mstevens    0 Mar 24 03:14 php.ini

Using a few records from dovecot.log and the awk command, we can determine whether someone is trying to access the email accounts. The log contains examples of failed and successful connections.

It’s not pretty, but we can break the connection output into smaller pieces. The most important values in these logs to focus on are:

  • imap-login - Indicates that someone tried to log into an email account.
  • user= - Shows which email account that person is trying to access.
  • rip= - The IP that is trying to connect.

The following command will output all of the IPs that failed to connect to an email account.

[mstevens@host public_html]$ awk -F'rip=' '/imap-login/&&/failed/ {print $1, $2}' dovecot.log | awk -F'user=' '{print $2}' | awk -F, '{print $3,$1}'
  127.0.0.1 <mstevens@liquidweb.com>
  127.0.0.1 <mstevens@liquidweb.com>
  127.0.0.1 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
  50.50.50.50 <mstevens@liquidweb.com>
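
A long list like this is easier to read as a count per IP. The sketch below is one possible way to do that with an AWK array. It takes a shortened copy of the output above as its input (two and three lines instead of three and nine), and the array name count is my own choice.

```shell
# Shortened copy of the output above, stored in a variable.
attempts='127.0.0.1 <mstevens@liquidweb.com>
127.0.0.1 <mstevens@liquidweb.com>
50.50.50.50 <mstevens@liquidweb.com>
50.50.50.50 <mstevens@liquidweb.com>
50.50.50.50 <mstevens@liquidweb.com>'

# count[$1]++ tallies how often each IP (first field) appears.
# The END block prints every tally; sort -rn puts the highest count first,
# because the order of a "for in" loop over an array is not defined.
result=$(printf '%s\n' "$attempts" | awk '
  { count[$1]++ }
  END { for (ip in count) print count[ip], ip }
' | sort -rn)
printf '%s\n' "$result"
```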

If you see suspicious activity, someone could be attempting a brute force attack on your server. Update your password(s) as soon as possible and take steps to prevent attacks in the future, like implementing two-factor authentication (2FA) and enabling CAPTCHA.

Using AWK With sub() and gsub()

AWK features several functions that perform find-and-replace actions, much like the sed command. The sub function substitutes the first match in a record, or in a given field, with a provided string. I’m going to show this on the test.txt file.

The part of the command that reads sub(/a/, "X", $2); will substitute the first letter a with a letter X in the second field. Only the first, third, and fifth lines will be affected, since those lines contain the letter a in the second field.

[mstevens@host public_html]$ awk '{sub(/a/, "X", $2); print $0}' test.txt
1. X b c d
2. d c b a
3. Xa bb cc dd
4. dd cc bb aa
5. Xaa bbb ccc ddd
6. ddd ccc bbb aaa

While this change will only be shown in the terminal and won’t change the file, we can redirect the output to a different file to save the changes. The sub function is useful when we need to replace certain information within a file, like a site URL in SQL files, while still preserving the original SQL file.
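
A minimal sketch of that workflow, using hypothetical file names (original.txt, updated.txt) and made-up URLs in a temporary directory. When sub is called without a third argument, it works on the whole line.

```shell
# Work in a temporary directory with a small made-up input file.
workdir=$(mktemp -d)
printf '%s\n' 'url=http://old.example.com' 'name=demo' > "$workdir/original.txt"

# Replace the first match on each line and write the result to a NEW file.
# The dots are escaped so they match literal dots.
awk '{ sub(/old\.example\.com/, "new.example.com"); print }' \
    "$workdir/original.txt" > "$workdir/updated.txt"

# The original file is untouched; only updated.txt has the new URL.
original=$(cat "$workdir/original.txt")
updated=$(cat "$workdir/updated.txt")
printf '%s\n' "$updated"

rm -rf "$workdir"
```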

The second function is gsub. It has the same syntax, and the only difference is that it replaces every match found in the provided field, not just the first one. Again, the first, third, and fifth lines are affected, but instead of only the first a character changing to X, all a characters in the second field are changed to X.

[mstevens@host public_html]$ awk '{gsub(/a/, "X", $2); print $0}' test.txt
1. X b c d
2. d c b a
3. XX bb cc dd
4. dd cc bb aa
5. XXX bbb ccc ddd
6. ddd ccc bbb aaa
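
Two more details about gsub, sketched on two lines from test.txt: when the third argument is left out, it works on the whole line, and it returns the number of replacements it made. The variable name n is my own choice.

```shell
# gsub without a third argument replaces every "a" on the whole line.
# Its return value (stored in n) is the number of replacements.
result=$(printf '%s\n' '3. aa bb cc dd' '4. dd cc bb aa' \
  | awk '{ n = gsub(/a/, "X"); print n, $0 }')
printf '%s\n' "$result"
```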

Conclusion

AWK is a powerful tool that can stand in for commands like grep and sed when you need to find patterns within files. Depending on what is needed, the patterns shown here can be adapted to output the desired information. Test out the commands mentioned in this article on your own server and see what patterns you can find!


About the Author: Matthew Stevens

I'm a system administrator, developer, and I'm constantly improving and learning new skills. In my spare time I keep myself in shape with dancing... breakdancing!
