I’ve always liked the Unix software toolbox concept: a variety of focused, easily-understood, easily-composed tools that each do approximately one thing. awk is one of my favourites, so I thought I’d write about how I most often apply it, and how you can get more from it too.

In the following examples, I’ll use output from ps, a useful source of test data that is universally available on BSD, Linux and macOS systems. Things should be mostly the same on other systems, too. If you’re using Solaris, for example, there’s a BSD-like ps in /usr/ucb/ps.

example: reinventing pgrep, badly, part 1

A use case I’ve seen many times through the years has been “find all the process IDs of X”. Here’s a common pattern, in which we use grep to filter and another grep to avoid showing the first grep in the results:

$ ps aux | grep bash | grep -v grep
jsleeio          29159   0.0  0.0  4298808    888 s010  S+    5:41pm   0:00.93 -bash
jsleeio          28240   0.0  0.0  4298808    888 s009  S     5:37pm   0:01.13 -bash
jsleeio          24008   0.0  0.0  4297784    888 s008  S     2:30pm   0:01.24 -bash

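This works fine, but we don’t actually need the second grep if we are a little more careful with the pattern. Writing the first letter as a bracket expression, [b]ash, still matches bash, but the grep process’s own command line now contains the literal text [b]ash, which the pattern does not match: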

$ ps aux | grep -- '-\<[b]ash\>'
jsleeio          28240   0.0  0.0  4298808    888 s009  S     5:37pm   0:01.13 -bash
jsleeio          24008   0.0  0.0  4297784    888 s008  S     2:30pm   0:01.24 -bash
jsleeio          20896   0.0  0.0  4298808    896 s007  S     1:12pm   0:01.84 -bash

This approach is still somewhat flawed. If we have a user whose username ends with -bash, or contains -bash followed by any of a number of other word-ending characters, it may still yield unexpected matches. Unlikely, perhaps, but possible. We’ll ignore that for the moment and focus on getting the PIDs, which are in the second column. The classic approach would then be:

$ ps aux | grep -- '-\<[b]ash\>' | awk '{ print $2 }'
29159
28240
24008

That is, grep the output and then only print the second column.
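To make the flaw mentioned above concrete, here’s a small sketch that runs the same pipeline over two canned ps-style lines instead of live output. The ci-bash user and its sleep process are invented purely for illustration, and the \< and \> word anchors are assumed to be supported by your grep, as they are in the examples above.

```shell
# Two canned ps-style lines. The "ci-bash" user and its sleep process
# are invented for illustration; they are not from a real system.
sample='jsleeio          29159   0.0  0.0  4298808    888 s010  S+    5:41pm   0:00.93 -bash
ci-bash          31007   0.0  0.0  4268040    712 s011  S     5:50pm   0:00.01 sleep 600'

# Same pattern as above. It matches anywhere on the line, so the
# "-bash" inside the USER column of the second line also counts.
pids=$(printf '%s\n' "$sample" | grep -- '-\<[b]ash\>' | awk '{ print $2 }')

printf '%s\n' "$pids"
# 29159
# 31007
```

The second PID belongs to a sleep process, not a shell, which is exactly the kind of unexpected match that whole-line matching invites.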

example: reinventing pgrep, badly, part 2

It would really be much safer if we could actually constrain our matching to the part of the ps output that we’re interested in — in this case, the 11th whitespace-delimited field. This is where awk really begins to shine. It turns out that we don’t actually need that grep at all:

$ ps aux | awk '$11 ~ /^-?bash$/ { print $2 " " $11 }'
29159 -bash
28240 -bash
24008 -bash

I’ve included the text we’re matching on ($11) here as well to demonstrate that we’re matching the right thing. So what’s actually happening here?

The condition $11 ~ /^-?bash$/ performs a regular expression match against only the contents of the 11th field of each record. The following block, containing the print statement, will only be executed for records for which the regular expression matched. Here’s a more complex example, with two conditions:

$ ps aux | awk '$11 ~ /^\/usr/ && $11 !~ /d$/ { print $2 " " $11 }' | head -3
29158 /usr/bin/login
28239 /usr/bin/login
26670 /usr/libexec/PerfPowerServices

Here, we’re instead checking that the command starts with /usr and, via the !~ operator, does not end with d.
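If you’d like to experiment without depending on whatever happens to be running on your machine, here’s a minimal sketch that applies both kinds of condition to canned ps-style text. The sample rows are modelled on the output above, and the ci-bash user is invented to show that a match in another column no longer matters.

```shell
# Canned ps-style text so the result is reproducible.
# The "ci-bash" row is invented for illustration.
sample='USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
jsleeio          29159   0.0  0.0  4298808    888 s010  S+    5:41pm   0:00.93 -bash
ci-bash          31007   0.0  0.0  4268040    712 s011  S     5:50pm   0:00.01 sleep 600
root             29158   0.0  0.0  4331500   6024 s010  Ss    5:41pm   0:00.02 /usr/bin/login
root             26661   0.0  0.0  4331500   6024   ??  Ss    3:46pm   0:00.42 /usr/sbin/ocspd'

# Match only against field 11: "ci-bash" in field 1 is ignored.
shells=$(printf '%s\n' "$sample" | awk '$11 ~ /^-?bash$/ { print $2 }')

# Two conditions: field 11 starts with /usr AND does not end with d.
not_daemons=$(printf '%s\n' "$sample" |
  awk '$11 ~ /^\/usr/ && $11 !~ /d$/ { print $2 " " $11 }')

printf '%s\n' "$shells" "$not_daemons"
# 29159
# 29158 /usr/bin/login
```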

example: searching processes, but with headers

ps, like many other Unix system interrogation utilities, helpfully prints column headings. But if you’re filtering it with grep, you tend to lose them. Wouldn’t it be nice if we could… keep them?

With awk, it’s quite simple. In the example below, we’re searching for processes whose command path starts with /usr and ends with d, but we also add a condition to check whether we’re looking at the first record:

$ ps aux | awk 'NR == 1 || ( $11 ~ /d$/ && $11 ~ /^\/usr/ )'
USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
_hidd              105   4.3  0.0  4335552   8052   ??  Ss   Mon01pm   6:07.27 /usr/libexec/hidd
root             26661   0.0  0.0  4331500   6024   ??  Ss    3:46pm   0:00.42 /usr/sbin/ocspd
jsleeio           9510   0.0  0.0  4304832   4276   ??  S    11:33pm   0:00.03 /usr/libexec/mobileactivationd

This example also shows that awk allows grouping of conditions with parentheses, and that if no block is specified, the default action is to simply output the matching records as-is.
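The same idea can be wrapped up for reuse. What follows is only a sketch: psgrep_fields is a hypothetical helper name I’ve made up, and it assumes the command is in the 11th field, as it is in the ps aux output shown here. The pattern is passed in with awk’s -v option and used as a dynamic regular expression, so slashes don’t need escaping (but backslashes in the pattern would be processed by -v, so I’ve avoided them).

```shell
# psgrep_fields is a hypothetical helper, not a standard tool.
# It reads ps-style text on stdin, always keeps the header (record 1),
# and keeps any other record whose 11th field matches the regex in $1.
psgrep_fields() {
  awk -v pat="$1" 'NR == 1 || $11 ~ pat'
}

# Canned ps-style text modelled on the output above.
sample='USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
_hidd              105   4.3  0.0  4335552   8052   ??  Ss   Mon01pm   6:07.27 /usr/libexec/hidd
root             29158   0.0  0.0  4331500   6024 s010  Ss    5:41pm   0:00.02 /usr/bin/login
jsleeio          29159   0.0  0.0  4298808    888 s010  S+    5:41pm   0:00.93 -bash'

# No action block is given, so matching records are printed as-is.
result=$(printf '%s\n' "$sample" | psgrep_fields '^/usr/.*d$')
printf '%s\n' "$result"
```

On a live system the equivalent would be ps aux | psgrep_fields '^/usr/.*d$', which should print the header followed by whichever matching processes are running.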

wrapping up

In this post I’ve demonstrated a few features of awk:

  • regular expression matching on text fields
  • selecting a specific record by its number, using NR
  • conditional execution of awk code blocks
  • the default action, when no block is given, being to print the record

I hope some of this is useful. I have more ideas for awk posts and hope to write them soon.

PS: if you need pgrep functionality, use pgrep!