Diving deeper into AWK
I’ve always liked the Unix software toolbox concept: a variety of focused,
easily-understood, easily-composed tools that each do approximately one thing.
awk is one of my favourites, so I
thought I’d write about how I most often apply it, and how you can get more
from it too.
In the following examples, I’ll use output from ps, a useful source of test
data that is universally available on BSD, Linux and macOS systems. Things
should be mostly the same on other systems, too. If you’re using Solaris, for
example, there’s a BSD-like ps in /usr/ucb/ps.
example: reinventing pgrep, badly, part 1
A use case I’ve seen many times through the years has been “find all the
process IDs of X”. Here’s a common pattern, in which we use grep to filter
and another grep to avoid showing the first grep in the results:
$ ps aux | grep bash | grep -v grep
jsleeio 29159 0.0 0.0 4298808 888 s010 S+ 5:41pm 0:00.93 -bash
jsleeio 28240 0.0 0.0 4298808 888 s009 S 5:37pm 0:01.13 -bash
jsleeio 24008 0.0 0.0 4297784 888 s008 S 2:30pm 0:01.24 -bash
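This works fine, but we don’t actually need the second grep if we are a little
more careful. The bracket expression [b] matches only the letter b, so the
pattern still matches bash, while grep’s own command line contains the literal
text [b]ash, which the pattern does not match: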
$ ps aux | grep -- '-\<[b]ash\>'
jsleeio 28240 0.0 0.0 4298808 888 s009 S 5:37pm 0:01.13 -bash
jsleeio 24008 0.0 0.0 4297784 888 s008 S 2:30pm 0:01.24 -bash
jsleeio 20896 0.0 0.0 4298808 896 s007 S 1:12pm 0:01.84 -bash
This approach is still somewhat flawed. If we have a user whose username ends
with -bash, or contains -bash followed by any of a number of other
word-ending characters, it may still yield unexpected matches. Unlikely,
perhaps, but possible. We’ll ignore that for the moment and focus on getting
the PIDs, which are in the second column. The classic approach would then be:
$ ps aux | grep -- '-\<[b]ash\>' | awk '{ print $2 }'
29159
28240
24008
That is, grep the output and then only print the second column.
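To see why the bracket trick works, here is a minimal sketch that runs the same kind of filter over a few canned lines instead of live ps output. The sample data is made up for illustration, including the line standing in for grep’s own process entry. The \< and \> word-boundary operators are left out here because they are an extension rather than part of POSIX regular expressions.

```shell
# Canned lines standing in for `ps aux` output (hypothetical data).
sample='jsleeio 29159 0.0 0.0 4298808 888 s010 S+ 5:41pm 0:00.93 -bash
jsleeio 30001 0.0 0.0 4268040 700 s010 S+ 5:42pm 0:00.00 grep [b]ash
root 105 4.3 0.0 4335552 8052 ?? Ss Mon01pm 6:07.27 /usr/libexec/hidd'

# The pattern [b]ash matches the text "bash", but not the literal text
# "[b]ash" that appears in grep's own command line, so no second grep
# is needed to hide it. awk then prints the second column, the PID.
pids=$(printf '%s\n' "$sample" | grep -- '[b]ash' | awk '{ print $2 }')

echo "$pids"
```

Only the PID of the -bash line should come out; the canned grep line is skipped even though nothing explicitly excludes it.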
example: reinventing pgrep, badly, part 2
It would really be much safer if we could actually constrain our matching to
the part of the ps output that we’re interested in: in this case, the 11th
whitespace-delimited field. This is where awk really begins to shine. It
turns out that we don’t actually need that grep at all:
$ ps aux | awk '$11 ~ /^-?bash$/ { print $2 " " $11 }'
29159 -bash
28240 -bash
24008 -bash
I’ve included the text we’re matching on ($11) here as well to demonstrate
that we’re matching the right thing. So what’s actually happening here?
The condition $11 ~ /^-?bash$/ performs a regular expression match against
only the contents of the 11th field of each record. The following block,
containing the print statement, will only be executed for records for which
the regular expression matched. Here’s a more complex example, with two
conditions:
$ ps aux | awk '$11 ~ /\/usr/ && $11 !~ /d$/ { print $2 " " $11 }' | head -3
29158 /usr/bin/login
28239 /usr/bin/login
26670 /usr/libexec/PerfPowerServices
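Here, we’re instead checking that the command contains /usr and does not, via
the !~ operator, end with d. Note that /\/usr/ is unanchored, so it matches
/usr anywhere in the field; anchoring it as /^\/usr/ restricts the match to
commands that start with /usr.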
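The difference between matching the whole line and matching one field, and between an unanchored and an anchored pattern, is easier to see on fixed input. The following is a minimal sketch over canned lines; the user my-bash and the command /opt/usr/bin/tool are hypothetical and exist only to trigger the false positives.

```shell
# Canned lines standing in for `ps aux` output (hypothetical data).
sample='jsleeio 29159 0.0 0.0 4298808 888 s010 S+ 5:41pm 0:00.93 -bash
my-bash 31000 0.0 0.0 4268040 700 s011 S 5:50pm 0:00.01 vim
_hidd 105 4.3 0.0 4335552 8052 ?? Ss Mon01pm 6:07.27 /usr/libexec/hidd
jsleeio 29158 0.0 0.0 4298808 888 s010 Ss 5:41pm 0:00.02 /usr/bin/login
jsleeio 40000 0.0 0.0 4268040 700 s011 S 5:51pm 0:00.01 /opt/usr/bin/tool'

# Whole-line match: the username my-bash also contains "-bash".
whole_line=$(printf '%s\n' "$sample" | grep -- '-[b]ash' | awk '{ print $2 }')

# Field match: only records whose 11th field is bash or -bash.
field_only=$(printf '%s\n' "$sample" | awk '$11 ~ /^-?bash$/ { print $2 }')

# Unanchored /\/usr/ matches /usr anywhere in field 11.
unanchored=$(printf '%s\n' "$sample" | awk '$11 ~ /\/usr/ && $11 !~ /d$/ { print $2 }')

# Anchored /^\/usr/ matches only when field 11 starts with /usr.
anchored=$(printf '%s\n' "$sample" | awk '$11 ~ /^\/usr/ && $11 !~ /d$/ { print $2 }')

echo "whole line: $whole_line"
echo "field only: $field_only"
echo "unanchored: $unanchored"
echo "anchored:   $anchored"
```

The whole-line and unanchored variants each pick up one extra PID that the stricter variants leave out.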
example: searching processes, but with headers
ps, like many other Unix system interrogation utilities, helpfully prints
column headings. But if you’re filtering it with grep, you tend to lose them.
Wouldn’t it be nice if we could… keep them?
With awk, it’s quite simple. In the example below, we’re searching for
processes in /usr whose names end with d, but we also add a condition to check
whether we’re looking at the first record:
$ ps aux | awk 'NR == 1 || ( $11 ~ /d$/ && $11 ~ /^\/usr/ )'
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
_hidd 105 4.3 0.0 4335552 8052 ?? Ss Mon01pm 6:07.27 /usr/libexec/hidd
root 26661 0.0 0.0 4331500 6024 ?? Ss 3:46pm 0:00.42 /usr/sbin/ocspd
jsleeio 9510 0.0 0.0 4304832 4276 ?? S 11:33pm 0:00.03 /usr/libexec/mobileactivationd
This example also shows that awk allows grouping of conditions with
parentheses, and that if no block is specified, the default action is to simply
output the matching records as-is.
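As a quick check of that default action, here is a minimal sketch that runs the same condition twice over canned input, once without a block and once with the print written out. The sample lines are made up for illustration.

```shell
# Canned lines standing in for `ps aux` output (hypothetical data).
sample='USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
_hidd 105 4.3 0.0 4335552 8052 ?? Ss Mon01pm 6:07.27 /usr/libexec/hidd
jsleeio 29159 0.0 0.0 4298808 888 s010 S+ 5:41pm 0:00.93 -bash
root 26661 0.0 0.0 4331500 6024 ?? Ss 3:46pm 0:00.42 /usr/sbin/ocspd'

# No action block: awk prints each matching record unchanged.
# NR == 1 keeps the header line regardless of its contents.
implicit=$(printf '%s\n' "$sample" |
  awk 'NR == 1 || ( $11 ~ /d$/ && $11 ~ /^\/usr/ )')

# The same condition with the default action spelled out.
explicit=$(printf '%s\n' "$sample" |
  awk 'NR == 1 || ( $11 ~ /d$/ && $11 ~ /^\/usr/ ) { print $0 }')

echo "$implicit"
```

Both variables should hold the same three lines: the header plus the two /usr commands ending in d.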
wrapping up
In this post I’ve demonstrated a few features of awk:
- regular expression matching on text fields
- selecting a specific record by its number (NR)
- conditional execution of awk code blocks, the default action being print
I hope some of this is useful. I have more ideas for awk posts and hope to
write them soon.
PS: if you need pgrep functionality, use pgrep!