3 copyright: 2024 Nick Bowler
5 published: 2024-06-29T14:41:04-0400
8 *[ERE]: Extended Regular Expression
I have ported several nontrivial awk programs to a wide range of real-world
awk implementations. If I have learned anything from this process, it is
that although almost every system of interest has a POSIX-like "new awk"
implementation, which is a great tool for writing portable scripts, these
implementations vary considerably in their details. There is simply no
substitute for real-world interoperability testing.
This article describes actual issues I have encountered, and suggests
workarounds for each. Specific operating systems and versions are mentioned
in the examples to help reproduce the results: they indicate which systems
I actually ran the examples on to demonstrate a particular implementation
behaviour. They are not intended to suggest that a particular problem
occurs only on that one OS version (even though that may indeed be the case).
For discussion of traditional (pre-POSIX "old awk") implementations, the GNU
Autoconf manual has a lot of details, but it is relatively sparse in its
coverage of portability problems amongst POSIX "new awk" implementations.
31 * POSIX specifies that `awk -f -` reads its program from standard input.
32 However, AIX 7.2 awk reads from a file named `-` instead:
36 awk: 0602-546 Cannot find or open file -.
39 To work around this problem, pass the program directly as an argument
40 or write it to a named file first.
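The first workaround is easy to wrap in a shell function. The following is a minimal sketch, assuming the program text is small enough to pass as a single argument; `run_awk_from_stdin` is a hypothetical name chosen for illustration, not a standard utility:

```shell
# run_awk_from_stdin (hypothetical helper): read an awk program from standard
# input and pass it to awk as an argument, so "awk -f -" is never needed.
run_awk_from_stdin() {
    prog=$(cat) || return 1
    awk "$prog" "$@"
}

echo 'BEGIN { print "hello"; }' | run_awk_from_stdin
```

Since the function consumes standard input to obtain the program, any data for the awk program itself must come from file operands.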
42 * Solaris 8 nawk requires a space between the `-f` option and its argument:
45 solaris8% echo 'BEGIN { print "hello"; }' | nawk -f-
46 nawk: no program filename
47 solaris8% echo 'BEGIN { print "hello"; }' | nawk -f -
51 * ULTRIX 4.5 nawk does not support the `-v` option. You can typically use
52 an assignment right after the program instead, for example:
55 ultrix45% echo | nawk '{ print var; }' var=hello
59 but note that such assignments are performed after `BEGIN` actions.
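The difference in timing is easy to observe. This sketch (the helper name `show_assignment_timing` is made up for illustration) prints the variable from both a `BEGIN` action and a main rule:

```shell
# show_assignment_timing (hypothetical helper): the assignment operand
# "var=hello" is processed only when awk reaches it in the operand list,
# which happens after all BEGIN actions have run.
show_assignment_timing() {
    echo | awk 'BEGIN { print "BEGIN sees [" var "]"; }
                { print "main rule sees [" var "]"; }' var=hello
}

show_assignment_timing
```

On a POSIX-conforming awk, the `BEGIN` action prints an empty value and the main rule prints `hello`, so any initialization that depends on the variable has to move out of `BEGIN`.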
* POSIX specifies that backslash escape sequences are evaluated in command-line
variable assignments as if they appeared in a string literal in the awk
program. However, ULTRIX 4.5 nawk interprets backslashes literally, for
example:
67 gnu% echo | gawk '{ print var; }' var='\\'
69 ultrix45% echo | nawk '{ print var; }' var='\\'
Replace backslashes with some other character(s) to avoid this problem.
You can use `gsub` to restore them inside the awk program (but see the notes
on substituting literal backslashes, below).
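As a rough sketch of this approach, the shell function below encodes backslashes before passing a value to awk, and the awk program decodes them again. The function name `pass_with_backslashes` and the choice of `@` as the stand-in character are assumptions for illustration; the stand-in must not otherwise occur in the value. The decoding step probes how the running awk treats backslashes in replacement strings instead of hard-coding one behaviour:

```shell
# pass_with_backslashes (hypothetical helper): pass $1 to awk through a
# command-line assignment without any backslash in the assignment itself.
# "@" is an assumed placeholder that must not occur in the value.
pass_with_backslashes() {
    encoded=$(printf '%s\n' "$1" | sed 's/\\/@/g')
    echo | awk '
        BEGIN {
            # Probe how this awk treats backslashes in replacement strings.
            bs = "x"; sub(/x/, "\\\\", bs);
            bs = (length(bs) == 1 ? "\\\\" : "\\");
        }
        # Restore the backslashes in a copy of the assigned value.
        { value = var; gsub(/@/, bs, value); print value; }
    ' var="$encoded"
}

pass_with_backslashes 'C:\dir\file'
```

The value is decoded in a main rule because the assignment is not yet visible in `BEGIN` actions.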
79 * Normally, assignment to `$0` recomputes `NF`. However, with AIX 7.2 awk,
80 this does not happen for such assignments in `END` actions. In this case,
81 `NF` retains its prior value:
84 aix72% echo a b c | awk 'END { $0 = "x"; print NF; }'
88 You can use the `split` function instead to work around this issue.
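For instance, a field count for an arbitrary string can be obtained in an `END` action without touching `$0`. This is only a small sketch; the function name `count_words_in_end` and the array name `parts` are arbitrary:

```shell
# count_words_in_end (hypothetical helper): split a string in an END action
# with split() instead of assigning to $0 and reading NF.
count_words_in_end() {
    echo a b c | awk 'END { n = split("x y", parts); print n; }'
}

count_words_in_end
```

The return value of `split` takes the place of `NF`, and the elements `parts[1]` through `parts[n]` take the place of the field variables.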
* ULTRIX 4.5 nawk has a bug where the wrong number of characters is
sometimes copied if `$0` is assigned to another variable after it has been
directly modified by the program. For example:
95 ultrix45% echo x | nawk '{ $0 = "hello"; x = $0; print x "rld"; }'
97 ultrix45% echo xx | nawk '{ $0 = "hello"; x = $0; print x "rld"; }'
101 This bug only occurs with `$0`, and can be avoided with an intervening
102 assignment to one of the field variables, or if the assignment uses
103 `$0` in a slightly more complex expression, such as:
106 ultrix45% echo x | nawk '{ $0 = "hello"; $1 = $1; x = $0; print x "rld"; }'
108 ultrix45% echo x | nawk '{ $0 = "hello"; x = "" $0; print x "rld"; }'
114 * AIX 7.2 awk fails to substitute a replacement string containing `"\1"`
115 (start-of-heading) characters with either the `sub` or `gsub` functions.
116 Any `"\1"` characters in the replacement are silently changed to ampersands
120 aix72% awk 'BEGIN { s="x"; sub("x","\1",s); sub("\1","x",s); print s; }'
124 The issue only affects characters in the replacement text. If some other
125 character can be used instead of `"\1"`, there is no problem with (`g`)`sub`.
126 Otherwise, use `index`, `match` and/or `substr` instead of (`g`)`sub`.
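One way to do this for literal (non-regexp) search strings is sketched below. The awk function `replace_first` and the shell wrapper `demo_replace_first` are hypothetical names; the function uses only `index`, `substr` and `length`, so the replacement text is never interpreted by (`g`)`sub`:

```shell
demo_replace_first() {
    awk '
    # replace_first (hypothetical helper): replace the first occurrence of the
    # literal string "needle" in "s" with "repl". No regexp matching and no
    # ampersand or backslash processing is involved.
    function replace_first(s, needle, repl,    pos) {
        pos = index(s, needle);
        if (pos == 0)
            return s;
        return substr(s, 1, pos - 1) repl substr(s, pos + length(needle));
    }
    BEGIN {
        s = replace_first("axb", "x", "\1");   # insert a "\1" character
        print length(s), replace_first(s, "\1", "y");
    }'
}

demo_replace_first
```

The example substitutes a `"\1"` character for `x` and then replaces it again with `y`, so the output shows that the intermediate string had the expected length and that the character survived the round trip.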
128 * ULTRIX 4.5 nawk does not understand octal escapes in ERE literals, but
129 it works as expected when a string containing such characters is converted
130 to a regexp. For example:
133 ultrix45% echo '\01' | nawk '/\1/ { print "match"; }'
134 awk: syntax error "number in \[0-9] invalid" in /\1/
137 ultrix45% echo '\01' | nawk '$0 ~ "\1" { print "match"; }'
141 # Substituting Literal Backslashes
143 * Various awk implementations differ in their handling of backslashes in the
144 replacement strings passed to `sub` or `gsub`.
146 If the replacement string is `"\\\\\\\\"` (i.e., contains four consecutive
147 backslashes), then GNU gawk and ULTRIX 4.5 nawk will substitute two
148 backslashes, while most other systems substitute four:
151 gnu% echo /x/ | gawk 'gsub(/x/, "\\\\\\\\")'
153 gnu% echo /x/ | POSIXLY_CORRECT=1 gawk 'gsub(/x/, "\\\\\\\\")'
155 aix72% echo /x/ | awk 'gsub(/x/, "\\\\\\\\")'
157 ultrix45% echo /x/ | nawk 'gsub(/x/, "\\\\\\\\")'
161 If the replacement string is `"\\\\"` (two backslashes), then GNU gawk
162 (in POSIX-conforming mode) and ULTRIX 4.5 nawk will substitute one
163 backslash, while most other systems substitute two:
166 gnu% echo /x/ | gawk 'gsub(/x/, "\\\\")'
168 gnu% echo /x/ | POSIXLY_CORRECT=1 gawk 'gsub(/x/, "\\\\")'
170 aix72% echo /x/ | awk 'gsub(/x/, "\\\\")'
172 ultrix45% echo /x/ | nawk 'gsub(/x/, "\\\\")'
If the replacement string is `"\\"` (one backslash), then most
implementations substitute a single backslash. ULTRIX 4.5 nawk, however,
substitutes nothing and then, for good measure, also deletes the rest of
the input string:
182 gnu% echo /x/ | gawk 'gsub(/x/, "\\")'
184 ultrix45% echo /x/ | nawk 'gsub(/x/, "\\")'
188 To work around all of these differences, construct a replacement string based
189 on a runtime probe of what actually happens:
193 bs="x"; sub(/x/, "\\\\", bs);
194 bs = (length(bs) == 1 ? "\\\\" : "\\" );
196 gsub(/x/, bs) # portably substitute a single backslash
197 gsub(/y/, bs bs) # portably substitute two consecutive backslashes
200 # Function Definitions
* POSIX specifies that if a function call has fewer arguments than the number
of parameters in the function definition, then the additional parameters
are treated as uninitialized scalars or arrays depending on how they are
used in the function body.
207 However, ULTRIX 4.5 nawk treats all such excess parameters as scalars and
208 using them as arrays in the function body leads to unpredictable results.
210 To work around this problem, use a global array with a unique name instead,
211 and explicitly delete all its elements at the beginning or the end of the
212 function (such as by writing `split("",global_array_name)`).
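A minimal sketch of this pattern follows. The function `distinct_words` and the global arrays `dw_words` and `dw_seen` are hypothetical names; the `dw_` prefix is just one way to keep the global names unique:

```shell
demo_distinct_words() {
    awk '
    # distinct_words (hypothetical helper): count the distinct words in "s".
    # dw_words and dw_seen are global scratch arrays, used in place of
    # array-valued local parameters.
    function distinct_words(s,    n, i, count) {
        split("", dw_seen);          # clear the scratch array on entry
        n = split(s, dw_words);      # split() itself clears dw_words
        count = 0;
        for (i = 1; i <= n; i++) {
            if (!(dw_words[i] in dw_seen)) {
                dw_seen[dw_words[i]] = 1;
                count++;
            }
        }
        return count;
    }
    BEGIN { print distinct_words("a b a c"), distinct_words("x x"); }'
}

demo_distinct_words
```

The second call returns the right answer only because `dw_seen` is cleared on entry. Note that a recursive function cannot share a global scratch array this way.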
* HP-UX 11 awk exits with an error if any input line contains more than 199
fields. If this might be a problem, set `FS` to a value that is not expected
to occur in the input (so that records are not split into fields) and, if
necessary, use the `split` function to split them explicitly.
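A sketch of this approach is shown below. It assumes that the `"\1"` character never occurs in the input, so awk itself never splits a record, and `count_fields_safely` is a hypothetical name. It has not been verified on HP-UX itself and only illustrates the structure:

```shell
# count_fields_safely (hypothetical helper): print the number of
# whitespace-separated fields on each line, and the last field, without
# relying on automatic field splitting.
count_fields_safely() {
    awk 'BEGIN { FS = "\1"; }   # assumed never to occur in the input
         { n = split($0, f, " "); print n, f[n]; }'
}

echo 'a b  c' | count_fields_safely
```

The explicit `" "` separator is needed because `split` would otherwise use the current value of `FS`.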
220 * HP-UX 11 awk exits with an error if any input line exceeds 3070 bytes.
221 This applies to normal input and all variations of the getline function.
222 You might be able to preprocess the input to split long lines before
223 further processing in awk.
225 * ULTRIX 4.5 nawk regular expression matching fails if too much data would be
226 matched by the `*` or `+` regex operators. The exact limit varies depending
227 on the regex, but for example `.*` fails to match a substring longer than
231 ultrix45% nawk 'BEGIN { x="x"; for (i = 0; i < 12; i++) x = (x x);
232 match(x, /...*/); print length(x), RSTART, RLENGTH;
233 match(x, /....*/); print length(x), RSTART, RLENGTH;
239 # Expression Evaluation
241 * HP-UX 11 awk misparses most expressions where the unary `!` operator is
242 used as the operand of a binary operator, for example:
245 hpux11% awk 'BEGIN { print !0 + 1; }'
247 hpux11% awk 'BEGIN { print 1 + !0; }'
248 syntax error The source line is 1.
250 BEGIN { print 1 + >>> ! <<< 0 }
251 awk: The statement cannot be correctly parsed.
252 The source line is 1.
255 Add parentheses to avoid the problem:
258 hpux11% awk 'BEGIN { print 1 + (!0), (!0) + 1; }'
* POSIX specifies that pattern expressions are evaluated in a boolean context,
and that in such contexts nonempty strings are _true_ and empty strings are
_false_. However, ULTRIX 4.5 nawk treats pattern expressions as integers, so
strings which convert to a nonzero integer are _true_ and all other strings
are _false_:
268 gnu% echo x | gawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
270 ultrix45% echo x | nawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
272 ultrix45% echo 9 | nawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
276 The same bug also occurs if a string is used as the operand of any of
277 the logical operators (`!`, `&&` or `||`), but the bug does not occur
278 if a string is used as the first operand of the `?:` operator, or if
279 a string is used as the conditional expression of an `if`, `for` or
Use an explicit string comparison (e.g., `$0 != ""`) to work around this
issue.
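Applied to the example above, the pattern becomes an explicit comparison against the empty string. This is a minimal sketch, and `nonempty_flag` is a hypothetical helper name:

```shell
# nonempty_flag (hypothetical helper): print 1 if any input line is nonempty,
# and 0 otherwise, without using a bare string as a pattern.
nonempty_flag() {
    awk 'BEGIN { x = 0; } $0 != "" { x = 1; } END { print x; }'
}

echo x | nonempty_flag
echo   | nonempty_flag
```

The comparison yields a numeric result, so it no longer matters how the implementation converts a string to a truth value.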
285 # Regular Expressions
287 * To include a literal closing bracket in a character class, busybox awk
288 accepts only the form `[]]`, while `[\]]` is interpreted as "backslash,
289 followed by a closing bracket". Meanwhile, Solaris 8 nawk accepts only
290 the form `[\]]` and `[]]` fails to match any input. GNU awk accepts
291 either form as equivalent. For example:
294 alpine% printf 'a]\nb\\]\n' | busybox awk '/^.[\]]$/; /^.[]]$/'
297 gnu% printf 'a]\nb\\]\n' | gawk '/^.[\]]$/; /^.[]]$/'
300 solaris8% printf 'a]\nb\\]\n' | nawk '/^.[\]]$/; /^.[]]$/'
304 For a normal (non-complemented) character class, you can use the
305 equivalent `(]|[xyz])` instead. For a complemented class, in general
306 there is no equivalent (and portable) regular expression, so the code
307 must be restructured to avoid using such classes (probably by using
308 some of awk's other string manipulation features).
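As an illustration of the rewrite for a non-complemented class, the following sketch selects two-character lines whose second character is `]`, `x`, `y` or `z`, without putting a closing bracket inside a bracket expression. The helper name `match_bracket_or_xyz` is hypothetical:

```shell
# match_bracket_or_xyz (hypothetical helper): equivalent in intent to the
# class []xyz], written as an alternation so that no closing bracket appears
# inside a bracket expression.
match_bracket_or_xyz() {
    awk '/^.(]|[xyz])$/'
}

printf 'a]\nbx\ncq\n' | match_bracket_or_xyz
```

Of the three input lines, only `a]` and `bx` are selected.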
310 * ULTRIX 4.5 nawk will prefer the left alternative of the `|` regular
311 expression operator (rather than the POSIX-specified longest matching
312 substring) in cases where both alternatives match but the left
313 alternative is shorter:
316 ultrix45% echo abcd | nawk 'sub(/a|abc/, "#") { print; }'
318 ultrix45% echo abcd | nawk 'sub(/abc|a/, "#") { print; }'
320 ultrix45% echo abcd | nawk 'sub(/(abc|a)d/, "#") { print; }'
324 In many situations this does not actually make a difference but it can
325 affect, for example, the result of the `sub`, `gsub` and `match` functions.
326 Try to arrange for the alternatives to be mutually exclusive, or for the
327 left alternative to match at least as much text as the right.
329 # Particular Functions
331 * Busybox awk exits with an error if you attempt to use `*` in `printf` or
332 `sprintf` conversions, for example:
335 alpine% busybox awk 'BEGIN { printf "%*s\n", 10, "hello"; }'
336 awk: cmd. line:1: %*x formats are not supported
339 Generate the format string dynamically to work around this issue.
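For example, the field width can be concatenated into the format string before calling `printf`. This is a small sketch; the variable names `width` and `fmt` and the wrapper `demo_dynamic_width` are arbitrary:

```shell
# demo_dynamic_width (hypothetical helper): build the format "%10s\n" at run
# time instead of writing "%*s\n" with a separate width argument.
demo_dynamic_width() {
    awk 'BEGIN { width = 10; fmt = "%" width "s\n"; printf fmt, "hello"; }'
}

demo_dynamic_width
```

This prints `hello` right-aligned in a field of ten characters, which is what the `%*s` version is meant to do.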
341 * Busybox awk does not support using `getline <"-"` to read a line from
342 standard input. It will read from a file named `-` instead. On the
343 other hand, specifying a filename of `-` on the command line or by
344 modifying the `ARGV` array does work to read from standard input.
346 Normally it is easy enough to structure the program so that it is
347 not required to read from standard input while the normal awk input
348 is something else, in which case this is not a serious limitation.
350 If a workaround for this issue is truly needed, `"cat" | getline`
351 can be used with busybox awk to read from standard input.
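A minimal sketch of that workaround is shown below, with `/dev/null` standing in for the normal awk input. The helper name `read_stdin_line` is hypothetical:

```shell
# read_stdin_line (hypothetical helper): read one line from standard input
# through a "cat" child process while the main awk input is a file operand.
read_stdin_line() {
    awk 'BEGIN { "cat" | getline line; }
         END { print "stdin said: " line; }' /dev/null
}

echo 'from stdin' | read_stdin_line
```

The `cat` process inherits standard input from awk, so the line arrives through the pipe without awk ever having to open a file named `-`.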
353 * ULTRIX 4.5 nawk behaves unpredictably if the third argument to `split` is
354 an ERE literal. Use a string (which is converted to a regexp) instead.
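For example, a separator regexp can be passed to `split` as a string instead of as `/[0-9]+/`. This is a small sketch with arbitrary names:

```shell
# demo_split_string_regexp (hypothetical helper): the third argument is a
# string, which split() converts to a regexp because it is longer than one
# character.
demo_split_string_regexp() {
    awk 'BEGIN { n = split("a1b22c", parts, "[0-9]+");
                 print n, parts[1], parts[2], parts[3]; }'
}

demo_split_string_regexp
```

The string is split at each run of digits, giving three elements.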