3 copyright: 2024 Nick Bowler
5 published: 2024-06-29T14:41:04-0400
6 updated: 2024-11-17T23:06:42-0500
9 *[ERE]: Extended Regular Expression
I have successfully ported several nontrivial awk programs to a wide range
of real-world awk implementations. If I have learned anything from this
process, it is that even though almost every system of interest has a
POSIX-like "new awk" implementation, which is a great tool for writing
portable scripts, the details vary considerably. There is simply no
substitute for real-world interoperability testing.
This article describes actual issues I have encountered, and suggests
workarounds for each. Specific operating systems and versions are mentioned
in examples to aid in reproducing results. They indicate which systems
I actually ran the examples on to demonstrate a particular implementation
behaviour. They are not intended to suggest that a particular problem only
occurs on that one particular OS version (even though that may indeed be the
case).
For discussion of traditional (pre-POSIX) "old awk" implementations, the GNU
Autoconf manual has a lot of details, but it is relatively sparse in its
coverage of portability problems amongst POSIX "new awk" implementations.
32 * POSIX specifies that `awk -f -` reads its program from standard input.
33 However, AIX 7.2 awk reads from a file named `-` instead:
37 awk: 0602-546 Cannot find or open file -.
40 To work around this problem, pass the program directly as an argument
41 or write it to a named file first.
43 * Solaris 8 nawk requires a space between the `-f` option and its argument:
46 solaris8% echo 'BEGIN { print "hello"; }' | nawk -f-
47 nawk: no program filename
48 solaris8% echo 'BEGIN { print "hello"; }' | nawk -f -
52 * ULTRIX 4.5 nawk does not support the `-v` option. You can typically use
53 an assignment right after the program instead, for example:
56 ultrix45% echo | nawk '{ print var; }' var=hello
60 but note that such assignments are performed after `BEGIN` actions.
62 * POSIX specifies that backslash escape sequences are evaluated in command-line
63 variable assignments as if they appeared in a string literal in the awk
program. However, ULTRIX 4.5 nawk interprets backslashes literally, for
example:
68 gnu% echo | gawk '{ print var; }' var='\\'
70 ultrix45% echo | nawk '{ print var; }' var='\\'
Replace backslashes with some other character(s) to avoid this problem.
You can use `gsub` to restore them inside the awk program (but see the notes
on substituting literal backslashes, below).
* IRIX 6.2 nawk evaluates and removes leading command-line variable assignments
before executing `BEGIN` actions. This is mostly a problem for actions using
`ARGC` and `ARGV`, for example:
83 irix62% nawk -v a=X 'BEGIN { print a, ARGC, ARGV[2]; }' a=Y b c=
85 gnu% gawk -v a=X 'BEGIN { print a, ARGC, ARGV[2]; }' a=Y b c=
89 Ensure the first program argument does not look like a variable assignment
90 in order to avoid this problem.
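Several of the command-line workarounds above can be combined in one small shell sketch. The helper name `run_awk_program` and the use of `mktemp` are assumptions made for illustration (`mktemp` is not available on every system mentioned here, so substitute whatever private temporary file mechanism you have). The program text is written to a named file instead of being passed with `-f -`, the `-f` option is separated from its argument by a space, and the variable is passed as an operand rather than with `-v`.

```shell
# Hypothetical helper: run an awk program whose text arrives on standard
# input, without relying on "awk -f -".
run_awk_program() {
  # mktemp is an assumption here; any private temporary file will do.
  prog=$(mktemp "${TMPDIR:-/tmp}/awkprog.XXXXXX") || return 1
  cat >"$prog"
  # Keep a space between -f and the file name.
  awk -f "$prog" "$@"
  status=$?
  rm -f "$prog"
  return "$status"
}

# Sample input file, also created with mktemp for this illustration only.
data=$(mktemp "${TMPDIR:-/tmp}/awkdata.XXXXXX") || exit 1
printf 'one\ntwo\n' >"$data"

# Pass the variable as an operand instead of using -v. The assignment is
# performed before the data file is read, but after any BEGIN actions.
greeting=$(echo '{ print var, $0 }' | run_awk_program var=hello "$data")
rm -f "$data"
printf '%s\n' "$greeting"
```

Because the program itself arrives on standard input in this sketch, the data has to come from a file operand.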
94 * Normally, assignment to `$0` recomputes `NF`. However, with AIX 7.2 awk,
95 this does not happen for such assignments in `END` actions. In this case,
96 `NF` retains its prior value:
99 aix72% echo a b c | awk 'END { $0 = "x"; print NF; }'
103 You can use the `split` function instead to work around this issue.
* ULTRIX 4.5 nawk has a bug where sometimes the wrong number of characters
is copied if `$0` is assigned to another variable after it has been
directly modified by the program. For example:
110 ultrix45% echo x | nawk '{ $0 = "hello"; x = $0; print x "rld"; }'
112 ultrix45% echo xx | nawk '{ $0 = "hello"; x = $0; print x "rld"; }'
116 This bug only occurs with `$0`, and can be avoided with an intervening
117 assignment to one of the field variables, or if the assignment uses
118 `$0` in a slightly more complex expression, such as:
121 ultrix45% echo x | nawk '{ $0 = "hello"; $1 = $1; x = $0; print x "rld"; }'
123 ultrix45% echo x | nawk '{ $0 = "hello"; x = "" $0; print x "rld"; }'
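As a minimal sketch of the `split` workaround for the `END` problem, the following counts the fields of a string without ever assigning to `$0`. The array name `parts` and the shell variable `count` are only illustrative.

```shell
# Count the fields of "x y" in an END action by splitting into an array,
# instead of assigning to $0 and reading NF.
count=$(echo a b c | awk 'END { n = split("x y", parts, " "); print n; }')
printf '%s\n' "$count"
```

The individual fields are then available as `parts[1]` through `parts[n]`, which takes the place of `$1` through `$NF`.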
129 * AIX 7.2 awk fails to substitute a replacement string containing `"\1"`
130 (start-of-heading) characters with either the `sub` or `gsub` functions.
131 Any `"\1"` characters in the replacement are silently changed to ampersands
135 aix72% awk 'BEGIN { s="x"; sub("x","\1",s); sub("\1","x",s); print s; }'
139 The issue only affects characters in the replacement text. If some other
140 character can be used instead of `"\1"`, there is no problem with (`g`)`sub`.
141 Otherwise, use `index`, `match` and/or `substr` instead of (`g`)`sub`.
143 * ULTRIX 4.5 nawk does not understand octal escapes in ERE literals, but
144 it works as expected when a string containing such characters is converted
145 to a regexp. For example:
148 ultrix45% echo '\01' | nawk '/\1/ { print "match"; }'
149 awk: syntax error "number in \[0-9] invalid" in /\1/
152 ultrix45% echo '\01' | nawk '$0 ~ "\1" { print "match"; }'
156 * IRIX 6.2 nawk fails to match strings containing newlines against an ERE
157 literal with `\n`, but it works as expected when a string containing a
158 newline is converted to a regexp. For example:
161 irix62% nawk 'BEGIN { print ("foo\nbar" ~ /\n/); }'
163 irix62% nawk 'BEGIN { print ("foo\nbar" ~ "\n"); }'
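When the replacement text has to contain a `"\1"` character, one way to avoid (`g`)`sub` entirely is a small helper built from `index` and `substr`. The function name `replace_first` is hypothetical, and this sketch only handles literal (non-regex) search strings.

```shell
restored=$(awk '
# Hypothetical helper: replace the first occurrence of the literal string
# target in s with repl, using only index and substr. Neither "&" nor
# backslashes in repl receive any special treatment.
function replace_first(s, target, repl,    i) {
  i = index(s, target)
  if (i == 0)
    return s
  return substr(s, 1, i - 1) repl substr(s, i + length(target))
}
BEGIN {
  s = replace_first("axb", "x", "\1")   # put a start-of-heading character in
  n = length(s)
  s = replace_first(s, "\1", "x")       # and take it back out again
  print n, s
}')
printf '%s\n' "$restored"
```

Replacing every occurrence would need a loop around the same two calls.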
167 # Substituting Literal Backslashes
169 * Various awk implementations differ in their handling of backslashes in the
170 replacement strings passed to `sub` or `gsub`.
If the replacement string is `"\\\\\\\\"` (i.e., contains four consecutive
backslashes), then GNU gawk (in POSIX-conforming mode) and ULTRIX 4.5 nawk
will substitute two backslashes, while most other systems substitute four:
177 gnu% echo /x/ | gawk 'gsub(/x/, "\\\\\\\\")'
179 gnu% echo /x/ | POSIXLY_CORRECT=1 gawk 'gsub(/x/, "\\\\\\\\")'
181 aix72% echo /x/ | awk 'gsub(/x/, "\\\\\\\\")'
183 ultrix45% echo /x/ | nawk 'gsub(/x/, "\\\\\\\\")'
187 If the replacement string is `"\\\\"` (two backslashes), then GNU gawk
188 (in POSIX-conforming mode) and ULTRIX 4.5 nawk will substitute one
189 backslash, while most other systems substitute two:
192 gnu% echo /x/ | gawk 'gsub(/x/, "\\\\")'
194 gnu% echo /x/ | POSIXLY_CORRECT=1 gawk 'gsub(/x/, "\\\\")'
196 aix72% echo /x/ | awk 'gsub(/x/, "\\\\")'
198 ultrix45% echo /x/ | nawk 'gsub(/x/, "\\\\")'
If the replacement string is `"\\"` (one backslash), then most
implementations will substitute a single backslash. ULTRIX 4.5 nawk,
however, substitutes nothing and then, for good measure, also deletes
the rest of the input string:
208 gnu% echo /x/ | gawk 'gsub(/x/, "\\")'
210 ultrix45% echo /x/ | nawk 'gsub(/x/, "\\")'
214 To work around all of these differences, construct a replacement string based
215 on a runtime probe of what actually happens:
219 bs="x"; sub(/x/, "\\\\", bs);
220 bs = (length(bs) == 1 ? "\\\\" : "\\" );
222 gsub(/x/, bs) # portably substitute a single backslash
223 gsub(/y/, bs bs) # portably substitute two consecutive backslashes
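For reference, a complete program built around this probe might look like the following sketch. The input text and the shell variable `result` are purely illustrative. The probe runs once in a `BEGIN` action and the main rule then uses `bs` for every substitution.

```shell
result=$(echo '/x/y/' | awk '
BEGIN {
  # Probe: substitute a two-backslash replacement and see what comes out.
  bs = "x"; sub(/x/, "\\\\", bs)
  # If one backslash came out, two are needed to get one; otherwise a
  # single backslash in the replacement already stands for itself.
  bs = (length(bs) == 1 ? "\\\\" : "\\")
}
{
  gsub(/x/, bs)      # substitute a single backslash
  gsub(/y/, bs bs)   # substitute two consecutive backslashes
  print
}')
printf '%s\n' "$result"
```

Whichever branch the probe takes, the intent is that `x` becomes one backslash and `y` becomes two.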
226 # Function Definitions
* POSIX specifies that if a function call has fewer arguments than the number
229 of parameters in the function definition, then the additional parameters
230 are treated as uninitialized scalars or arrays depending on how they are
231 used in the function body.
However, ULTRIX 4.5 nawk treats all such excess parameters as scalars, and
using them as arrays in the function body leads to unpredictable results.
236 To work around this problem, use a global array with a unique name instead,
237 and explicitly delete all its elements at the beginning or the end of the
238 function (such as by writing `split("",global_array_name)`).
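A sketch of the global-array workaround follows. The function `count_distinct` and the array names `cd_seen` and `cd_words` are hypothetical. The prefix is only there to keep the global names unique, and the scalar locals are still declared as extra parameters because only array locals are affected.

```shell
distinct=$(awk '
# Hypothetical example: count the distinct words in a string. The arrays
# cd_seen and cd_words are globals standing in for local arrays.
function count_distinct(text,    n, i, count) {
  split("", cd_seen)                 # empty the global scratch array
  n = split(text, cd_words, " ")
  count = 0
  for (i = 1; i <= n; i++) {
    if (!(cd_words[i] in cd_seen)) {
      cd_seen[cd_words[i]] = 1
      count++
    }
  }
  return count
}
BEGIN { print count_distinct("a b a c"), count_distinct("x x"); }')
printf '%s\n' "$distinct"
```

The second call sees an empty `cd_seen` only because the function clears it on entry, which is the part that a true local array would otherwise do for free.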
* HP-UX 11 awk exits with an error if any input line contains more than 199
fields. If this might be a problem, set `FS` to a string that never occurs
in the input, so that each record is a single field, and, if necessary, use
the `split` function to break the record apart.
246 * HP-UX 11 awk exits with an error if any input line exceeds 3070 bytes.
247 This applies to normal input and all variations of the getline function.
248 You might be able to preprocess the input to split long lines before
249 further processing in awk.
251 * ULTRIX 4.5 nawk regular expression matching fails if too much data would be
252 matched by the `*` or `+` regex operators. The exact limit varies depending
253 on the regex, but for example `.*` fails to match a substring longer than
ultrix45% nawk 'BEGIN { x="x"; for (i = 0; i < 12; i++) x = (x x);
match(x, /...*/); print length(x), RSTART, RLENGTH;
match(x, /....*/); print length(x), RSTART, RLENGTH;
}'
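The field-limit workaround can be sketched as follows. It assumes that the start-of-heading character never occurs in the input, and, as the text above implies, that `split` is not subject to the same limit as automatic field splitting. The names `f` and `last` are illustrative.

```shell
# Generate one line with 300 space-separated fields, then process it with
# FS set to a character assumed absent from the input, so that awk itself
# only ever sees a single field per record.
last=$(awk 'BEGIN { for (i = 1; i <= 300; i++) printf "%d ", i; print ""; }' |
  awk 'BEGIN { FS = "\001"; } { n = split($0, f, " "); print n, f[n]; }')
printf '%s\n' "$last"
```

Inside the main rule, `f[1]` through `f[n]` then play the role of `$1` through `$NF`.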
265 # Expression Evaluation
267 * HP-UX 11 awk misparses most expressions where the unary `!` operator is
268 used as the operand of a binary operator, for example:
271 hpux11% awk 'BEGIN { print !0 + 1; }'
273 hpux11% awk 'BEGIN { print 1 + !0; }'
274 syntax error The source line is 1.
276 BEGIN { print 1 + >>> ! <<< 0 }
277 awk: The statement cannot be correctly parsed.
278 The source line is 1.
281 Add parentheses to avoid the problem:
284 hpux11% awk 'BEGIN { print 1 + (!0), (!0) + 1; }'
* POSIX specifies that pattern expressions are evaluated in a boolean context,
and that in such contexts nonempty strings are _true_ and empty strings are
_false_. However, ULTRIX 4.5 nawk treats pattern expressions as integers, so
strings which convert to a nonzero integer are _true_ and all other strings
are _false_:
294 gnu% echo x | gawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
296 ultrix45% echo x | nawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
298 ultrix45% echo 9 | nawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
302 The same bug also occurs if a string is used as the operand of any of
303 the logical operators (`!`, `&&` or `||`), but the bug does not occur
304 if a string is used as the first operand of the `?:` operator, or if
305 a string is used as the conditional expression of an `if`, `for` or
Use an explicit string comparison (e.g., `$0 != ""`) to work around this
problem.
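Both workarounds in this section fit in one short sketch: the pattern uses an explicit string comparison instead of a bare `$0`, and the unary `!` is wrapped in parentheses before being used as an operand of `+`. The shell variable `flags` is illustrative.

```shell
flags=$(echo x | awk 'BEGIN { x = 0; } $0 != "" { x = 1; } END { print x, 1 + (!0); }')
printf '%s\n' "$flags"
```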
311 # Regular Expressions
313 * To include a literal closing bracket in a character class, busybox awk
314 accepts only the form `[]]`, while `[\]]` is interpreted as "backslash,
315 followed by a closing bracket". Meanwhile, Solaris 8 nawk accepts only
316 the form `[\]]` and `[]]` fails to match any input. GNU awk accepts
317 either form as equivalent. For example:
320 alpine% printf 'a]\nb\\]\n' | busybox awk '/^.[\]]$/; /^.[]]$/'
323 gnu% printf 'a]\nb\\]\n' | gawk '/^.[\]]$/; /^.[]]$/'
326 solaris8% printf 'a]\nb\\]\n' | nawk '/^.[\]]$/; /^.[]]$/'
330 For a normal (non-complemented) character class, you can use the
331 equivalent `(]|[xyz])` instead. For a complemented class, in general
332 there is no equivalent (and portable) regular expression, so the code
333 must be restructured to avoid using such classes (probably by using
334 some of awk's other string manipulation features).
336 * ULTRIX 4.5 nawk will prefer the left alternative of the `|` regular
337 expression operator (rather than the POSIX-specified longest matching
338 substring) in cases where both alternatives match but the left
339 alternative is shorter:
342 ultrix45% echo abcd | nawk 'sub(/a|abc/, "#") { print; }'
344 ultrix45% echo abcd | nawk 'sub(/abc|a/, "#") { print; }'
346 ultrix45% echo abcd | nawk 'sub(/(abc|a)d/, "#") { print; }'
350 In many situations this does not actually make a difference but it can
351 affect, for example, the result of the `sub`, `gsub` and `match` functions.
352 Try to arrange for the alternatives to be mutually exclusive, or for the
353 left alternative to match at least as much text as the right.
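The two regular expression workarounds can be sketched like this. The first command uses the `(]|[xyz])` form in place of a bracket expression containing `]`, and the second puts the longer alternative on the left so that the result does not depend on which alternative is preferred. The input strings and variable names are illustrative.

```shell
# Match lines whose second (and last) character is "]", "x", "y" or "z",
# without writing "]" inside a bracket expression.
matched=$(printf 'a]\nbx\ncq\n' | awk '/^.(]|[xyz])$/')
printf '%s\n' "$matched"

# Longer alternative first, so a leftmost-alternative matcher and a
# longest-match matcher agree on what gets replaced.
longest=$(echo abcd | awk 'sub(/abc|a/, "#") { print; }')
printf '%s\n' "$longest"
```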
355 # Particular Functions
357 * Busybox awk exits with an error if you attempt to use `*` in `printf` or
358 `sprintf` conversions, for example:
361 alpine% busybox awk 'BEGIN { printf "%*s\n", 10, "hello"; }'
362 awk: cmd. line:1: %*x formats are not supported
365 Generate the format string dynamically to work around this issue.
367 * Busybox awk does not support using `getline <"-"` to read a line from
368 standard input. It will read from a file named `-` instead. On the
369 other hand, specifying a filename of `-` on the command line or by
370 modifying the `ARGV` array does work to read from standard input.
372 Normally it is easy enough to structure the program so that it is
373 not required to read from standard input while the normal awk input
374 is something else, in which case this is not a serious limitation.
376 If a workaround for this issue is truly needed, `"cat" | getline`
377 can be used with busybox awk to read from standard input.
379 * ULTRIX 4.5 nawk behaves unpredictably if the third argument to `split` is
380 an ERE literal. Use a string (which is converted to a regexp) instead.
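As a sketch of two of the workarounds in this section, the first command below builds the `printf` format string at run time instead of using `*`, and the second reads a line from standard input through `"cat" | getline`. The variable names (`width`, `fmt`, `text`, `padded`, `line`) are illustrative.

```shell
# Build "%10s\n" by concatenation rather than writing "%*s\n".
padded=$(awk 'BEGIN { width = 10; fmt = "%" width "s\n"; printf fmt, "hello"; }')
printf '%s\n' "$padded"

# Read one line from standard input without using getline <"-".
line=$(echo from-stdin | awk 'BEGIN { "cat" | getline text; print text; }')
printf '%s\n' "$line"
```

The `cat` child process inherits awk's standard input, so this only behaves predictably when awk itself is not also reading records from standard input.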