REGEX(3)               C Library Functions               REGEX(3)



NAME
     regcomp, regexec,  regerror,  regfree  -  regular-expression
     library

SYNOPSIS
     #include <sys/types.h>
     #include <regex.h>

     int regcomp(regex_t *preg, const char *pattern, int cflags);

     int regexec(const regex_t *preg,         const char *string,
               size_t nmatch, regmatch_t pmatch[], int eflags);

     size_t regerror(int errcode,            const regex_t *preg,
               char *errbuf, size_t errbuf_size);

     void regfree(regex_t *preg);

DESCRIPTION
     These routines implement POSIX  1003.2  regular  expressions
     (``RE''s); see re_format(7).  Regcomp compiles an RE written
     as a string into an  internal  form,  regexec  matches  that
     internal form against a string and reports results, regerror
     transforms error codes from either into human-readable  mes-
     sages,  and  regfree frees any dynamically-allocated storage
     used by the internal form of an RE.

     The header <regex.h> declares two structure  types,  regex_t
     and  regmatch_t,  the former for compiled internal forms and
     the latter for match reporting.  It also declares  the  four
     functions,  a  type regoff_t, and a number of constants with
     names starting with ``REG_''.

     Regcomp compiles the regular  expression  contained  in  the
     pattern  string,  subject to the flags in cflags, and places
     the results in the regex_t structure  pointed  to  by  preg.
     Cflags  is  the  bitwise OR of zero or more of the following
     flags:

     REG_EXTENDED  Compile modern (``extended'') REs, rather than
                   the  obsolete  (``basic'')  REs  that  are the
                   default.

     REG_BASIC     This is a synonym for 0, provided as  a  coun-
                   terpart  to  REG_EXTENDED to improve readabil-
                   ity.

     REG_NOSPEC    Compile with recognition of all special  char-
                   acters  turned  off.   All characters are thus
                   considered  ordinary,  so  the  ``RE''  is   a
                   literal  string.  This is an extension, compa-
                   tible with but not specified by POSIX  1003.2,



SunOS 5.5          Last change: March 20, 1994                  1

                   and  should  be  used with caution in software
                   intended to  be  portable  to  other  systems.
                   REG_EXTENDED and REG_NOSPEC may not be used in
                   the same call to regcomp.

     REG_ICASE     Compile for matching that ignores  upper/lower
                   case distinctions.  See re_format(7).

     REG_NOSUB     Compile for matching  that  need  only  report
                   success or failure, not what was matched.

     REG_NEWLINE   Compile for  newline-sensitive  matching.   By
                   default,  newline  is  a  completely  ordinary
                   character with no special  meaning  in  either
                   REs  or strings.  With this flag, `[^' bracket
                   expressions and `.' never match newline, a `^'
                   anchor  matches the null string after any new-
                   line in the string in addition to  its  normal
                   function,  and the `$' anchor matches the null
                   string before any newline  in  the  string  in
                   addition to its normal function.

     REG_PEND      The regular expression ends, not at the  first
                   NUL,  but just before the character pointed to
                   by the re_endp member of the structure pointed
                   to  by  preg.   The  re_endp member is of type
                   const char *.  This flag permits inclusion  of
                   NULs  in  the RE; they are considered ordinary
                   characters.  This is an extension,  compatible
                   with  but  not  specified by POSIX 1003.2, and
                   should  be  used  with  caution  in   software
                   intended to be portable to other systems.
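
     As an illustration of REG_NEWLINE, the sketch below  compares
     the same patterns with and without the flag.  The helper name
     matches_with_cflags is hypothetical, not part of the library;
     it only wraps the regcomp, regexec, and regfree calls.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of the library): returns 1 if `text`
 * matches `pattern` compiled with `cflags`, 0 if it does not, and -1
 * if the pattern fails to compile. */
static int matches_with_cflags(const char *pattern, int cflags,
                               const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success or failure is needed here. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

     With the string "one\ntwo\nthree", the RE `^two$' is expected
     to match only when REG_NEWLINE is given, and `one.two' should
     match "one\ntwo" only when it is not.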

     When successful, regcomp returns 0 and fills in  the  struc-
     ture  pointed  to  by  preg.   One  member of that structure
     (other  than  re_endp)  is  publicized:   re_nsub,  of  type
     size_t,  contains the number of parenthesized subexpressions
     within the RE (except that the value of this member is unde-
     fined if the REG_NOSUB flag was used).  If regcomp fails, it
     returns a non-zero error code; see DIAGNOSTICS.
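
     A minimal sketch of the compile step follows.  It assumes  an
     extended RE and uses a hypothetical helper,  count_subexprs,
     that reports re_nsub or -1 when regcomp fails.

```c
#include <sys/types.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` as an extended RE and return
 * the number of parenthesized subexpressions, or -1 if regcomp fails. */
static long count_subexprs(const char *pattern)
{
    regex_t re;
    long nsub;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0) {
        char msg[128];

        /* Turn the error code into a readable message. */
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return -1;
    }
    /* re_nsub is defined here because REG_NOSUB was not used. */
    nsub = (long)re.re_nsub;
    regfree(&re);
    return nsub;
}
```

     For example, `(a)(b(c))' has three subexpressions,  while  the
     unbalanced `(a' is rejected at compile time.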

     Regexec matches the compiled RE pointed to by  preg  against
     the  string,  subject  to  the  flags in eflags, and reports
     results using nmatch, pmatch, and the returned  value.   The
     RE  must  have  been  compiled  by  a previous invocation of
     regcomp.  The compiled form is not altered during  execution
     of  regexec,  so a single compiled RE can be used simultane-
     ously by multiple threads.

     By default, the NUL-terminated string pointed to  by  string
     is  considered  to  be the text of an entire line, minus any
     terminating newline.  The eflags argument is the bitwise  OR
     of zero or more of the following flags:

     REG_NOTBOL    The first character of the string is  not  the
                   beginning  of a line, so the `^' anchor should
                   not match before it.  This does not affect the
                   behavior of newlines under REG_NEWLINE.

     REG_NOTEOL    The NUL terminating the string does not end  a
                   line,  so  the  `$'  anchor  should  not match
                   before it.  This does not affect the  behavior
                   of newlines under REG_NEWLINE.

     REG_STARTEND  The string is considered to start at  string +
                   pmatch[0].rm_so  and to have a terminating NUL
                   located  at  string +  pmatch[0].rm_eo  (there
                   need  not actually be a NUL at that location),
                   regardless of the value of nmatch.  See  below
                   for the definition of pmatch and nmatch.  This
                   is  an  extension,  compatible  with  but  not
                   specified  by POSIX 1003.2, and should be used
                   with caution in software intended to be  port-
                   able  to  other systems.  Note that a non-zero
                   rm_so does not imply REG_NOTBOL;  REG_STARTEND
                   affects  only  the location of the string, not
                   how it is matched.

     See re_format(7) for a discussion  of  what  is  matched  in
     situations  where an RE or a portion thereof could match any
     of several substrings of string.

     Normally, regexec returns 0 for  success  and  the  non-zero
     code  REG_NOMATCH  for  failure.  Other non-zero error codes
     may be returned in exceptional situations; see DIAGNOSTICS.
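
     The return values and the effect of REG_NOTBOL and REG_NOTEOL
     can be sketched as follows.  The helper name  exec_with_eflags
     is hypothetical; it returns the regcomp error code if compila-
     tion fails, and otherwise whatever regexec returns.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as an extended RE, run it
 * against `text` with the given eflags, and return the result code
 * (0 for a match, REG_NOMATCH for no match, or another error code). */
static int exec_with_eflags(const char *pattern, const char *text,
                            int eflags)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    if (rc != 0)
        return rc;                  /* compile error code */
    rc = regexec(&re, text, 0, NULL, eflags);
    regfree(&re);
    return rc;
}
```

     A caller scanning the middle of a line might pass  REG_NOTBOL
     so that `^b' does not match at the start of the buffer "bc".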

     If REG_NOSUB was specified in the compilation of the RE,  or
     if nmatch is 0, regexec ignores the pmatch argument (but see
     below for the case where REG_STARTEND is specified).  Other-
     wise, pmatch points to an array of nmatch structures of type
     regmatch_t.  Such a structure has at least the members rm_so
     and  rm_eo,  both of type regoff_t (a signed arithmetic type
     at least as large as an off_t  and  a  ssize_t),  containing
     respectively  the  offset  of  the first character of a sub-
     string and the offset of the first character after  the  end
     of  the  substring.  Offsets are measured from the beginning
     of the string argument given to regexec.  An empty substring
     is  denoted  by equal offsets, both indicating the character
     following the empty substring.

     The 0th member of the pmatch array is filled in to  indicate
     what  substring  of  string  was  matched  by the entire RE.
     Remaining members  report  what  substring  was  matched  by
     parenthesized subexpressions within the RE; member i reports
     subexpression i, with subexpressions counted (starting at 1)
     by the order of their opening parentheses in the RE, left to
     right.  Unused entries in the array (corresponding either to
     subexpressions that did not participate in the match at all,
     or to subexpressions that do not exist in the RE,  that  is,
     i > preg->re_nsub) have both rm_so and rm_eo set to -1.   If
     a subexpression participated in the match several times, the
     reported substring is the last one it matched.  (Note, as an
     example in particular, that  when  the  RE  `(b*)+'  matches
     `bbb',  the  parenthesized subexpression matches each of the
     three `b's and then an infinite number of empty strings fol-
     lowing the last `b', so the reported substring is one of the
     empties.)
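
     The use of pmatch can be sketched as below.  The helper  name
     capture and the fixed array size of 10 are  assumptions  made
     for the example, not part of the interface.

```c
#include <sys/types.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* arbitrary size chosen for this sketch */

/* Hypothetical helper: copy the text matched by subexpression `idx`
 * (0 means the whole match) into `out`, truncating if necessary.
 * Returns the number of bytes copied, or -1 if there was no match,
 * the subexpression did not participate, or compilation failed. */
static int capture(const char *pattern, const char *text, size_t idx,
                   char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int len = -1;

    if (idx >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* Unused entries have rm_so and rm_eo set to -1. */
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[idx].rm_so != -1) {
        size_t n = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);

        if (n >= outsz)
            n = outsz - 1;
        memcpy(out, text + pm[idx].rm_so, n);
        out[n] = '\0';
        len = (int)n;
    }
    regfree(&re);
    return len;
}
```

     Matching `([a-z]+)=([0-9]+)' against "key=42" and asking  for
     subexpression 2 should yield "42"; with `(a)|(b)' against "b",
     subexpression 1 does not participate and is reported as -1.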

     If REG_STARTEND is specified, pmatch must point to at  least
     one  regmatch_t (even if nmatch is 0 or REG_NOSUB was speci-
     fied), to hold the input offsets for REG_STARTEND.  Use  for
     output  is still entirely controlled by nmatch; if nmatch is
     0 or REG_NOSUB was specified, the value  of  pmatch[0]  will
     not be changed by a successful regexec.
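
     A sketch of REG_STARTEND usage follows.  Because the flag is
     an extension, the example guards it with #ifdef; the  helper
     name find_in_range and its -2 ``unsupported'' return value are
     assumptions made for illustration.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search text[start..end) for `pattern` and
 * return the offset of the match measured from the beginning of
 * `text`, -1 if there is no match (or the pattern does not compile),
 * or -2 if this system's <regex.h> lacks REG_STARTEND. */
static long find_in_range(const char *pattern, const char *text,
                          size_t start, size_t end)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];
    long pos = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* pmatch[0] carries the input range as well as the result. */
    pm[0].rm_so = (regoff_t)start;
    pm[0].rm_eo = (regoff_t)end;
    if (regexec(&re, text, 1, pm, REG_STARTEND) == 0)
        pos = (long)pm[0].rm_so;
    regfree(&re);
    return pos;
#else
    (void)pattern; (void)text; (void)start; (void)end;
    return -2;
#endif
}
```

     Since nmatch is 1 here, pm[0] is overwritten with the  match
     offsets on success; a caller that needs the  original  range
     afterwards should keep its own copy.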

     Regerror maps a non-zero  errcode  from  either  regcomp  or
     regexec  to a human-readable, printable message.  If preg is
     non-NULL, the error code should have arisen from use of  the
     regex_t  pointed to by preg, and if the error code came from
     regcomp, it should have been the result from the most recent
     regcomp using that regex_t.  (Regerror may be able to supply
     a more detailed message using information from the regex_t.)
     Regerror  places  the NUL-terminated message into the buffer
     pointed to by errbuf, limiting  the  length  (including  the
     NUL)  to  at  most  errbuf_size bytes.  If the whole message
     won't fit, as much of it as will fit before the  terminating
     NUL  is  supplied.   In  any case, the returned value is the
     size of buffer needed to hold the whole  message  (including
     terminating  NUL).   If  errbuf_size is 0, errbuf is ignored
     but the return value is still correct.
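
     The returned size allows a two-call pattern:  ask  for  the
     needed size first, then allocate and fetch the message.  The
     helper name error_message below is hypothetical.

```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd copy of the message for
 * `errcode`, or NULL if allocation fails.  The caller frees it. */
static char *error_message(int errcode, const regex_t *preg)
{
    /* With errbuf_size 0, errbuf is ignored and only the size
     * (including the terminating NUL) is returned. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *msg;

    if (need == 0)
        return NULL;
    msg = malloc(need);
    if (msg != NULL)
        regerror(errcode, preg, msg, need);
    return msg;
}
```

     A fixed-size buffer also works if truncation is  acceptable,
     since the message is always NUL-terminated within errbuf_size
     bytes.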

     If  the  errcode  given  to  regerror  is  first  ORed  with
     REG_ITOA, the ``message'' that results is the printable name
     of the error code,  e.g.  ``REG_NOMATCH'',  rather  than  an
     explanation  thereof.   If  errcode  is  REG_ATOI, then preg
     shall be non-NULL and the re_endp member of the structure it
     points to must point to the printable name of an error code;
     in this case, the result in errbuf is the decimal digits  of
     the  numeric  value  of the error code (0 if the name is not
     recognized).  REG_ITOA and REG_ATOI are  intended  primarily
     as  debugging  facilities;  they  are extensions, compatible
     with but not specified by POSIX 1003.2, and should  be  used
     with  caution  in  software intended to be portable to other
     systems.  Be warned also that they are considered experimen-
     tal and changes are possible.

     Regfree frees any dynamically-allocated  storage  associated
     with  the  compiled  RE  pointed  to by preg.  The remaining
     regex_t is no longer a valid compiled RE and the  effect  of
     supplying it to regexec or regerror is undefined.

     None of these functions references global  variables  except
     for  tables of constants; all are safe for use from multiple
     threads if the arguments are safe.

IMPLEMENTATION CHOICES
     There are a number of decisions that 1003.2 leaves up to the
     implementor, either by explicitly saying ``undefined'' or by
     virtue of their being forbidden by  the  RE  grammar.   This
     implementation treats them as follows.

     See re_format(7) for  a  discussion  of  the  definition  of
     case-independent matching.

     There is no particular limit on the length  of  REs,  except
     insofar as memory is limited.  Memory usage is approximately
     linear in RE size, and largely insensitive to RE complexity,
     except  for  bounded repetitions.  See BUGS for one short RE
     using them that will run almost any system out of memory.

     A backslashed character other than one specifically given  a
     magic  meaning  by 1003.2 (such magic meanings occur only in
     obsolete [``basic''] REs) is taken as an ordinary character.

     Any unmatched [ is a REG_EBRACK error.

     Equivalence classes cannot begin or  end  bracket-expression
     ranges.  The endpoint of one range cannot begin another.

     RE_DUP_MAX,  the  limit  on  repetition  counts  in  bounded
     repetitions, is 255.

     A repetition operator (?, *, +,  or  bounds)  cannot  follow
     another  repetition  operator.  A repetition operator cannot
     begin an expression or subexpression or follow `^' or `|'.

     `|' cannot appear first or  last  in  a  (sub)expression  or
     after another `|', i.e. an operand of `|' cannot be an empty
     subexpression.  An empty parenthesized subexpression,  `()',
     is  legal and matches an empty (sub)string.  An empty string
     is not a legal RE.

     A `{' followed by a digit is  considered  the  beginning  of
     bounds  for a bounded repetition, which must then follow the
     syntax for bounds.  A `{' not followed by a  digit  is  con-
     sidered an ordinary character.

     `^' and `$' beginning and ending subexpressions in  obsolete
     (``basic'') REs are anchors, not ordinary characters.

SEE ALSO
     grep(1), re_format(7)

     POSIX 1003.2, sections 2.8 (Regular Expression Notation) and
     B.5 (C Binding for Regular Expression Matching).

DIAGNOSTICS
     Non-zero error codes from regcomp and  regexec  include  the
     following:

     REG_NOMATCH    regexec() failed to match
     REG_BADPAT     invalid regular expression
     REG_ECOLLATE   invalid collating element
     REG_ECTYPE     invalid character class
     REG_EESCAPE    \ applied to unescapable character
     REG_ESUBREG    invalid backreference number
     REG_EBRACK     brackets [ ] not balanced
     REG_EPAREN     parentheses ( ) not balanced
     REG_EBRACE     braces { } not balanced
     REG_BADBR      invalid repetition count(s) in { }
     REG_ERANGE     invalid character range in [ ]
     REG_ESPACE     ran out of memory
     REG_BADRPT     ?, *, or + operand invalid
     REG_EMPTY      empty (sub)expression
     REG_ASSERT     ``can't happen''; you found a bug
     REG_INVARG     invalid argument, e.g. negative-length string

HISTORY
     Originally written by Henry Spencer.  Altered for  inclusion
     in the 4.4BSD distribution.

BUGS
     This is an alpha release with known defects.  Please  report
     problems.

     There is one known functionality bug.  The implementation of
     internationalization  is  incomplete:   the locale is always
     assumed to be the default one of 1003.2, and only  the  col-
     lating elements etc. of that locale are available.

     The back-reference code is subtle and  doubts  linger  about
     its correctness in complex cases.

     Regexec performance is poor.  This will improve  with  later
     releases.  Nmatch exceeding 0 is expensive; nmatch exceeding
     1 is worse.  Regexec is largely insensitive to RE complexity
     except  that  back  references  are massively expensive.  RE
     length does matter; in particular, there is a  strong  speed
     bonus  for keeping RE length under about 30 characters, with
     most special characters counting roughly double.

     Regcomp implements bounded repetitions by  macro  expansion,
     which  is  costly  in  time and space if counts are large or
     bounded  repetitions  are  nested.    An   RE   like,   say,
     `((((a{1,100}){1,100}){1,100}){1,100}){1,100}' will (eventu-
     ally) run almost any existing machine out of swap space.

     There are suspected problems with response to obscure  error
     conditions.   Notably,  certain  kinds of internal overflow,
     produced only by truly enormous REs or  by  multiply  nested
     bounded repetitions, are probably not handled well.

     Due to a mistake in 1003.2, things like `a)b' are legal  REs
     because `)' is a special character only in the presence of a
     previous unmatched `('.  This can't be fixed until the  spec
     is fixed.

     The standard's definition of back references is vague.   For
     example,  does  `a\(\(b\)*\2\)*d'  match `abbbd'?  Until the
     standard is clarified, behavior in such cases should not  be
     relied on.

     The implementation of word-boundary matching is a bit  of  a
     kludge,  and  bugs may lurk in combinations of word-boundary
     matching and anchoring.