Diffstat (limited to 'regex/regex.doc')
-rw-r--r-- | regex/regex.doc | 462 |
1 files changed, 462 insertions, 0 deletions
diff --git a/regex/regex.doc b/regex/regex.doc
new file mode 100644
index 00000000..100be4fb
--- /dev/null
+++ b/regex/regex.doc
@@ -0,0 +1,462 @@
+
+
+
+REGEX(3)               C Library Functions               REGEX(3)
+
+
+
+NAME
+     regcomp, regexec, regerror, regfree - regular-expression
+     library
+
+SYNOPSIS
+     #include <sys/types.h>
+     #include <regex.h>
+
+     int regcomp(regex_t *preg, const char *pattern, int cflags);
+
+     int regexec(const regex_t *preg, const char *string,
+               size_t nmatch, regmatch_t pmatch[], int eflags);
+
+     size_t regerror(int errcode, const regex_t *preg,
+               char *errbuf, size_t errbuf_size);
+
+     void regfree(regex_t *preg);
+
+DESCRIPTION
+     These routines implement POSIX 1003.2 regular expressions
+     (``RE''s); see re_format(7). Regcomp compiles an RE written
+     as a string into an internal form, regexec matches that
+     internal form against a string and reports results, regerror
+     transforms error codes from either into human-readable mes-
+     sages, and regfree frees any dynamically-allocated storage
+     used by the internal form of an RE.
+
+     The header <regex.h> declares two structure types, regex_t
+     and regmatch_t, the former for compiled internal forms and
+     the latter for match reporting. It also declares the four
+     functions, a type regoff_t, and a number of constants with
+     names starting with ``REG_''.
+
+     Regcomp compiles the regular expression contained in the
+     pattern string, subject to the flags in cflags, and places
+     the results in the regex_t structure pointed to by preg.
+     Cflags is the bitwise OR of zero or more of the following
+     flags:
+
+     REG_EXTENDED  Compile modern (``extended'') REs, rather than
+                   the obsolete (``basic'') REs that are the
+                   default.
+
+     REG_BASIC     This is a synonym for 0, provided as a coun-
+                   terpart to REG_EXTENDED to improve readabil-
+                   ity.
+
+     REG_NOSPEC    Compile with recognition of all special char-
+                   acters turned off. All characters are thus
+                   considered ordinary, so the ``RE'' is a
+                   literal string.
+                   This is an extension, compa-
+                   tible with but not specified by POSIX 1003.2,
+
+
+
+SunOS 5.5          Last change: March 20, 1994                  1
+
+
+
+                   and should be used with caution in software
+                   intended to be portable to other systems.
+                   REG_EXTENDED and REG_NOSPEC may not be used in
+                   the same call to regcomp.
+
+     REG_ICASE     Compile for matching that ignores upper/lower
+                   case distinctions. See re_format(7).
+
+     REG_NOSUB     Compile for matching that need only report
+                   success or failure, not what was matched.
+
+     REG_NEWLINE   Compile for newline-sensitive matching. By
+                   default, newline is a completely ordinary
+                   character with no special meaning in either
+                   REs or strings. With this flag, `[^' bracket
+                   expressions and `.' never match newline, a `^'
+                   anchor matches the null string after any new-
+                   line in the string in addition to its normal
+                   function, and the `$' anchor matches the null
+                   string before any newline in the string in
+                   addition to its normal function.
+
+     REG_PEND      The regular expression ends, not at the first
+                   NUL, but just before the character pointed to
+                   by the re_endp member of the structure pointed
+                   to by preg. The re_endp member is of type
+                   const char *. This flag permits inclusion of
+                   NULs in the RE; they are considered ordinary
+                   characters. This is an extension, compatible
+                   with but not specified by POSIX 1003.2, and
+                   should be used with caution in software
+                   intended to be portable to other systems.
+
+     When successful, regcomp returns 0 and fills in the struc-
+     ture pointed to by preg. One member of that structure
+     (other than re_endp) is publicized: re_nsub, of type
+     size_t, contains the number of parenthesized subexpressions
+     within the RE (except that the value of this member is unde-
+     fined if the REG_NOSUB flag was used). If regcomp fails, it
+     returns a non-zero error code; see DIAGNOSTICS.
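As an illustration of the regcomp contract described above, here is a minimal sketch in C. It is not part of the library: the helper name count_subexpressions is hypothetical, and the sketch assumes only the calls listed in the SYNOPSIS (regcomp, regerror, regfree) plus the re_nsub member. It compiles a pattern, reports a failure through regerror, and otherwise returns re_nsub.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper (not a library function): compile `pattern` with
 * the given cflags and return its re_nsub, or -1 if regcomp rejects it. */
long count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    long nsub;
    int rc;

    assert(pattern != NULL);
    /* re_nsub is undefined when REG_NOSUB is used, so refuse that flag. */
    assert((cflags & REG_NOSUB) == 0);

    rc = regcomp(&re, pattern, cflags);
    if (rc != 0) {
        char msg[128];

        /* regerror truncates the message to fit the buffer if necessary. */
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp(\"%s\"): %s\n", pattern, msg);
        return -1;              /* compilation failed, nothing to free */
    }

    nsub = (long)re.re_nsub;    /* number of parenthesized subexpressions */
    regfree(&re);               /* release storage held by the compiled RE */
    return nsub;
}
```

For instance, count_subexpressions("(a)(b(c))", REG_EXTENDED) should yield 3, since re_nsub counts every parenthesized subexpression, nested or not.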
+
+     Regexec matches the compiled RE pointed to by preg against
+     the string, subject to the flags in eflags, and reports
+     results using nmatch, pmatch, and the returned value. The
+     RE must have been compiled by a previous invocation of
+     regcomp. The compiled form is not altered during execution
+     of regexec, so a single compiled RE can be used simultane-
+     ously by multiple threads.
+
+     By default, the NUL-terminated string pointed to by string
+     is considered to be the text of an entire line, minus any
+     terminating newline. The eflags argument is the bitwise OR
+     of zero or more of the following flags:
+
+     REG_NOTBOL    The first character of the string is not the
+                   beginning of a line, so the `^' anchor should
+                   not match before it. This does not affect the
+                   behavior of newlines under REG_NEWLINE.
+
+     REG_NOTEOL    The NUL terminating the string does not end a
+                   line, so the `$' anchor should not match
+                   before it. This does not affect the behavior
+                   of newlines under REG_NEWLINE.
+
+     REG_STARTEND  The string is considered to start at string +
+                   pmatch[0].rm_so and to have a terminating NUL
+                   located at string + pmatch[0].rm_eo (there
+                   need not actually be a NUL at that location),
+                   regardless of the value of nmatch. See below
+                   for the definition of pmatch and nmatch. This
+                   is an extension, compatible with but not
+                   specified by POSIX 1003.2, and should be used
+                   with caution in software intended to be port-
+                   able to other systems. Note that a non-zero
+                   rm_so does not imply REG_NOTBOL; REG_STARTEND
+                   affects only the location of the string, not
+                   how it is matched.
+
+     See re_format(7) for a discussion of what is matched in
+     situations where an RE or a portion thereof could match any
+     of several substrings of string.
+
+     Normally, regexec returns 0 for success and the non-zero
+     code REG_NOMATCH for failure.
+     Other non-zero error codes
+     may be returned in exceptional situations; see DIAGNOSTICS.
+
+     If REG_NOSUB was specified in the compilation of the RE, or
+     if nmatch is 0, regexec ignores the pmatch argument (but see
+     below for the case where REG_STARTEND is specified). Other-
+     wise, pmatch points to an array of nmatch structures of type
+     regmatch_t. Such a structure has at least the members rm_so
+     and rm_eo, both of type regoff_t (a signed arithmetic type
+     at least as large as an off_t and a ssize_t), containing
+     respectively the offset of the first character of a sub-
+     string and the offset of the first character after the end
+     of the substring. Offsets are measured from the beginning
+     of the string argument given to regexec. An empty substring
+     is denoted by equal offsets, both indicating the character
+     following the empty substring.
+
+     The 0th member of the pmatch array is filled in to indicate
+     what substring of string was matched by the entire RE.
+     Remaining members report what substring was matched by
+     parenthesized subexpressions within the RE; member i reports
+     subexpression i, with subexpressions counted (starting at 1)
+     by the order of their opening parentheses in the RE, left to
+     right. Unused entries in the array--corresponding either to
+     subexpressions that did not participate in the match at all,
+     or to subexpressions that do not exist in the RE (that is,
+     i > preg->re_nsub)--have both rm_so and rm_eo set to -1. If
+     a subexpression participated in the match several times, the
+     reported substring is the last one it matched. (Note, as an
+     example in particular, that when the RE `(b*)+' matches
+     `bbb', the parenthesized subexpression matches each of the
+     three `b's and then an infinite number of empty strings fol-
+     lowing the last `b', so the reported substring is one of the
+     empties.)
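The pmatch reporting rules above can be sketched in C as follows. This is an illustrative example under stated assumptions, not code from the library: the helper names match_offsets and match_length are hypothetical, and only the POSIX-specified calls and the rm_so/rm_eo members are used. The sketch compiles an extended RE, runs it once, and leaves the offsets in the caller's array; entries for subexpressions that did not participate, or that do not exist in the RE, are expected to hold -1.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: compile an extended RE, match it once against
 * `string`, and fill pm[0..n-1]. Returns 0 on a match, REG_NOMATCH if
 * nothing matched, or the regcomp error code for an invalid pattern. */
int match_offsets(const char *pattern, const char *string,
                  regmatch_t pm[], size_t n)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && string != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, string, n, pm, 0);    /* eflags = 0: a whole line */
    regfree(&re);
    return rc;
}

/* Hypothetical helper: length of the substring reported in *m,
 * or -1 when the entry is unused (rm_so == -1). */
long match_length(const regmatch_t *m)
{
    if (m->rm_so == -1)
        return -1;
    return (long)(m->rm_eo - m->rm_so);
}
```

Matching `(a+)(x)?(b+)' against `zaabbb' with an array of five entries should report offsets 1..6 in pm[0], 1..3 in pm[1], and 3..6 in pm[3]. Both pm[2] (the optional group took no part in the match) and pm[4] (no such subexpression) should hold -1 in both members.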
+
+     If REG_STARTEND is specified, pmatch must point to at least
+     one regmatch_t (even if nmatch is 0 or REG_NOSUB was speci-
+     fied), to hold the input offsets for REG_STARTEND. Use for
+     output is still entirely controlled by nmatch; if nmatch is
+     0 or REG_NOSUB was specified, the value of pmatch[0] will
+     not be changed by a successful regexec.
+
+     Regerror maps a non-zero errcode from either regcomp or
+     regexec to a human-readable, printable message. If preg is
+     non-NULL, the error code should have arisen from use of the
+     regex_t pointed to by preg, and if the error code came from
+     regcomp, it should have been the result from the most recent
+     regcomp using that regex_t. (Regerror may be able to supply
+     a more detailed message using information from the regex_t.)
+     Regerror places the NUL-terminated message into the buffer
+     pointed to by errbuf, limiting the length (including the
+     NUL) to at most errbuf_size bytes. If the whole message
+     won't fit, as much of it as will fit before the terminating
+     NUL is supplied. In any case, the returned value is the
+     size of buffer needed to hold the whole message (including
+     terminating NUL). If errbuf_size is 0, errbuf is ignored
+     but the return value is still correct.
+
+     If the errcode given to regerror is first ORed with
+     REG_ITOA, the ``message'' that results is the printable name
+     of the error code, e.g. ``REG_NOMATCH'', rather than an
+     explanation thereof. If errcode is REG_ATOI, then preg
+     shall be non-NULL and the re_endp member of the structure it
+     points to must point to the printable name of an error code;
+     in this case, the result in errbuf is the decimal digits of
+     the numeric value of the error code (0 if the name is not
+     recognized). REG_ITOA and REG_ATOI are intended primarily
+     as debugging facilities; they are extensions, compatible
+     with but not specified by POSIX 1003.2, and should be used
+     with caution in software intended to be portable to other
+     systems.
+     Be warned also that they are considered experimen-
+     tal and changes are possible.
+
+     Regfree frees any dynamically-allocated storage associated
+     with the compiled RE pointed to by preg. The remaining
+     regex_t is no longer a valid compiled RE and the effect of
+     supplying it to regexec or regerror is undefined.
+
+     None of these functions references global variables except
+     for tables of constants; all are safe for use from multiple
+     threads if the arguments are safe.
+
+IMPLEMENTATION CHOICES
+     There are a number of decisions that 1003.2 leaves up to the
+     implementor, either by explicitly saying ``undefined'' or by
+     virtue of them being forbidden by the RE grammar. This
+     implementation treats them as follows.
+
+     See re_format(7) for a discussion of the definition of
+     case-independent matching.
+
+     There is no particular limit on the length of REs, except
+     insofar as memory is limited. Memory usage is approximately
+     linear in RE size, and largely insensitive to RE complexity,
+     except for bounded repetitions. See BUGS for one short RE
+     using them that will run almost any system out of memory.
+
+     A backslashed character other than one specifically given a
+     magic meaning by 1003.2 (such magic meanings occur only in
+     obsolete [``basic''] REs) is taken as an ordinary character.
+
+     Any unmatched [ is a REG_EBRACK error.
+
+     Equivalence classes cannot begin or end bracket-expression
+     ranges. The endpoint of one range cannot begin another.
+
+     RE_DUP_MAX, the limit on repetition counts in bounded
+     repetitions, is 255.
+
+     A repetition operator (?, *, +, or bounds) cannot follow
+     another repetition operator. A repetition operator cannot
+     begin an expression or subexpression or follow `^' or `|'.
+
+     `|' cannot appear first or last in a (sub)expression or
+     after another `|', i.e. an operand of `|' cannot be an empty
+     subexpression.
+     An empty parenthesized subexpression, `()',
+     is legal and matches an empty (sub)string. An empty string
+     is not a legal RE.
+
+     A `{' followed by a digit is considered the beginning of
+     bounds for a bounded repetition, which must then follow the
+     syntax for bounds. A `{' not followed by a digit is con-
+     sidered an ordinary character.
+
+     `^' and `$' beginning and ending subexpressions in obsolete
+     (``basic'') REs are anchors, not ordinary characters.
+
+SEE ALSO
+     grep(1), re_format(7)
+
+     POSIX 1003.2, sections 2.8 (Regular Expression Notation) and
+     B.5 (C Binding for Regular Expression Matching).
+
+DIAGNOSTICS
+     Non-zero error codes from regcomp and regexec include the
+     following:
+
+     REG_NOMATCH   regexec() failed to match
+     REG_BADPAT    invalid regular expression
+     REG_ECOLLATE  invalid collating element
+     REG_ECTYPE    invalid character class
+     REG_EESCAPE   \ applied to unescapable character
+     REG_ESUBREG   invalid backreference number
+     REG_EBRACK    brackets [ ] not balanced
+     REG_EPAREN    parentheses ( ) not balanced
+     REG_EBRACE    braces { } not balanced
+     REG_BADBR     invalid repetition count(s) in { }
+     REG_ERANGE    invalid character range in [ ]
+     REG_ESPACE    ran out of memory
+     REG_BADRPT    ?, *, or + operand invalid
+     REG_EMPTY     empty (sub)expression
+     REG_ASSERT    ``can't happen''--you found a bug
+     REG_INVARG    invalid argument, e.g. negative-length string
+
+HISTORY
+     Originally written by Henry Spencer. Altered for inclusion
+     in the 4.4BSD distribution.
+
+BUGS
+     This is an alpha release with known defects. Please report
+     problems.
+
+     There is one known functionality bug. The implementation of
+     internationalization is incomplete: the locale is always
+     assumed to be the default one of 1003.2, and only the col-
+     lating elements etc. of that locale are available.
+
+     The back-reference code is subtle and doubts linger about
+     its correctness in complex cases.
+
+     Regexec performance is poor. This will improve with later
+     releases. Nmatch exceeding 0 is expensive; nmatch exceeding
+     1 is worse. Regexec is largely insensitive to RE complexity
+     except that back references are massively expensive. RE
+     length does matter; in particular, there is a strong speed
+     bonus for keeping RE length under about 30 characters, with
+     most special characters counting roughly double.
+
+     Regcomp implements bounded repetitions by macro expansion,
+     which is costly in time and space if counts are large or
+     bounded repetitions are nested. An RE like, say,
+     `((((a{1,100}){1,100}){1,100}){1,100}){1,100}' will (eventu-
+     ally) run almost any existing machine out of swap space.
+
+     There are suspected problems with response to obscure error
+     conditions. Notably, certain kinds of internal overflow,
+     produced only by truly enormous REs or by multiply nested
+     bounded repetitions, are probably not handled well.
+
+     Due to a mistake in 1003.2, things like `a)b' are legal REs
+     because `)' is a special character only in the presence of a
+     previous unmatched `('. This can't be fixed until the spec
+     is fixed.
+
+     The standard's definition of back references is vague. For
+     example, does `a\(\(b\)*\2\)*d' match `abbbd'? Until the
+     standard is clarified, behavior in such cases should not be
+     relied on.
+
+     The implementation of word-boundary matching is a bit of a
+     kludge, and bugs may lurk in combinations of word-boundary
+     matching and anchoring.