Diffstat (limited to 'regex/re_format.doc')
-rw-r--r-- regex/re_format.doc 264
1 files changed, 264 insertions, 0 deletions
diff --git a/regex/re_format.doc b/regex/re_format.doc
new file mode 100644
index 00000000..7a8acd7d
--- /dev/null
+++ b/regex/re_format.doc
@@ -0,0 +1,264 @@
+
+
+
+RE_FORMAT(7) Device and Network Interfaces RE_FORMAT(7)
+
+
+
+NAME
+ re_format - POSIX 1003.2 regular expressions
+
+DESCRIPTION
+ Regular expressions (``RE''s), as defined in POSIX 1003.2,
+ come in two forms: modern REs (roughly those of egrep;
+ 1003.2 calls these ``extended'' REs) and obsolete REs
+ (roughly those of ed; 1003.2 ``basic'' REs). Obsolete REs
+ mostly exist for backward compatibility in some old pro-
+ grams; they will be discussed at the end. 1003.2 leaves
+ some aspects of RE syntax and semantics open; a trailing `-'
+ (as in ``one-'') marks decisions on these aspects that may not
+ be fully portable to other 1003.2 implementations.
+
+ A (modern) RE is one- or more non-empty- branches, separated
+ by `|'. It matches anything that matches one of the
+ branches.
+
+ A branch is one- or more pieces, concatenated. It matches a
+ match for the first, followed by a match for the second,
+ etc.
+
+ A piece is an atom possibly followed by a single- `*', `+',
+ `?', or bound. An atom followed by `*' matches a sequence
+ of 0 or more matches of the atom. An atom followed by `+'
+ matches a sequence of 1 or more matches of the atom. An
+ atom followed by `?' matches a sequence of 0 or 1 matches of
+ the atom.
+
+ A bound is `{' followed by an unsigned decimal integer, pos-
+ sibly followed by `,' possibly followed by another unsigned
+ decimal integer, always followed by `}'. The integers must
+ lie between 0 and RE_DUP_MAX (255-) inclusive, and if there
+ are two of them, the first may not exceed the second. An
+ atom followed by a bound containing one integer i and no
+ comma matches a sequence of exactly i matches of the atom.
+ An atom followed by a bound containing one integer i and a
+ comma matches a sequence of i or more matches of the atom.
+ An atom followed by a bound containing two integers i and j
+ matches a sequence of i through j (inclusive) matches of the
+ atom.
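The bound rules above can be checked with a small C sketch. It assumes a POSIX `<regex.h>` (the interface described in regex(3)); the helper name `matches_whole` is hypothetical and is not part of this package.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the modern (extended) RE matches the whole of `text`,
 * 0 if it does not, and -1 if the RE fails to compile. */
static int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, &m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    /* The reported match is the leftmost-longest one, so the RE covers
     * the whole string only if that match starts at 0 and ends at the
     * terminating NUL. */
    return m.rm_so == 0 && text[m.rm_eo] == '\0';
}
```

For instance, `matches_whole("ab{2,}c", "abbbbc")` should report a match, while `matches_whole("a{3,2}", "aaa")` should return -1, because the first integer of a bound may not exceed the second.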
+
+ An atom is a regular expression enclosed in `()' (matching a
+ match for the regular expression), an empty set of `()'
+ (matching the null string)-, a bracket expression (see
+ below), `.' (matching any single character), `^' (matching
+ the null string at the beginning of a line), `$' (matching
+ the null string at the end of a line), a `\' followed by one
+ of the characters `^.[$()|*+?{\' (matching that character
+ taken as an ordinary character), a `\' followed by any other
+ character- (matching that character taken as an ordinary
+ character, as if the `\' had not been present-), or a single
+ character with no other significance (matching that charac-
+ ter). A `{' followed by a character other than a digit is
+ an ordinary character, not the beginning of a bound-. It is
+ illegal to end an RE with `\'.
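One practical use of the escaping rule above is quoting an arbitrary string so that it matches itself literally in a modern RE. The sketch below is illustrative only: `ere_escape` and `ERE_SPECIALS` are hypothetical names, and the set of escaped characters is simply the list given in the paragraph above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters the text lists as special when they follow a backslash. */
static const char ERE_SPECIALS[] = "^.[$()|*+?{\\";

/* Copy `src` into `dst`, putting a backslash in front of every special
 * character, so that the result matches `src` literally as a modern RE.
 * Returns 0 on success, or -1 if `dst` (of size `dstsize`) is too small. */
static int ere_escape(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        size_t need = (strchr(ERE_SPECIALS, *src) != NULL) ? 2 : 1;

        /* Always keep one byte in reserve for the terminating NUL. */
        if (n + need + 1 > dstsize)
            return -1;
        if (need == 2)
            dst[n++] = '\\';
        dst[n++] = *src;
    }
    if (n + 1 > dstsize)
        return -1;
    dst[n] = '\0';
    return 0;
}
```

With a sufficiently large buffer, escaping `a.b*(c)` yields `a\.b\*\(c\)`. Since no ordinary character is escaped, the output never relies on the non-portable "backslash followed by any other character" case.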
+
+ A bracket expression is a list of characters enclosed in
+ `[]'. It normally matches any single character from the
+ list (but see below). If the list begins with `^', it
+ matches any single character (but see below) not from the
+ rest of the list. If two characters in the list are
+ separated by `-', this is shorthand for the full range of
+ characters between those two (inclusive) in the collating
+ sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
+ It is illegal- for two ranges to share an endpoint, e.g.
+ `a-c-e'. Ranges are very collating-sequence-dependent, and
+ portable programs should avoid relying on them.
+
+ To include a literal `]' in the list, make it the first
+ character (following a possible `^'). To include a literal
+ `-', make it the first or last character, or the second end-
+ point of a range. To use a literal `-' as the first end-
+ point of a range, enclose it in `[.' and `.]' to make it a
+ collating element (see below). With the exception of these
+ and some combinations using `[' (see next paragraphs), all
+ other special characters, including `\', lose their special
+ significance within a bracket expression.
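The placement rules for `]', `-' and `\' inside a bracket expression are easy to get wrong, so here is a minimal C sketch that exercises them. It assumes a POSIX `<regex.h>`; `re_found` is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the modern RE matches somewhere in `text`, 0 if it does
 * not, and -1 if the RE fails to compile. */
static int re_found(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

In `^[]x-]+$` the list holds three literal characters (`]` placed first, `x`, and `-` placed last), so it should match the string `]-x`. In `^[\.]$` (written `"^[\\.]$"` in C source) the list holds a literal backslash and a literal period, so it should match a one-character string consisting of a backslash.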
+
+ Within a bracket expression, a collating element (a charac-
+ ter, a multi-character sequence that collates as if it were
+ a single character, or a collating-sequence name for either)
+ enclosed in `[.' and `.]' stands for the sequence of charac-
+ ters of that collating element. The sequence is a single
+ element of the bracket expression's list. A bracket expres-
+ sion containing a multi-character collating element can thus
+ match more than one character, e.g. if the collating
+ sequence includes a `ch' collating element, then the RE
+ `[[.ch.]]*c' matches the first five characters of `chchcc'.
+
+ Within a bracket expression, a collating element enclosed in
+ `[=' and `=]' is an equivalence class, standing for the
+ sequences of characters of all collating elements equivalent
+ to that one, including itself. (If there are no other
+ equivalent collating elements, the treatment is as if the
+ enclosing delimiters were `[.' and `.]'.) For example, if o
+ and ô are the members of an equivalence class, then
+ `[[=o=]]', `[[=ô=]]', and `[oô]' are all synonymous. An
+ equivalence class may not- be an endpoint of a range.
+
+ Within a bracket expression, the name of a character class
+ enclosed in `[:' and `:]' stands for the list of all charac-
+ ters belonging to that class. Standard character class
+ names are:
+
+ alnum   digit   punct
+ alpha   graph   space
+ blank   lower   upper
+ cntrl   print   xdigit
+
+ These stand for the character classes defined in ctype(3).
+ A locale may provide others. A character class may not be
+ used as an endpoint of a range.
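As an illustration of character classes, the following C sketch counts non-overlapping matches of a modern RE. It assumes a POSIX `<regex.h>` and the default "C" locale; `count_matches` is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Count non-overlapping matches of the modern RE `pattern` in `text`.
 * Returns -1 if the RE fails to compile.  Each search restarts at the
 * end of the previous match, so this sketch is only meant for patterns
 * without `^' or `$' anchors. */
static int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int count = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (*text != '\0' && regexec(&re, text, 1, &m, 0) == 0) {
        count++;
        /* Step past the match; advance at least one character so that
         * an empty match cannot loop forever. */
        text += (m.rm_eo > 0) ? m.rm_eo : 1;
    }
    regfree(&re);
    return count;
}
```

For example, `count_matches("[[:digit:]]+", "a1 22 b333")` should give 3. A class name the locale does not define, such as the made-up `[[:bogus:]]`, is expected to be rejected at compile time.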
+
+ There are two special cases- of bracket expressions: the
+ bracket expressions `[[:<:]]' and `[[:>:]]' match the null
+ string at the beginning and end of a word respectively. A
+ word is defined as a sequence of word characters which is
+ neither preceded nor followed by word characters. A word
+ character is an alnum character (as defined by ctype(3)) or
+ an underscore. This is an extension, compatible with but
+ not specified by POSIX 1003.2, and should be used with cau-
+ tion in software intended to be portable to other systems.
+
+ In the event that an RE could match more than one substring
+ of a given string, the RE matches the one starting earliest
+ in the string. If the RE could match more than one sub-
+ string starting at that point, it matches the longest.
+ Subexpressions also match the longest possible substrings,
+ subject to the constraint that the whole match be as long as
+ possible, with subexpressions starting earlier in the RE
+ taking priority over ones starting later. Note that
+ higher-level subexpressions thus take priority over their
+ lower-level component subexpressions.
+
+ Match lengths are measured in characters, not collating ele-
+ ments. A null string is considered longer than no match at
+ all. For example, `bb*' matches the three middle characters
+ of `abbbc', `(wee|week)(knights|nights)' matches all ten
+ characters of `weeknights', when `(.*).*' is matched against
+ `abc' the parenthesized subexpression matches all three
+ characters, and when `(a*)*' is matched against `bc' both
+ the whole RE and the parenthesized subexpression match the
+ null string.
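The leftmost-longest examples above can be reproduced by inspecting the offsets that regexec() reports. This is a sketch assuming a POSIX `<regex.h>`; `struct span`, `span_of` and `MAX_GROUPS` are hypothetical names.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 4

/* Half-open character range [start, end) of a match; -1/-1 if none. */
struct span {
    long start;
    long end;
};

/* Match the modern RE `pattern` against `text` and return the span of
 * subexpression `group` (0 is the whole match).  A span of -1/-1 means
 * no match, a compile error, or a group that did not participate. */
static struct span span_of(const char *pattern, const char *text,
                           size_t group)
{
    struct span none = { -1, -1 };
    struct span result;
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int rc;

    assert(pattern != NULL && text != NULL && group < MAX_GROUPS);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return none;
    rc = regexec(&re, text, MAX_GROUPS, m, 0);
    regfree(&re);
    if (rc != 0)
        return none;
    result.start = (long)m[group].rm_so;
    result.end = (long)m[group].rm_eo;
    return result;
}
```

With this helper, `span_of("bb*", "abbbc", 0)` should be 1..4 and `span_of("(wee|week)(knights|nights)", "weeknights", 0)` should be 0..10. How the ten characters are divided between the two subexpressions is governed by the priority rule above, and it is worth verifying on the implementation in use rather than assuming.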
+
+ If case-independent matching is specified, the effect is
+ much as if all case distinctions had vanished from the
+ alphabet. When an alphabetic that exists in multiple cases
+ appears as an ordinary character outside a bracket expres-
+ sion, it is effectively transformed into a bracket expres-
+ sion containing both cases, e.g. `x' becomes `[xX]'. When
+ it appears inside a bracket expression, all case counter-
+ parts of it are added to the bracket expression, so that
+ (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes `[^xX]'.
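The case-independent rule, including its effect on negated bracket expressions, can be observed with REG_ICASE. This is a sketch assuming a POSIX `<regex.h>`; `re_test` is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile `pattern` with the given regcomp() flags and report whether
 * it matches somewhere in `text`: 1 for a match, 0 for no match, and
 * -1 for a compile error. */
static int re_test(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under `REG_EXTENDED | REG_ICASE`, `^[^x]$` should reject both `x` and `X`, which is the `[^xX]` behaviour described above. Without REG_ICASE the same RE accepts `X`.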
+
+ No particular limit is imposed on the length of REs-. Pro-
+ grams intended to be portable should not employ REs longer
+ than 256 bytes, as an implementation can refuse to accept
+ such REs and remain POSIX-compliant.
+
+ Obsolete (``basic'') regular expressions differ in several
+ respects. `|', `+', and `?' are ordinary characters and
+ there is no equivalent for their functionality. The delim-
+ iters for bounds are `\{' and `\}', with `{' and `}' by
+ themselves ordinary characters. The parentheses for nested
+ subexpressions are `\(' and `\)', with `(' and `)' by them-
+ selves ordinary characters. `^' is an ordinary character
+ except at the beginning of the RE or- the beginning of a
+ parenthesized subexpression, `$' is an ordinary character
+ except at the end of the RE or- the end of a parenthesized
+ subexpression, and `*' is an ordinary character if it
+ appears at the beginning of the RE or the beginning of a
+ parenthesized subexpression (after a possible leading `^').
+ Finally, there is one new type of atom, a back reference:
+ `\' followed by a non-zero decimal digit d matches the same
+ sequence of characters matched by the dth parenthesized
+ subexpression (numbering subexpressions by the positions of
+ their opening parentheses, left to right), so that (e.g.)
+ `\([bc]\)\1' matches `bb' or `cc' but not `bc'.
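The differences listed for obsolete REs can be seen by compiling without REG_EXTENDED, which is how regcomp() selects the basic syntax. This is a sketch assuming a POSIX `<regex.h>`; `bre_found` is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the obsolete (basic) RE matches somewhere in `text`,
 * 0 if it does not, and -1 if the RE fails to compile. */
static int bre_found(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* A cflags value of 0 (no REG_EXTENDED) selects basic REs. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 1, &m, 0);
    regfree(&re);
    return rc == 0;
}
```

Here `^\([bc]\)\1$` (written `"^\\([bc]\\)\\1$"` in C source) should accept `bb` and `cc` but not `bc`, and `^a+$` should match only the two-character string `a+`, since `+` is an ordinary character in a basic RE.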
+
+SEE ALSO
+ regex(3)
+
+ POSIX 1003.2, section 2.8 (Regular Expression Notation).
+
+BUGS
+ Having two kinds of REs is a botch.
+
+ The current 1003.2 spec says that `)' is an ordinary charac-
+ ter in the absence of an unmatched `('; this was an uninten-
+ tional result of a wording error, and change is likely.
+ Avoid relying on it.
+
+ Back references are a dreadful botch, posing major problems
+ for efficient implementations. They are also somewhat
+ vaguely defined (does `a\(\(b\)*\2\)*d' match `abbbd'?).
+ Avoid using them.
+
+ 1003.2's specification of case-independent matching is
+ vague. The ``one case implies all cases'' definition given
+ above is current consensus among implementors as to the
+ right interpretation.
+
+ The syntax for word boundaries is incredibly ugly.
+
+SunOS 5.5 Last change: March 20, 1994 4
+
+
+