From be6aea6fb9a1c3923f871b4fc01e34cb70704fda Mon Sep 17 00:00:00 2001 From: Jim Meyering Date: Sat, 22 May 1999 12:52:41 +0000 Subject: Document locale-specific changes to `sort', as well as the new, POSIX-compliant definition of line comparison, and -g's more careful treatment of NaNs, infinities and zeros. --- doc/textutils.texi | 96 +++++++++++++++++++++++++++++++++++------------------- 1 file changed, 62 insertions(+), 34 deletions(-) (limited to 'doc') diff --git a/doc/textutils.texi b/doc/textutils.texi index d1fdaffc7..044dac054 100644 --- a/doc/textutils.texi +++ b/doc/textutils.texi @@ -44,7 +44,7 @@ START-INFO-DIR-ENTRY * tsort: (textutils)tsort invocation. Topological sort. * tr: (textutils)tr invocation. Translate characters. * unexpand: (textutils)unexpand invocation. Convert spaces to tabs. -* uniq: (textutils)uniq invocation. Uniqify files. +* uniq: (textutils)uniq invocation. Uniquify files. * wc: (textutils)wc invocation. Byte, word, and line counts. END-INFO-DIR-ENTRY @end format @@ -161,7 +161,7 @@ Summarizing files Operating on sorted files * sort invocation:: Sort text files. -* uniq invocation:: Uniqify files. +* uniq invocation:: Uniquify files. * comm invocation:: Compare two sorted files line by line. * ptx invocation:: Produce a permuted index of file contents. * tsort invocation:: Topological sort. @@ -672,7 +672,7 @@ Output at most @var{bytes} bytes of the input. Prefixes and suffixes on @opindex --strings @cindex string constants, outputting Instead of the normal output, output only @dfn{string constants}: at -least @var{n} (3 by default) consecutive ASCII graphic characters, +least @var{n} (3 by default) consecutive @sc{ASCII} graphic characters, followed by a null (zero) byte. @item -t @var{type} @@ -687,14 +687,14 @@ of each output line using each of the data types that you specified, in the order that you specified. 
Adding a trailing ``z'' to any type specification appends a display -of the ASCII character representation of the printable characters +of the @sc{ASCII} character representation of the printable characters to the output line generated by the type specification. @table @samp @item a named character, @item c -ASCII character or backslash escape, +@sc{ASCII} character or backslash escape, @item d signed decimal, @item f @@ -779,7 +779,7 @@ Output as octal bytes. Equivalent to @samp{-toC}. @item -c @opindex -c -Output as ASCII characters or backslash escapes. Equivalent to +Output as @sc{ASCII} characters or backslash escapes. Equivalent to @samp{-tc}. @item -d @@ -1998,7 +1998,7 @@ These commands work with (or produce) sorted files. @menu * sort invocation:: Sort text files. -* uniq invocation:: Uniqify files. +* uniq invocation:: Uniquify files. * comm invocation:: Compare two sorted files line by line. * ptx invocation:: Produce a permuted index of file contents. * tsort invocation:: Topological sort. @@ -2043,18 +2043,21 @@ works. @end table +@vindex LC_COLLATE A pair of lines is compared as follows: if any key fields have been specified, @code{sort} compares each pair of fields, in the order specified on the command line, according to the associated ordering options, until a difference is found or no fields are left. +Unless otherwise specified, all comparisons use the character +collating sequence specified by the @env{LC_COLLATE} locale. If any of the global options @samp{Mbdfinr} are given but no key fields are specified, @code{sort} compares the entire lines according to the global options. Finally, as a last resort when all keys compare equal (or if no -ordering options were specified at all), @code{sort} compares the lines -byte by byte in machine collating sequence. The last resort comparison +ordering options were specified at all), @code{sort} compares the entire +lines. The last resort comparison honors the @samp{-r} global option. 
The @samp{-s} (stable) option disables this last-resort comparison so that lines in which all fields compare equal are left in their original relative order. If no fields @@ -2063,7 +2066,10 @@ or global options are specified, @samp{-s} has no effect. GNU @code{sort} (as specified for all GNU utilities) has no limits on input line length or restrictions on bytes allowed within lines. In addition, if the final byte of an input file is not a newline, GNU -@code{sort} silently supplies one. +@code{sort} silently supplies one. A line's trailing newline is part of +the line for comparison purposes; for example, with no options in an +@sc{ASCII} locale, a line starting with a tab sorts before an empty line +because tab precedes newline in the @sc{ASCII} collating sequence. Upon any error, @code{sort} exits with a status of @samp{2}. @@ -2073,11 +2079,14 @@ value as the directory for temporary files instead of @file{/tmp}. The @samp{-T @var{tempdir}} option in turn overrides the environment variable. +@vindex LC_CTYPE The following options affect the ordering of output lines. They may be specified globally or as part of a specific key field. If no key fields are specified, global options apply to comparison of entire lines; otherwise the global options are inherited by key fields that do -not specify any special options of their own. +not specify any special options of their own. The @samp{-b}, @samp{-d}, +@samp{-f} and @samp{-i} options classify characters according to +the @env{LC_CTYPE} locale. @table @samp @@ -2102,40 +2111,59 @@ sorting so that, for example, @samp{b} and @samp{B} sort as equal. @item -g @opindex -g @cindex general numeric sort -Sort numerically, but use strtod(3) to arrive at the numeric values. +Sort numerically, using the standard C function @code{strtod} to convert +a prefix of each line to a double-precision floating point number. This allows floating point numbers to be specified in scientific notation, -like @code{1.0e-34} and @code{10e100}. 
Use this option only if there -is no alternative; it is much slower than @samp{-n} and numbers with -too many significant digits will be compared as if they had been -truncated. In addition, numbers outside the range of representable -double precision floating point numbers are treated as if they were -zeroes; overflow and underflow are not reported. +like @code{1.0e-34} and @code{10e100}. +Do not report overflow, underflow, or conversion errors. +Use the following collating sequence: + +@itemize @bullet +@item +Lines that do not start with numbers (all considered to be equal). +@item +NaNs (``Not a Number'' values, in IEEE floating point arithmetic) +in a consistent but machine-dependent order. +@item +Minus infinity. +@item +Finite numbers in ascending numeric order (with @math{-0} and @math{+0} equal). +@item +Plus infinity. +@end itemize + +Use this option only if there is no alternative; it is much slower than +@samp{-n} and it can lose information when converting to floating point. @item -i @opindex -i @cindex unprintable characters, ignoring -Ignore characters outside the printable ASCII range 040-0176 octal -(inclusive) when sorting. +Ignore unprintable characters. @item -M @opindex -M @cindex months, sorting by +@vindex LC_TIME An initial string, consisting of any amount of whitespace, followed -by three letters abbreviating a month name, is folded to UPPER case and +by a month name abbreviation, is folded to UPPER case and compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}. -Invalid names compare low to valid names. +Invalid names compare low to valid names. The @env{LC_TIME} locale +determines the month spellings. @item -n @opindex -n @cindex numeric sort +@vindex LC_NUMERIC Sort numerically: the number begins each line; specifically, it consists of optional whitespace, an optional @samp{-} sign, and zero or more -digits, optionally followed by a decimal point and zero or more digits. 
+digits possibly separated by thousands separators, optionally followed +by a radix character and zero or more digits. The @env{LC_NUMERIC} +locale specifies the radix character and thousands separator. @code{sort -n} uses what might be considered an unconventional method to compare strings representing floating point numbers. Rather than first converting each string to the C @code{double} type and then -comparing those values, sort aligns the decimal points in the two +comparing those values, sort aligns the radix characters in the two strings and compares the strings a character at a time. One benefit of using this approach is its speed. In practice this is much more efficient than performing the two corresponding string-to-double (or even @@ -2180,7 +2208,7 @@ following. @item -u @opindex -u -@cindex uniqifying output +@cindex uniquifying output For the default case or the @samp{-m} option, only output the first of a sequence of lines that compare equal. For the @samp{-c} option, check that no pair of consecutive lines compares equal. @@ -2199,7 +2227,7 @@ See below for more examples. @opindex -z @cindex sort zero-terminated lines Treat the input as a set of lines, each terminated by a zero byte (@sc{ASCII} -@sc{NUL} (Null) character) instead of a @sc{ASCII} @sc{LF} (Line Feed.) +@sc{NUL} (Null) character) instead of an @sc{ASCII} @sc{LF} (Line Feed). 
This option can be useful in conjunction with @samp{perl -0} or @samp{find -print0} and @samp{xargs -0} which do the same in order to reliably handle arbitrary pathnames (even those which contain Line Feed @@ -2342,10 +2370,10 @@ sort -t : -b -k 5,5 -k 3,3n /etc/passwd @node uniq invocation -@section @code{uniq}: Uniqify files +@section @code{uniq}: Uniquify files @pindex uniq -@cindex uniqify files +@cindex uniquify files @code{uniq} writes the unique lines in the given @file{input}, or standard input if nothing is given or for an @var{input} name of @@ -2618,7 +2646,7 @@ As it is setup now, the program assumes that the input file is coded using 8-bit ISO 8859-1 code, also known as Latin-1 character set, @emph{unless} if it is compiled for MS-DOS, in which case it uses the character set of the IBM-PC. (GNU @code{ptx} is not known to work on -smaller MS-DOS machines anymore.) Compared to 7-bit ASCII, the set of +smaller MS-DOS machines anymore.) Compared to 7-bit @sc{ASCII}, the set of characters which are letters is then different, this fact alters the behaviour of regular expression matching. Thus, the default regular expression for a keyword allows foreign or diacriticized letters. @@ -2907,7 +2935,7 @@ sequence @code{^\@{ @}} and @code{~\@{ @}} respectively. Other diacriticized characters of the underlying character set produce an appropriate @TeX{} sequence as far as possible. The other non-graphical characters, like newline and tab, and all others characters which are -not part of ASCII, are merely changed to exactly one space, with no +not part of @sc{ASCII}, are merely changed to exactly one space, with no special attempt to compress consecutive spaces. Let me know how to improve this special character processing for @TeX{}. @@ -3842,8 +3870,8 @@ yourself using when setting up fancy data plumbing. The @code{sort} command reads and sorts each file named on the command line. It then merges the sorted data and writes it to standard output. 
It will read standard input if no files are given on the command line (thus -making it into a filter). The sort is based on the machine collating -sequence (@sc{ASCII}) or based on user-supplied ordering criteria. +making it into a filter). The sort is based on the character collating +sequence or based on user-supplied ordering criteria. @node The uniq command @@ -4019,7 +4047,7 @@ $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ... The second @code{tr} command operates on the complement of the listed characters, which are all the letters, the digits, the underscore, and the blank. The @samp{\012} represents the newline character; it has to -be left alone. (The ASCII TAB character should also be included for +be left alone. (The @sc{ASCII} tab character should also be included for good measure in a production script.) At this point, we have data consisting of words separated by blank space. @@ -4065,7 +4093,7 @@ with the help of two more @code{sort} options: @table @samp @item -n -do a numeric sort, not an ASCII one +do a numeric sort, not a textual one @item -r reverse the order of the sort
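Reviewer note: the new @samp{-g} collating sequence documented in this patch can be approximated outside of @code{sort} as a sort key. The sketch below is only an illustration, not the code in @code{sort} itself. The name @code{general_numeric_key} and the regular expression are hypothetical; the expression only roughly imitates the prefix that @code{strtod} would accept (hexadecimal floats are left out), and all NaNs are treated as equal, where the patch says only that their order is consistent but machine-dependent.

```python
import math
import re

# Rough imitation of the prefix strtod() accepts: optional whitespace and sign,
# then "inf"/"infinity"/"nan" or a decimal number with an optional exponent.
# Assumption: hexadecimal floats and "nan(...)" payloads are not handled.
_NUMERIC_PREFIX = re.compile(
    r"\s*[+-]?(?:inf(?:inity)?|nan|(?:\d+\.?\d*|\.\d+)(?:e[+-]?\d+)?)",
    re.IGNORECASE,
)


def general_numeric_key(line):
    """Hypothetical sort key following the -g ordering described in the patch."""
    match = _NUMERIC_PREFIX.match(line)
    if match is None:
        # Lines that do not start with a number: all equal, sorted first.
        return (0, 0.0)
    value = float(match.group(0))
    if math.isnan(value):
        # NaNs come next; treated as equal here (a simplification).
        return (1, 0.0)
    # Minus infinity, finite numbers, plus infinity: ordinary float ordering.
    # -0.0 and +0.0 compare equal, as the patch requires.
    return (2, value)


if __name__ == "__main__":
    lines = ["10e100", "abc", "-inf", "nan", "1.0e-34", "+inf", "-0", "0"]
    for line in sorted(lines, key=general_numeric_key):
        print(line)
```

Python's @code{sorted} is stable, so lines with equal keys keep their input order. That corresponds to @samp{sort -g -s}. The last-resort whole-line comparison that the patch describes for the non-stable case is not modelled.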