author Eduardo Chappa <chappa@washington.edu> 2013-11-02 02:51:18 -0600
committer Eduardo Chappa <chappa@washington.edu> 2013-11-02 02:51:18 -0600
commit 7fe712882b909931088a318c08041b0e7974a000
tree 2770f9b084e2efc7fc55e96e9bf4352cf2ff33a3 /imap/docs/rfc/rfc5051.txt
parent bdfc834badee92ceeb2befe02f1d065ced5b9ddf
* Update to version 2.19.1
* Upgrade UW-IMAP to Panda IMAP from https://github.com/jonabbey/panda-imap.
* Replace tabs with spaces in the From and Subject fields to control the on-screen width of these fields. This changes only the index screen display.
Diffstat (limited to 'imap/docs/rfc/rfc5051.txt')
-rw-r--r-- imap/docs/rfc/rfc5051.txt 395
1 files changed, 395 insertions, 0 deletions
diff --git a/imap/docs/rfc/rfc5051.txt b/imap/docs/rfc/rfc5051.txt
new file mode 100644
index 00000000..0a4479ca
--- /dev/null
+++ b/imap/docs/rfc/rfc5051.txt
@@ -0,0 +1,395 @@
+
+
+
+
+
+
+Network Working Group M. Crispin
+Request for Comments: 5051 University of Washington
+Category: Standards Track October 2007
+
+
+ i;unicode-casemap - Simple Unicode Collation Algorithm
+
+Status of This Memo
+
+ This document specifies an Internet standards track protocol for the
+ Internet community, and requests discussion and suggestions for
+ improvements. Please refer to the current edition of the "Internet
+ Official Protocol Standards" (STD 1) for the standardization state
+ and status of this protocol. Distribution of this memo is unlimited.
+
+Abstract
+
+ This document describes "i;unicode-casemap", a simple case-
+ insensitive collation for Unicode strings. It provides equality,
+ substring, and ordering operations.
+
+1. Introduction
+
+ The "i;ascii-casemap" collation described in [COMPARATOR] is quite
+ simple to implement and provides case-independent comparisons for the
+ 26 Latin alphabetics. It is specified as the default and/or baseline
+ comparator in some application protocols, e.g., [IMAP-SORT].
+
+ However, the "i;ascii-casemap" collation does not produce
+ satisfactory results with non-ASCII characters. It is possible, with
+ a modest extension, to provide a more sophisticated collation with
+ greater multilingual applicability than "i;ascii-casemap". This
+ extension provides case-independent comparisons for a much greater
+ number of characters. It also collates characters with diacriticals
+ with the non-diacritical character forms.
+
+ This collation, "i;unicode-casemap", is intended to be an alternative
+ to, and preferred over, "i;ascii-casemap". It does not replace the
+ "i;basic" collation described in [BASIC].
+
+2. Unicode Casemap Collation Description
+
+ The "i;unicode-casemap" collation is a simple collation which is
+ case-insensitive in its treatment of characters. It provides
+ equality, substring, and ordering operations. The validity test
+ operation returns "valid" for any input.
+
+
+
+
+
+Crispin Standards Track [Page 1]
+
+RFC 5051 i;unicode-casemap October 2007
+
+
+ This collation allows strings in arbitrary (and mixed) character
+ sets, as long as the character set for each string is identified and
+ it is possible to convert the string to Unicode. Strings which have
+ an unidentified character set and/or cannot be converted to Unicode
+ are not rejected, but are treated as binary.
+
+ Each input string is prepared by converting it to a "titlecased
+ canonicalized UTF-8" string according to the following steps, using
+ UnicodeData.txt ([UNICODE-DATA]):
+
+ (1) A Unicode codepoint is obtained from the input string.
+
+ (a) If the input string is in a known charset that can be
+ converted to Unicode, a sequence in the string's charset
+ is read and checked for validity according to the rules of
+ that charset. If the sequence is valid, it is converted
+ to a Unicode codepoint. Note that for input strings in
+ UTF-8, the UTF-8 sequence must be valid according to the
+ rules of [UTF-8]; e.g., overlong UTF-8 sequences are
+ invalid.
+
+ (b) If the input string is in an unknown charset, or an
+ invalid sequence occurs in step (1)(a), conversion ceases.
+ No further preparation is performed, and any partial
+ preparation results are discarded. The original string is
+ used unchanged with the i;octet comparator.
+
+ (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
+ are performed on the resulting codepoint from step (1)(a).
+
+ (a) If the codepoint has a titlecase property in
+ UnicodeData.txt (this is normally the same as the
+ uppercase property), the codepoint is converted to the
+ codepoints in the titlecase property.
+
+ (b) If the resulting codepoint from (2)(a) has a decomposition
+ property of any type in UnicodeData.txt, the codepoint is
+ converted to the codepoints in the decomposition property.
+ This step is recursively applied to each of the resulting
+ codepoints until no more decomposition is possible
+ (effectively Normalization Form KD).
+
+ Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
+ has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
+ WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a
+ decomposition property of U+0044 (LATIN CAPITAL LETTER D)
+ U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a
+ decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
+
+
+
+Crispin Standards Track [Page 2]
+
+RFC 5051 i;unicode-casemap October 2007
+
+
+ (COMBINING CARON). None of U+0044, U+007A, and U+030C has a
+ decomposition property. Therefore, U+01C4 is converted
+ to U+0044 U+007A U+030C by this step.
+
+ (3) The resulting codepoint(s) from step (2) is/are appended, in
+ UTF-8 format, to the "titlecased canonicalized UTF-8" string.
+
+ (4) Repeat from step (1) until there is no more data in the input
+ string.
+
+ Following the above preparation process on each string, the equality,
+ ordering, and substring operations are as for i;octet.
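
The preparation steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation shipped in this commit: the names `prepare`, `_titlecase`, and `_decompose` are hypothetical, Python's bundled `unicodedata` tables stand in for UnicodeData.txt, and `str.title()` (which applies full case mappings) is accepted only when it yields a single codepoint, as an approximation of the titlecase property.

```python
import unicodedata


def _decompose(cp: str) -> str:
    """Step (2)(b): recursively expand a codepoint's decomposition property."""
    fields = unicodedata.decomposition(cp).split()
    # Drop a leading compatibility tag such as "<compat>"; keep the hex values.
    parts = [chr(int(f, 16)) for f in fields if not f.startswith("<")]
    if not parts:
        return cp
    return "".join(_decompose(p) for p in parts)


def _titlecase(cp: str) -> str:
    """Step (2)(a), approximated: accept only single-codepoint results."""
    titled = cp.title()
    return titled if len(titled) == 1 else cp


def prepare(raw: bytes, charset: str = "utf-8") -> bytes:
    """Return the "titlecased canonicalized UTF-8" form of raw.

    If the charset is unknown or the input is invalid (step (1)(b)), the
    original octets are returned unchanged for use with i;octet.
    """
    try:
        text = raw.decode(charset)  # strict: rejects overlong UTF-8
    except (LookupError, UnicodeDecodeError):
        return raw
    prepared = "".join(_decompose(_titlecase(cp)) for cp in text)
    return prepared.encode("utf-8")


def equal(a: bytes, b: bytes) -> bool:
    """Equality is an octet comparison of the prepared strings."""
    return prepare(a) == prepare(b)
```

With this sketch, `prepare("\u01c4".encode("utf-8"))` yields the UTF-8 encoding of U+0044 U+007A U+030C, matching the worked example in step (2), and ordering or substring tests can likewise be done on the prepared octets.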
+
+ It is permitted to use an alternative implementation of the above
+ preparation process if it produces the same results. For example, it
+ may be more convenient for an implementation to convert all input
+ strings to a sequence of UTF-16 or UTF-32 values prior to performing
+ any of the step (2) actions. Similarly, if all input strings are (or
+ are convertible to) Unicode, it may be possible to use UTF-32 as an
+ alternative to UTF-8 in step (3).
+
+ Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
+ because UTF-16 surrogates will cause i;octet to collate codepoints
+ U+E000 through U+FFFF after non-BMP codepoints.
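
A small check of the UTF-16 note, assuming big-endian encodings without a byte order mark: BMP codepoints above the surrogate range (U+E000 and up) compare greater, octet by octet, than the surrogate pairs that encode non-BMP codepoints. The helper name `octet_order` is hypothetical.

```python
def octet_order(encoding: str, low: str, high: str) -> bool:
    """True if a byte-wise (i;octet style) comparison ranks low before high."""
    return low.encode(encoding) < high.encode(encoding)


bmp = "\ue000"          # BMP codepoint just above the surrogate range
non_bmp = "\U00010000"  # first non-BMP codepoint

utf8_ok = octet_order("utf-8", bmp, non_bmp)       # EE 80 80 vs F0 90 80 80
utf32_ok = octet_order("utf-32-be", bmp, non_bmp)  # 00 00 E0 00 vs 00 01 00 00
utf16_ok = octet_order("utf-16-be", bmp, non_bmp)  # E0 00 vs D8 00 DC 00
```

Here `utf8_ok` and `utf32_ok` are true while `utf16_ok` is false, which is why UTF-8 or UTF-32 preserves codepoint order under i;octet and UTF-16 does not.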
+
+ This collation is not locale sensitive. Consequently, care should be
+ taken when using OS-supplied functions to implement this collation.
+ Functions such as strcasecmp and toupper are sometimes locale
+ sensitive and may inconsistently casemap letters.
+
+ The i;unicode-casemap collation is well suited to use with many
+ Internet protocols and computer languages. Use with natural language
+ is often inappropriate; even though the collation apparently supports
+ languages such as Swahili and English, in real-world use it tends to
+ mis-sort a number of types of string:
+
+ o people and place names containing scripts that are not collated
+ according to "alphabetical order".
+ o words with characters that have diacriticals. However,
+ i;unicode-casemap generally does a better job than i;ascii-casemap
+ for most (but not all) languages. For example, German umlaut
+ letters will sort correctly, but some Scandinavian letters will
+ not.
+ o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
+ in English).
+ o strings containing other non-letter symbols; e.g., euro and pound
+ sterling symbols, quotation marks other than '"', dashes/hyphens,
+ etc.
+
+
+
+Crispin Standards Track [Page 3]
+
+RFC 5051 i;unicode-casemap October 2007
+
+
+3. Unicode Casemap Collation Registration
+
+ <?xml version='1.0'?>
+ <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
+ <collation rfc="5051" scope="global" intendedUse="common">
+ <identifier>i;unicode-casemap</identifier>
+ <title>Unicode Casemap</title>
+ <operations>equality order substring</operations>
+ <specification>RFC 5051</specification>
+ <owner>IETF</owner>
+ <submitter>mrc@cac.washington.edu</submitter>
+ </collation>
+
+4. Security Considerations
+
+ The security considerations for [UTF-8], [STRINGPREP], and [UNICODE-
+ SECURITY] apply and are normative to this specification.
+
+ The results from this comparator will vary depending upon the
+ implementation for several reasons. Implementations MUST consider
+ whether these possibilities are a problem for their use case:
+
+ 1) New characters added in Unicode may have decomposition or
+ titlecase properties that will not be known to an implementation
+ based upon an older revision of Unicode. This impacts step (2).
+
+ 2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
+ does not require normalization of out-of-order diacriticals.
+ However, an implementation MAY use an NFKD library routine that
+ does such normalization. This impacts step (2)(b) and possibly
+ also step (1)(a), and is an issue only with ill-formed UTF-8
+ input.
+
+ 3) The set of charsets handled in step (1)(a) is open-ended. UTF-8
+ (and, by extension, US-ASCII) are the only mandatory-to-implement
+ charsets. This impacts step (1)(a).
+
+ Implementations SHOULD, as far as feasible, support all the
+ charsets they are likely to encounter in the input data, in order
+ to avoid poor collation caused by the fall through to the (1)(b)
+ rule.
+
+ 4) Other charsets may have revisions which add new characters that
+ are not known to an implementation based upon an older revision.
+ This impacts step (1)(a) and possibly also step (1)(b).
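
Reason 2) can be illustrated with a short sketch, assuming Python's `unicodedata` module as the "NFKD library routine". The per-codepoint decomposition of step (2)(b) leaves combining marks in their input order, whereas a full NFKD routine also applies canonical reordering. The function name `decompose_only` is hypothetical.

```python
import unicodedata


def decompose_only(text: str) -> str:
    """Apply only recursive decomposition, with no canonical reordering."""
    def expand(cp: str) -> str:
        fields = unicodedata.decomposition(cp).split()
        parts = [chr(int(f, 16)) for f in fields if not f.startswith("<")]
        return "".join(expand(p) for p in parts) if parts else cp
    return "".join(expand(cp) for cp in text)


# q + COMBINING DOT ABOVE (class 230) + COMBINING DOT BELOW (class 220)
sample = "q\u0307\u0323"
per_codepoint = decompose_only(sample)                # order left as given
library_nfkd = unicodedata.normalize("NFKD", sample)  # marks reordered
```

The two results differ only in the order of the combining marks, so two implementations that make different choices here can disagree on equality for such input.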
+
+
+
+
+
+
+Crispin Standards Track [Page 4]
+
+RFC 5051 i;unicode-casemap October 2007
+
+
+ An attacker may create input that is ill-formed or in an unknown
+ charset, with the intention of impacting the results of this
+ comparator or exploiting other parts of the system which process this
+ input in different ways. Note, however, that even well-formed data
+ in a known charset can impact the result of this comparator in
+ unexpected ways. For example, an attacker can substitute U+0041
+ (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
+ U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
+ non-match of strings which visually appear the same and/or causing
+ the string to appear elsewhere in a sort.
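
The lookalike substitution described above can be checked with a few lines, again assuming Python's `unicodedata` tables in place of UnicodeData.txt. None of the three capital letters has a decomposition property, and each is already its own titlecase form, so preparation leaves them as three distinct octet sequences.

```python
import unicodedata

# LATIN, GREEK, and CYRILLIC capital letters that look alike in many fonts.
lookalikes = ["\u0041", "\u0391", "\u0410"]

no_decomposition = all(unicodedata.decomposition(ch) == "" for ch in lookalikes)
already_titlecase = all(ch.title() == ch for ch in lookalikes)

# The set keeps one entry per distinct UTF-8 encoding.
distinct_octets = {ch.encode("utf-8") for ch in lookalikes}
```

Because `distinct_octets` holds three entries, strings that differ only by such a substitution compare as unequal and sort apart.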
+
+5. IANA Considerations
+
+ The i;unicode-casemap collation defined in section 2 has been added
+ to the registry of collations defined in [COMPARATOR].
+
+6. Normative References
+
+ [COMPARATOR] Newman, C., Duerst, M., and A. Gulbrandsen,
+ "Internet Application Protocol Collation
+ Registry", RFC 4790, February 2007.
+
+ [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of
+ Internationalized Strings ("stringprep")", RFC
+ 3454, December 2002.
+
+ [UTF-8] Yergeau, F., "UTF-8, a transformation format of
+ ISO 10646", STD 63, RFC 3629, November 2003.
+
+ [UNICODE-DATA] <http://www.unicode.org/Public/UNIDATA/
+ UnicodeData.txt>
+
+ Although the UnicodeData.txt file referenced
+ here is part of the Unicode standard, it is
+ subject to change as new characters are added
+ to Unicode and errors are corrected in Unicode
+ revisions. As a result, it may be less stable
+ than might otherwise be implied by the
+ standards status of this specification.
+
+ [UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security
+ Considerations", February 2006,
+ <http://www.unicode.org/reports/tr36/>.
+
+
+
+
+
+
+
+
+Crispin Standards Track [Page 5]
+
+RFC 5051 i;unicode-casemap October 2007
+
+
+7. Informative References
+
+ [BASIC] Newman, C., Duerst, M., and A. Gulbrandsen,
+ "i;basic - the Unicode Collation Algorithm",
+ Work in Progress, March 2007.
+
+ [IMAP-SORT] Crispin, M. and K. Murchison, "Internet Message
+ Access Protocol - SORT and THREAD Extensions",
+ Work in Progress, September 2007.
+
+Author's Address
+
+ Mark R. Crispin
+ Networks and Distributed Computing
+ University of Washington
+ 4545 15th Avenue NE
+ Seattle, WA 98105-4527
+
+ Phone: +1 (206) 543-5762
+ EMail: MRC@CAC.Washington.EDU
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Crispin Standards Track [Page 6]
+
+RFC 5051 i;unicode-casemap October 2007
+
+
+Full Copyright Statement
+
+ Copyright (C) The IETF Trust (2007).
+
+ This document is subject to the rights, licenses and restrictions
+ contained in BCP 78, and except as set forth therein, the authors
+ retain all their rights.
+
+ This document and the information contained herein are provided on an
+ "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+ OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
+ THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
+ OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
+ THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+ WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+ The IETF takes no position regarding the validity or scope of any
+ Intellectual Property Rights or other rights that might be claimed to
+ pertain to the implementation or use of the technology described in
+ this document or the extent to which any license under such rights
+ might or might not be available; nor does it represent that it has
+ made any independent effort to identify any such rights. Information
+ on the procedures with respect to rights in RFC documents can be
+ found in BCP 78 and BCP 79.
+
+ Copies of IPR disclosures made to the IETF Secretariat and any
+ assurances of licenses to be made available, or the result of an
+ attempt made to obtain a general license or permission for the use of
+ such proprietary rights by implementers or users of this
+ specification can be obtained from the IETF on-line IPR repository at
+ http://www.ietf.org/ipr.
+
+ The IETF invites any interested party to bring to its attention any
+ copyrights, patents or patent applications, or other proprietary
+ rights that may cover technology that may be required to implement
+ this standard. Please address the information to the IETF at
+ ietf-ipr@ietf.org.
+
+
+
+
+
+
+
+
+
+
+
+
+Crispin Standards Track [Page 7]
+