Diffstat (limited to 'imap/docs/formats.txt')
-rw-r--r-- imap/docs/formats.txt 217
1 files changed, 217 insertions, 0 deletions
diff --git a/imap/docs/formats.txt b/imap/docs/formats.txt
new file mode 100644
index 00000000..8dfb9dae
--- /dev/null
+++ b/imap/docs/formats.txt
@@ -0,0 +1,217 @@
+/* ========================================================================
+ * Copyright 1988-2006 University of Washington
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *
+ * ========================================================================
+ */
+
+ Mailbox Format Characteristics
+ Mark Crispin
+ 11 December 2006
+
+
+ When a mailbox storage technology uses local files and
+directories directly, the file(s) and directories are laid out in a
+mailbox format.
+
+I. Flat-File Formats
+
+ In these formats, a mailbox and all the messages inside are a
+single file on the filesystem. The mailbox name is the name of the
+file in the filesystem, relative to the user's "mail home directory."
+
+ A flat-file format mailbox is always a file, never a directory.
+This means that it is impossible to have a flat-file format mailbox
+that has inferior mailbox names under it (so-called "dual-usage"
+mailboxes). For some inexplicable reason, some people want this.
+
+ The mail home directory is usually the same as the user login
+home directory if that concept is meaningful; otherwise, it is some
+other default directory (e.g. "C:\My Documents" on Windows 98). This
+can be redefined by modifying the c-client source code or in an
+application via the SET_HOMEDIR mail_parameters() call.
+
+ For example, a mailbox named "project" is likely to be found in
+the file "project" in the user's home directory. Similarly, a mailbox
+named "test/trial1" (assuming a UNIX system) is likely to be found in
+the file "trial1" in the subdirectory "test" in the user's home
+directory.
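+
+ The mapping described above can be sketched in a few lines of
+Python. This is only an illustration, not c-client code: the function
+mailbox_file() and the home directory "/home/fred" are hypothetical,
+a UNIX "/" hierarchy is assumed, and INBOX and names starting with "/"
+or "~" are simply rejected rather than given their real semantics.
+

```python
import posixpath


def mailbox_file(mail_home, mailbox):
    """Map a flat-file mailbox name to a file under the mail home directory.

    Hypothetical helper for illustration only; the real rules (INBOX,
    other namespaces, absolute names) are described in naming.txt.
    """
    if mailbox.upper() == "INBOX":
        raise ValueError("INBOX has special semantics; not handled here")
    if mailbox.startswith(("/", "~")):
        raise ValueError("absolute and ~ names are not handled here")
    # The mailbox name is used directly as a path relative to the mail home.
    return posixpath.join(mail_home, mailbox)


print(mailbox_file("/home/fred", "project"))      # /home/fred/project
print(mailbox_file("/home/fred", "test/trial1"))  # /home/fred/test/trial1
```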
+
+ Note that the name "INBOX" has special semantics and rules, as
+described in the file naming.txt.
+
+ The following flat-file formats are supported by c-client as of
+the time of this writing:
+
+. unix This is the traditional UNIX mailbox format, in use for nearly
+ 30 years. It uses a line starting with "From " to indicate
+ start of message, and stores the message status inside the
+ RFC822 message header.
+
+ unix is not particularly efficient; the entire mailbox file
+ must be read when the mailbox is opened, and when reading message
+ texts it is necessary to convert the newline convention to
+ Internet standard CR LF form. unix preserves UIDs, and allows
+ the creation of keywords.
+
+ Only one process may have a unix-format mailbox open
+ read/write at a time.
+
+. mmdf This is the format used by the MMDF mailer. It uses a line
+ consisting of 4 <CTRL/A> (0x01) characters to indicate start
+ and end of message. Optionally, there may also be a unix
+ format "From " line. It otherwise has the same
+ characteristics as unix format.
+
+. mbx This is the current preferred mailbox format. It can be
+ handled quite efficiently by c-client, without the problems
+ that exist with unix and mmdf formats. Messages are stored
+ in Internet standard CR LF format.
+
+ mbx permits shared access, including shared expunge. It
+ preserves UIDs, and allows the creation of keywords.
+
+. mtx This is supported for compatibility with the past. This is
+ the old Tenex/TOPS-20 mail.txt format. It can be handled
+ quite efficiently by c-client, and has most of the
+ characteristics of mbx format.
+
+ mtx is deficient in that it does not support shared expunge;
+ it has no means to store UIDs; and it has no way to define
+ keywords except through an external configuration file.
+
+. tenex This is supported for compatibility with the past. This is
+ the old Columbia MM format. This is similar to mtx format,
+ only it uses UNIX-style bare-LF newlines instead of CR LF
+ newlines, thus incurring a performance penalty for newline
+ conversion.
+
+. phile This is not strictly a format. Any file which is not in a
+ recognized format is in phile format, which treats the entire
+ contents of the file as a single message.
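+
+ As a rough illustration of the unix and mmdf delimiters described
+above, the following Python sketch splits a mailbox into messages and
+converts bare-LF newlines to CR LF. It is not how c-client does it:
+the function names are hypothetical, the sample addresses are made up,
+and the unix splitter treats every line starting with "From " as a
+separator, whereas a real parser must validate that line more strictly
+and cope with quoted ">From " lines in message bodies.
+

```python
import re

MMDF_DELIMITER = b"\x01\x01\x01\x01\n"  # a line of four CTRL/A characters


def split_unix(data):
    """Split unix-format data at lines that start with b'From '."""
    messages, current = [], None
    for line in data.splitlines(keepends=True):
        if line.startswith(b"From "):
            if current is not None:
                messages.append(b"".join(current))
            current = []  # the separator line itself is not kept
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(b"".join(current))
    return messages


def split_mmdf(data):
    """Split mmdf-format data; each message sits between two delimiter lines."""
    messages, current = [], None
    for line in data.splitlines(keepends=True):
        if line == MMDF_DELIMITER:
            if current is None:
                current = []          # delimiter that opens a message
            else:
                messages.append(b"".join(current))
                current = None        # delimiter that closes a message
        elif current is not None:
            current.append(line)
    return messages


def to_crlf(text):
    """Convert bare LF newlines to Internet standard CR LF."""
    return re.sub(rb"(?<!\r)\n", b"\r\n", text)


unix_box = (b"From fred@example.com Mon Dec 11 10:00:00 2006\n"
            b"Subject: one\n\nfirst body\n\n"
            b"From joe@example.com Mon Dec 11 10:05:00 2006\n"
            b"Subject: two\n\nsecond body\n")
mmdf_box = (MMDF_DELIMITER + b"Subject: one\n\nfirst body\n" + MMDF_DELIMITER +
            MMDF_DELIMITER + b"Subject: two\n\nsecond body\n" + MMDF_DELIMITER)

unix_messages = split_unix(unix_box)
mmdf_messages = split_mmdf(mmdf_box)
print(len(unix_messages), len(mmdf_messages))
print(to_crlf(mmdf_messages[0]))
```

+
+ Note that both splitters have to walk the whole file, and that
+every message text needs the to_crlf() pass before it can be sent;
+these are the costs that mbx avoids by storing CR LF directly.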
+
+
+II. File/Message Formats
+
+ In these formats, a mailbox is a directory, and each of the messages
+inside is a separate file inside the directory. The file names of
+these files are generally the text form of a number, which also
+matches the UID of the message.
+
+ In the case of mx, the mailbox name is the name of the directory
+in the filesystem, relative to the user's "mail home directory." In
+the case of news and mh, the mailbox name is in a separate namespace
+as described in the file naming.txt.
+
+ A file/message format mailbox is always a directory. This means
+that it is possible to have a file/message format mailbox that has
+inferior mailbox names under it (so-called "dual-usage" mailboxes).
+For some inexplicable reason, some people want this.
+
+ Note that the name "INBOX" has special semantics and rules, as
+described in the file naming.txt.
+
+ The following file/message formats are supported by c-client as of
+the time of this writing:
+
+. mx This is an experimental format, and may be removed in a future
+ release. An mx format mailbox has a .mxindex file which holds
+ the message status and unique identifiers. Messages are
+ stored in Internet standard CR LF form, so the file size of
+ the message file equals the size of the message.
+
+ mx is somewhat inefficient; the entire directory must be read
+ and each file stat()'d. We found it intolerable for a
+ moderate sized mailbox (2000 messages) and have more or less
+ abandoned it.
+
+. mh This is supported for compatibility with the past. This is
+ the format used by the old mh program.
+
+ mh is very inefficient; the entire directory must be read
+ and each file stat()'d, and in order to determine the size
+ of a message, the entire file must be read and newline
+ conversion performed.
+
+ mh is deficient in that it does not support any permanent
+ flags or keywords; and has no means to store UIDs (because
+ the mh "compress" command renames all the files, that's
+ why).
+
+. news This is an export of the local filesystem's news spool, e.g.
+ /var/spool/news. Access to mailboxes in news format is read
+ only; however, message "deleted" status is preserved in a
+ .newsrc file in the user's home directory. There is no other
+ status or keywords.
+
+ news is very inefficient; the entire directory must be
+ read and each file stat()'d, and in order to determine the
+ size of a message, the entire file must be read and newline
+ conversion performed.
+
+ news is deficient in that it does not support permanent flags
+ other than deleted; does not support keywords; and has no
+ expunge.
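+
+ The costs mentioned above can be made concrete with a small
+Python sketch. It is an illustration under stated assumptions, not
+c-client code: scan_directory() and crlf_size() are hypothetical
+names, message files are assumed to be named by plain ASCII digits,
+and the sample files are created in a temporary directory.
+

```python
import os
import re
import tempfile


def scan_directory(path):
    """Return sorted (number, file_size) pairs for numerically named files."""
    found = []
    with os.scandir(path) as entries:          # the entire directory is read
        for entry in entries:
            name = entry.name
            if name.isascii() and name.isdigit() and entry.is_file():
                # one stat() per message file
                found.append((int(name), entry.stat().st_size))
    return sorted(found)


def crlf_size(path):
    """Message size after newline conversion; the whole file must be read."""
    with open(path, "rb") as f:
        data = f.read()
    return len(re.sub(rb"(?<!\r)\n", b"\r\n", data))


with tempfile.TemporaryDirectory() as box:
    samples = [("1", b"a\n"), ("2", b"bb\n"), ("10", b"ccc\n"), (".mxindex", b"")]
    for name, body in samples:
        with open(os.path.join(box, name), "wb") as f:
            f.write(body)
    listing = scan_directory(box)
    sizes = [crlf_size(os.path.join(box, str(number))) for number, _ in listing]

print(listing)  # [(1, 2), (2, 3), (10, 4)]
print(sizes)    # [3, 4, 5]
```

+
+ When messages are stored with CR LF newlines, as in mx, the size
+from stat() is already the message size; with bare-LF files, as in the
+sample above, every file has to be read in full to learn its size.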
+
+
+Soapbox on File/Message Formats
+
+ If it sounds from the above descriptions as though we're not
+putting too much effort into file/message formats, you are correct.
+
+ There's a general reason why file/message formats are a bad idea.
+Just about every filesystem in existence serializes file creation and
+deletions because these manipulate the free space map. This turns out
+to be an enormous problem when you start creating/deleting more than a
+few messages per second; you spend all your time thrashing in the
+filesystem.
+
+ It is also extremely slow to do a text search through a
+file/message format mailbox. All of those open()s and close()s really
+add up to major filesystem thrashing.
+
+
+What about Cyrus and Maildir?
+
+ Both formats are vulnerable to the filesystem thrashing outlined
+above.
+
+ The Cyrus format used by CMU's Cyrus server (and Esys' server)
+has a special associated flat file in each directory that contains
+extensive data (including pre-parsed ENVELOPEs and BODYSTRUCTUREs)
+about the messages. Put another way, it's a (considerably) more
+featureful form of mx. It also uses certain operating system
+facilities (e.g. file/memory mapping) which are not available on older
+systems, at a cost of much more limited portability than c-client.
+These considerably ameliorate the fundamental problems with
+file/message formats; in fact, Cyrus is halfway to being a database.
+Rather than support Cyrus format in c-client, you should run Cyrus or
+Esys if you want that format.
+
+ The Maildir format used by qmail has all of the performance
+disadvantages of mh noted above, with the additional problem that the
+files are renamed in order to change their status so you end up having
+to rescan the directory frequently to locate the current names
+(particularly in a shared mailbox scenario). It doesn't scale, and it
+represents a support nightmare; it is therefore not supported in the
+official distribution. Maildir support code for c-client is available
+from third parties; but, if you use it, it is entirely at your own
+risk (read: don't complain about bugs or about how poorly it performs).
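+
+ The renaming problem can be illustrated with a short Python
+sketch. It assumes the usual Maildir convention in which a message's
+flags are single letters appended to the file name after ":2,"; the
+function maildir_add_flag() and the sample file name are hypothetical.
+

```python
def maildir_add_flag(filename, flag):
    """Return the new file name a message gets when a flag is added."""
    base, _, flags = filename.partition(":2,")
    # Flag letters are kept unique and in ASCII order.
    new_flags = "".join(sorted(set(flags) | {flag}))
    return base + ":2," + new_flags


old = "1165831200.M1P99.host:2,"
seen = maildir_add_flag(old, "S")       # message marked as seen
replied = maildir_add_flag(seen, "R")   # then marked as replied

print(seen)     # 1165831200.M1P99.host:2,S
print(replied)  # 1165831200.M1P99.host:2,RS
```

+
+ Each status change gives the file a different name, so any other
+process that remembered the old name has to rescan the directory to
+find the message again.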
+
+
+So what does this all mean?
+
+ A database (such as used by Exchange) is really a much better
+approach if you want to move away from flat files. mx and especially
+Cyrus take a tentative step in that direction; mx failed mostly because
+it didn't go anywhere near far enough. Cyrus goes much further, and
+scores remarkable benefits from doing so.
+
+ However, a well-designed pure database without the overhead of
+separate files would do even better.