diff options
Diffstat (limited to 'imap/docs/formats.txt')
-rw-r--r-- | imap/docs/formats.txt | 217 |
1 files changed, 217 insertions, 0 deletions
diff --git a/imap/docs/formats.txt b/imap/docs/formats.txt new file mode 100644 index 00000000..8dfb9dae --- /dev/null +++ b/imap/docs/formats.txt @@ -0,0 +1,217 @@ +/* ======================================================================== + * Copyright 1988-2006 University of Washington + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * + * ======================================================================== + */ + + Mailbox Format Characteristics + Mark Crispin + 11 December 2006 + + + When a mailbox storage technology uses local files and +directories directly, the file(s) and directories are layed out in a +mailbox format. + +I. Flat-File Formats + + In these formats, a mailbox and all the messages inside are a +single file on the filesystem. The mailbox name is the name of the +file in the filesystem, relative to the user's "mail home directory." + + A flat-file format mailbox is always a file, never a directory. +This means that it is impossible to have a flat-file format mailbox +that has inferior mailbox names under it (so-called "dual-usage" +mailboxes). For some inexplicable reason, some people want this. + + The mail home directory is usually the same as the user login +home directory if that concept is meaningful; otherwise, it is some +other default directory (e.g. "C:\My Documents" on Windows 98). This +can be redefined by modifying the c-client source code or in an +application via the SET_HOMEDIR mail_parameters() call. + + For example, a mailbox named "project" is likely to be found in +the file "project" in the user's home directory. Similarly, a mailbox +named "test/trial1" (assuming a UNIX system) is likely to be found in +the file "trial1" in the subdirectory "test" in the user's home +directory. + + Note that the name "INBOX" has special semantics and rules, as +described in the file naming.txt. + + The following flat-file formats are supported by c-client as of +the time of this writing: + +. unix This is the traditional UNIX mailbox format, in use for nearly + 30 years. It uses a line starting with "From " to indicate + start of message, and stores the message status inside the + RFC822 message header. + + unix is not particularly efficient; the entire mailbox file + must be read when the mailbox is open, and when reading message + texts it is necessary to convert the newline convention to + Internet standard CR LF form. unix preserves UIDs, and allows + the creation of keywords. + + Only one process may have a unix-format mailbox open + read/write at a time. + +. mmdf This is the format used by the MMDF mailer. It uses a line + consisting of 4 <CTRL/A> (0x01) characters to indicate start + and end of message. Optionally, there may also be a unix + format "From " line. It otherwise has the same + characteristics as unix format. + +. mbx This is the current preferred mailbox format. It can be + handled quite efficiently by c-client, without the problems + that exist with unix and mmdf formats. Messages are stored + in Internet standard CR LF format. + + mbx permits shared access, including shared expunge. It + preserves UIDs, and allows the creation of keywords. + +. mtx This is supported for compatibility with the past. This is + the old Tenex/TOPS-20 mail.txt format. It can be handled + quite efficiently by c-client, and has most of the + characteristics of mbx format. + + mtx is deficient in that it does not support shared expunge; + it has no means to store UIDs; and it has no way to define + keywords except through an external configuration file. + +. tenex This is supported for compatibility with the past. This is + the old Columbia MM format. This is similar to mtx format, + only it uses UNIX-style bare-LF newlines instead of CR LF + newlines, thus incurring a performance penalty for newline + conversion. + +. phile This is not strictly a format. Any file which is not in a + recognized format is in phile format, which treats the entire + contents of the file as a single message. + + +II. File/Message Formats + + In these formats, a mailbox is a directory, and each the messages +inside are separate files inside the directory. The file names of +these files are generally the text form of a number, which also +matches the UID of the message. + + In the case of mx, the mailbox name is the name of the directory +in the filesystem, relative to the user's "mail home directory." In +the case of news and mh, the mailbox name is in a separate namespace +as described in the file naming.txt. + + A file/message format mailbox is always a directory. This means +that it is possible to have a file/message format mailbox that has +inferior mailbox names under it (so-called "dual-usage" mailboxes). +For some inexplicable reason, some people want this. + + Note that the name "INBOX" has special semantics and rules, as +described in the file naming.txt. + + The following file/message formats are supported by c-client as of +the time of this writing: + +. mx This is an experimental format, and may be removed in a future + release. An mx format mailbox has a .mxindex file which holds + the message status and unique identifiers. Messages are + stored in Internet standard CF LF form, so the file size of + the message file equals the size of the message. + + mx is somewhat inefficient; the entire directory must be read + and each file stat()'d. We found it intolerable for a + moderate sized mailbox (2000 messages) and have more or less + abandoned it. + +. mh This is supported for compatibility with the past. This is + the format used by the old mh program. + + mh is very inefficient; the entire directory must be read + and each file stat()'d, and in order to determine the size + of a message, the entire file must be read and newline + conversion performed. + + mh is deficient in that it does not support any permanent + flags or keywords; and has no means to store UIDs (because + the mh "compress" command renames all the files, that's + why). + +. news This is an export of the local filesystem's news spool, e.g. + /var/spool/news. Access to mailboxes in news format is read + only; however, message "deleted" status is preserved in a + .newsrc file in the user's home directory. There is no other + status or keywords. + + news is very inefficient; the entire directory must be + read and each file stat()'d, and in order to determine the + size of a message, the entire file must be read and newline + conversion performed. + + news is deficient in that it does not support permanent flags + other than deleted; does not support keywords; and has no + expunge. + + +Soapbox on File/Message Formats + + If it sounds from the above descriptions that we're not putting +too much effort into file/message formats, you are correct. + + There's a general reason why file/message formats are a bad idea. +Just about every filesystem in existance serializes file creation and +deletions because these manipulate the free space map. This turns out +to be an enormous problem when you start creating/deleting more than a +few messages per second; you spend all your time thrashing in the +filesystem. + + It is also extremely slow to do a text search through a +file/message format mailbox. All of those open()s and close()s really +add up to major filesystem thrashing. + + +What about Cyrus and Maildir? + + Both formats are vulnerable to the filesystem thrashing outlined +above. + + The Cyrus format used by CMU's Cyrus server (and Esys' server) +has a special associated flat file in each directory that contains +extensive data (including pre-parsed ENVELOPEs and BODYSTRUCTUREs) +about the messages. Put another way, it's a (considerably) more +featureful form of mx. It also uses certain operating system +facilities (e.g. file/memory mapping) which are not available on older +systems, at a cost of much more limited portability than c-client. +These considerably ameliorate the fundamental problems with +file/message formats; in fact, Cyrus is halfway to being a database. +Rather than support Cyrus format in c-client, you should run Cyrus or +Esys if you want that format. + + The Maildir format used by qmail has all of the performance +disadvantages of mh noted above, with the additional problem that the +files are renamed in order to change their status so you end up having +to rescan the directory frequently to locate the current names +(particularly in a shared mailbox scenario). It doesn't scale, and it +represents a support nightmare; it is therefore not supported in the +official distribution. Maildir support code for c-client is available +from third parties; but, if you use it, it is entirely at your own +risk (read: don't complain about how poorly it performs or bugs). + + +So what does this all mean? + + A database (such as used by Exchange) is really a much better +approach if you want to move away from flat files. mx and especially +Cyrus take a tenative step in that direction; mx failed mostly because +it didn't go anywhere near far enough. Cyrus goes much further, and +scores remarkable benefits from doing so. + + However, a well-designed pure database without the overhead of +separate files would do even better. |