/* ========================================================================
 * Copyright 1988-2006 University of Washington
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * 
 * ========================================================================
 */

		    Mailbox Format Characteristics
			     Mark Crispin
			   11 December 2006


     When a mailbox storage technology uses local files and
directories directly, the file(s) and directories are laid out in a
mailbox format.

I. Flat-File Formats

     In these formats, a mailbox and all the messages inside are a
single file on the filesystem.  The mailbox name is the name of the
file in the filesystem, relative to the user's "mail home directory."

     A flat-file format mailbox is always a file, never a directory.
This means that it is impossible to have a flat-file format mailbox
that has inferior mailbox names under it (so-called "dual-usage"
mailboxes).  For some inexplicable reason, some people want this.

     The mail home directory is usually the same as the user login
home directory if that concept is meaningful; otherwise, it is some
other default directory (e.g. "C:\My Documents" on Windows 98).  This
can be redefined by modifying the c-client source code or in an
application via the SET_HOMEDIR mail_parameters() call.

     For example, a mailbox named "project" is likely to be found in
the file "project" in the user's home directory.  Similarly, a mailbox
named "test/trial1" (assuming a UNIX system) is likely to be found in
the file "trial1" in the subdirectory "test" in the user's home
directory.
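
     The mapping in the example above can be sketched as follows.  This
is only an illustration in Python, not c-client code; the function
name flat_file_path and the home directory "/home/fred" are
hypothetical, and INBOX, namespaces, and absolute names are
deliberately left out.

```python
import posixpath


def flat_file_path(mail_home, mailbox_name):
    """Map a flat-file mailbox name to a file path under mail_home.

    Hypothetical helper for illustration; it is not c-client's resolver.
    """
    if mailbox_name.upper() == "INBOX":
        # INBOX is special-cased elsewhere (see naming.txt), so refuse it.
        raise ValueError("INBOX is not resolved by this sketch")
    # On UNIX, "/" in the mailbox name separates directory levels.
    return posixpath.join(mail_home, *mailbox_name.split("/"))
```

     Under these assumptions, flat_file_path("/home/fred", "test/trial1")
names the file "trial1" in the subdirectory "test" of the home
directory.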

     Note that the name "INBOX" has special semantics and rules, as
described in the file naming.txt.

     The following flat-file formats are supported by c-client as of
the time of this writing:

. unix	This is the traditional UNIX mailbox format, in use for nearly
	30 years.  It uses a line starting with "From " to indicate
	start of message, and stores the message status inside the
	RFC822 message header.

	unix is not particularly efficient; the entire mailbox file
	must be read when the mailbox is opened, and when reading message
	texts it is necessary to convert the newline convention to
	Internet standard CR LF form.  unix preserves UIDs, and allows
	the creation of keywords.

	Only one process may have a unix-format mailbox open
	read/write at a time.

. mmdf	This is the format used by the MMDF mailer.  It uses a line
	consisting of 4 <CTRL/A> (0x01) characters to indicate start
	and end of message.  Optionally, there may also be a unix
	format "From " line.  It otherwise has the same
	characteristics as unix format.

. mbx	This is the current preferred mailbox format.  It can be
	handled quite efficiently by c-client, without the problems
	that exist with unix and mmdf formats.  Messages are stored
	in Internet standard CR LF format.

	mbx permits shared access, including shared expunge.  It
	preserves UIDs, and allows the creation of keywords.

. mtx	This is supported for compatibility with the past.  This is
	the old Tenex/TOPS-20 mail.txt format.  It can be handled
	quite efficiently by c-client, and has most of the
	characteristics of mbx format.

	mtx is deficient in that it does not support shared expunge;
	it has no means to store UIDs; and it has no way to define
	keywords except through an external configuration file.

. tenex	This is supported for compatibility with the past.  This is
	the old Columbia MM format.  This is similar to mtx format,
	only it uses UNIX-style bare-LF newlines instead of CR LF
	newlines, thus incurring a performance penalty for newline
	conversion.

. phile	This is not strictly a format.  Any file which is not in a
	recognized format is in phile format, which treats the entire
	contents of the file as a single message.
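
     The message delimiters described for unix and mmdf formats can be
sketched as below.  This is a naive illustration in Python, not what
c-client does: the function names are hypothetical, any line starting
with "From " is taken as a message start (a real parser validates the
line more carefully), and the optional "From " line inside an mmdf
message is not handled.

```python
MMDF_DELIM = b"\x01\x01\x01\x01\n"   # a line of four CTRL/A characters


def split_unix(data):
    """Split unix-format mailbox bytes at lines starting with b'From '."""
    messages, current = [], None
    for line in data.splitlines(keepends=True):
        if line.startswith(b"From "):
            if current is not None:
                messages.append(b"".join(current))
            current = []              # the "From " line itself is dropped
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(b"".join(current))
    return messages


def split_mmdf(data):
    """Split mmdf-format mailbox bytes at the CTRL/A delimiter lines."""
    messages, current = [], None
    for line in data.splitlines(keepends=True):
        if line == MMDF_DELIM:
            if current is None:
                current = []          # delimiter opening a message
            else:
                messages.append(b"".join(current))
                current = None        # delimiter closing a message
        elif current is not None:
            current.append(line)
    return messages
```

     Both functions have to walk the whole file, which is the cost the
unix description above refers to.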


II. File/Message Formats

     In these formats, a mailbox is a directory, and each message
inside is a separate file in the directory.  The file names of
these files are generally the text form of a number, which also
matches the UID of the message.
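
     A minimal sketch of what opening such a mailbox involves, assuming
that every all-digit file name is a message and that the number is its
UID.  The function name scan_message_files is hypothetical and this is
Python for illustration, not c-client code.

```python
import os


def scan_message_files(mailbox_dir):
    """Return sorted (uid, file_size) pairs for all-digit file names."""
    found = []
    for name in os.listdir(mailbox_dir):      # the whole directory is read
        # Skip index files and anything else that is not a plain number.
        if not (name.isascii() and name.isdigit()):
            continue
        info = os.stat(os.path.join(mailbox_dir, name))   # one stat() each
        found.append((int(name), info.st_size))
    found.sort()
    return found
```

     Note that the work grows with the number of messages: one
directory entry and one stat() call per message, every time the
mailbox is scanned.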

     In the case of mx, the mailbox name is the name of the directory
in the filesystem, relative to the user's "mail home directory."  In
the case of news and mh, the mailbox name is in a separate namespace
as described in the file naming.txt.

     A file/message format mailbox is always a directory.  This means
that it is possible to have a file/message format mailbox that has
inferior mailbox names under it (so-called "dual-usage" mailboxes).
For some inexplicable reason, some people want this.

     Note that the name "INBOX" has special semantics and rules, as
described in the file naming.txt.

     The following file/message formats are supported by c-client as of
the time of this writing:

. mx	This is an experimental format, and may be removed in a future
	release.  An mx format mailbox has a .mxindex file which holds
	the message status and unique identifiers.  Messages are
	stored in Internet standard CR LF form, so the file size of
	the message file equals the size of the message.

	mx is somewhat inefficient; the entire directory must be read
	and each file stat()'d.  We found it intolerable for a
	moderate sized mailbox (2000 messages) and have more or less
	abandoned it.	

. mh	This is supported for compatibility with the past.  This is
	the format used by the old mh program.

	mh is very inefficient; the entire directory must be read
	and each file stat()'d, and in order to determine the size
	of a message, the entire file must be read and newline
	conversion performed.

	mh is deficient in that it does not support any permanent
	flags or keywords; and has no means to store UIDs (because
	the mh "compress" command renames all the files, that's
	why).

. news	This is an export of the local filesystem's news spool, e.g.
	/var/spool/news.  Access to mailboxes in news format is read
	only; however, message "deleted" status is preserved in a
	.newsrc file in the user's home directory.  There is no other
	status or keywords.

	news is very inefficient; the entire directory must be
	read and each file stat()'d, and in order to determine the
	size of a message, the entire file must be read and newline
	conversion performed.

	news is deficient in that it does not support permanent flags
	other than deleted; does not support keywords; and has no
	expunge.
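
     The difference in the cost of determining a message's size can be
sketched as follows.  This is an illustration in Python with
hypothetical function names; the second function assumes that every LF
in the file is a bare LF (no CR LF pairs are already present).

```python
import os


def size_when_stored_crlf(path):
    """Messages stored with CR LF newlines: the file size is the answer."""
    return os.stat(path).st_size


def size_when_stored_bare_lf(path):
    """Messages stored with bare LF newlines: read the whole file.

    Each LF becomes CR LF on the wire, so it counts as two octets.
    """
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            size += len(chunk) + chunk.count(b"\n")
    return size
```

     The first needs only a stat(); the second must read every octet of
every message, which is the cost attributed to mh and news above.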


Soapbox on File/Message Formats

     If it sounds from the above descriptions that we're not putting
too much effort into file/message formats, you are correct.

     There's a general reason why file/message formats are a bad idea.
Just about every filesystem in existence serializes file creation and
deletions because these manipulate the free space map.  This turns out
to be an enormous problem when you start creating/deleting more than a
few messages per second; you spend all your time thrashing in the
filesystem.

     It is also extremely slow to do a text search through a
file/message format mailbox.  All of those open()s and close()s really
add up to major filesystem thrashing.


What about Cyrus and Maildir?

     Both formats are vulnerable to the filesystem thrashing outlined
above.

     The Cyrus format used by CMU's Cyrus server (and Esys' server)
has a special associated flat file in each directory that contains
extensive data (including pre-parsed ENVELOPEs and BODYSTRUCTUREs)
about the messages.  Put another way, it's a (considerably) more
featureful form of mx.  It also uses certain operating system
facilities (e.g. file/memory mapping) which are not available on older
systems, at a cost of much more limited portability than c-client.
Together, these features considerably ameliorate the fundamental
problems with file/message formats; in fact, Cyrus is halfway to being
a database.  Rather than support Cyrus format in c-client, you should
run Cyrus or Esys if you want that format.

     The Maildir format used by qmail has all of the performance
disadvantages of mh noted above, with the additional problem that the
files are renamed in order to change their status so you end up having
to rescan the directory frequently to locate the current names
(particularly in a shared mailbox scenario).  It doesn't scale, and it
represents a support nightmare; it is therefore not supported in the
official distribution.  Maildir support code for c-client is available
from third parties; but, if you use it, it is entirely at your own
risk (read: don't complain about its poor performance or its bugs).
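
     The renaming problem can be illustrated with a sketch of the usual
Maildir naming convention, in which flags are letters appended after
":2," in the file name.  That convention comes from the Maildir
specification, not from this document, and the function name and the
sample file name are hypothetical.

```python
def maildir_name_with_flag(filename, flag):
    """Return the file name a message has after a flag letter is added.

    Assumes the "unique:2,FLAGS" naming convention, with the flag
    letters kept in ASCII order.
    """
    base, sep, info = filename.partition(":2,")
    flags = set(info) if sep else set()
    flags.add(flag)
    return base + ":2," + "".join(sorted(flags))
```

     Under this assumption, marking a message as seen ("S") changes its
file name, so any other process holding the old name has to rescan the
directory to find the message again.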


So what does this all mean?

     A database (such as used by Exchange) is really a much better
approach if you want to move away from flat files.  mx and especially
Cyrus take a tentative step in that direction; mx failed mostly because
it didn't go anywhere near far enough.  Cyrus goes much further, and
scores remarkable benefits from doing so.

     However, a well-designed pure database without the overhead of
separate files would do even better.