1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184
|
% public-inbox developer manual
=head1 NAME
public-inbox-v1-format - git repository and tree description (aka "ssoma")
=head1 DESCRIPTION
WARNING: this does NOT describe the scalable v2 format used
by public-inbox. Use of ssoma is not recommended for new
installations due to scalability problems.
ssoma uses a git repository to store each email as a git blob.
The tree filename of the blob is based on the SHA1 hexdigest of
the first Message-ID header. A commit is made for each message
delivered. The commit SHA-1 identifier is used by ssoma clients
to track synchronization state.
=head1 PATHNAMES IN TREES
A Message-ID may be extremely long and also contain slashes, so using
them as a path name is challenging. Instead we use the SHA-1 hexdigest
of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>")
to generate a path name. Leading and trailing white space in the
Message-ID header is ignored for hashing.
A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt>
Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63
Thus it is easy to look up the contents of a message matching a given
a Message-ID.
=head1 MESSAGE-ID CONFLICTS
public-inbox v1 repositories currently do not resolve conflicting
Message-IDs or messages with multiple Message-IDs.
=head1 HEADERS
The Message-ID header is required.
"Bytes", "Lines" and "Content-Length" headers are stripped and not
allowed, they can interfere with further processing.
When using ssoma with public-inbox-mda, the "Status" mbox header
is also stripped as that header makes no sense in a public archive.
=head1 LOCKING
L<flock(2)> locking exclusively locks the empty $GIT_DIR/ssoma.lock file
for all non-atomic operations.
=head1 EXAMPLE INPUT FLOW (SERVER-SIDE MDA)
1. Message is delivered to a mail transport agent (MTA)
1a. (optional) reject/discard spam, this should run before ssoma-mda
1b. (optional) reject/strip unwanted attachments
ssoma-mda handles all steps once invoked.
2. Mail transport agent invokes ssoma-mda
3. reads message via stdin, extracting Message-ID
4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock
5. creates or updates the blob of associated 2/38 SHA-1 path
6. updates the index and commits
7. releases $GIT_DIR/ssoma.lock
ssoma-mda can also be used as an L<inotify(7)> trigger to monitor maildirs,
and the ability to monitor IMAP mailboxes using IDLE will be available
in the future.
=head1 GIT REPOSITORIES (SERVERS)
ssoma uses bare git repositories on both servers and clients.
Using the L<git-init(1)> command with --bare is the recommend method
of creating a git repository on a server:
git init --bare /path/to/wherever/you/want.git
There are no standardized paths for servers, administrators make
all the choices regarding git repository locations.
Special files in $GIT_DIR on the server:
=over
=item $GIT_DIR/ssoma.lock
An empty file for L<flock(2)> locking.
This is necessary to ensure the index and commits are updated
consistently and multiple processes running MDA do not step on
each other.
=item $GIT_DIR/public-inbox/msgmap.sqlite3
SQLite3 database maintaining a stable mapping of Message-IDs to NNTP
article numbers. Used by L<public-inbox-nntpd(1)> and created
and updated by L<public-inbox-index(1)>.
Users of the L<PublicInbox::WWW> interface will find it
useful for attempting recovery from copy-paste truncations of
URLs containing long Message-IDs.
Automatically updated by L<public-inbox-mda(1)>,
L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
Losing or damaging this file will cause synchronization problems for
NNTP clients. This file is expected to be stable and require no
updates to its schema.
Requires L<DBD::SQLite>.
=item $GIT_DIR/public-inbox/xapian$N/
Xapian database for search indices in the PSGI web UI.
$N is the value of PublicInbox::Search::SCHEMA_VERSION, and
installations may have parallel versions on disk during upgrades
or to roll-back upgrades.
This is created and updated by L<public-inbox-index(1)>.
Automatically updated by L<public-inbox-mda(1)>,
L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
This directory can always be regenerated with L<public-inbox-index(1)>.
If lost or damaged, there is no need to back it up unless the
CPU/memory cost of regenerating it outweighs the storage/transfer cost.
Since SCHEMA_VERSION 15 and the development of the v2 format,
the "overview" DB also exists in the xapian directory for v1
repositories. See L<public-inbox-v2-format(5)/OVERVIEW DB>
Our use of the L</OVERVIEW DB> requires Xapian document IDs to
remain stable. Using L<public-inbox-compact(1)> and
L<public-inbox-xcpdb(1)> wrappers are recommended over tools
provided by Xapian.
This directory is large, often two to three times the size of
the objects stored in a packed git repository.
=item $GIT_DIR/ssoma.index
This file is no longer used or created by public-inbox, but it is
updated if it exists to remain compatible with ssoma installations.
A git index file used for MDA updates. The normal git index (in
$GIT_DIR/index) is not used at all as there is typically no working
tree.
=back
Each client $GIT_DIR may have multiple mbox/maildir/command targets.
It is possible for a client to extract the mail stored in the git
repository to multiple mboxes for compatibility with a variety of
different tools.
=head1 CAVEATS
It is NOT recommended to check out the working directory of a git.
there may be many files.
It is impossible to completely expunge messages, even spam, as git
retains full history. Projects may (with adequate notice) cycle to new
repositories/branches with history cleaned up via L<git-filter-repo(1)>
or L<git-filter-branch(1)>.
This is up to the administrators.
=head1 COPYRIGHT
Copyright 2013-2021 all contributors L<mailto:meta@public-inbox.org>
License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
=head1 SEE ALSO
L<gitrepository-layout(5)>, L<ssoma(1)>
|