Outlook Express Version  5.0 file format
by Will Kranz  email: w_kranz@conknet.com
web page:  http://www.conknet.com/~w_kranz/

The following was done for the fun of solving a puzzle,
or at least part of one.  It has been done by inspection,
I have no association with (nor love of) Microsoft.

I also find they tend to radically change the format of
their files each time they come out with a new version.
Therefore I think the concepts below will only be compatible
with Version 5.0 of Outlook Express and if there isn't a new
version out now, you can assume one will be soon!

Disclaimer:
There is no warranty, I'm not liable if this is wrong...
Use this information at your own risk.

Prior Work:
I went to http://www.wotsit.org when I became interested in
Outlook Express 5.0 files. In the Windows section for the DBX file
type I found dbx.zip.  A great starting point.  I might never
have gotten anywhere without the work of Simon Craythorn who
was later assisted by Jason Miller.  I wrote Simon and he
sent me an updated file with Jason's fix(es).

Quoting part of dbx.txt which is part of the Wotsit distribution,
Simon describes his algorithm as follows:

"We do however know what a message header structure looks like.  
 We also know that this is always on a 4 byte boundary.  
 Looking at the file I can also see that a message ALWAYS starts 
 with an RFC822 Message Header.

 So, by looking through the file, 4 bytes at a time we can locate 
 the start of messages.  Chain them together and export the result.  
 It's slow, it's crude, but it works!
"

Simon's sample code, 0E5.cpp scans the file looking for the start
of a message header by looking for the key RFC822 strings.  Then
he traverses the message which he determined has a linked list
structure.  There is some code in the routine oe5__import_message()
which frankly I don't understand.  Simon says it works, and since
I never actually got around to the final step of dumping messages
I guess I believe him!  The code in question advances through
the file in steps of 65536 while the header.lSectionSize > 512. 

I think this is not required if you get to the start of the
message via the table of contents as outlined below.

File Format Overview:

First let me say that I'm really only dealing with the *.dbx
files that contain messages.  There are others in the outlook
directory with the *.dbx extension which don't follow this
format, ie folders.dbx and pop3uidl.dbx.  What I describe here
is a subset of the entire format, and there are many areas
where I have NO idea as to what is going on.

I have some understanding of the following four sections:

File Header: Appears to extend to 0x2ad4, although what I
         understand occurs in the first 0x100 bytes.

Descriptive Data: Contains various string data, see
         DESC_HEAD.  To date I've only seen these
         allocated in 0xC000 byte block lengths.

Message Data: Contains the messages as a series of linked
         lists.  See DBX_HEAD, this is the structure
         identified by Simon Craythorn (thanks).
         All message data has flags == 0x200.
         To date I've only seen these allocated in 0xF780
         byte block lengths.

TOC Pointer list: Contains pointers to the table of contents
         data, ie offsets to DESC_HEAD locations.  I believe
         this is some form of doubly linked list, but
         don't have a big enough file to be sure.
         See TOC_HEAD which is a super set of DBX_HEAD,
         appears to have flags == 0x0.  To date I've only seen 
         these allocated in 0x3e1c byte block lengths.  Typically
         this will leave space for 25 entries of length 0x27C
         which is the apparent size of an individual TOC_HEAD
         entry with its associated data.

The extremely useful tip Simon gives is that a valid header
only occurs on a modulo 4 byte boundary.  The unsigned long
that begins such a header is equal to the offset in the file
of this header.  Hence one searches for headers by looking for
longs with file offsets equal to the value at that location.
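
A minimal sketch of this search in C, assuming the whole file
has already been read into memory.  The names rd32 and
find_headers are invented for this note and are not taken from
Simon's code or from dbx.c.  Matches are only candidates, since
message text could contain such a value by chance.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * find_headers (hypothetical helper): look at every 4 byte
 * boundary and record the offsets where the long stored there
 * equals its own file offset.  At most max_out offsets are
 * written to out; the return value is the total number found.
 * Offset 0 is skipped, the file begins with identification bytes.
 */
static size_t find_headers(const uint8_t *buf, size_t len,
                           uint32_t *out, size_t max_out)
{
    size_t n = 0;
    size_t off;

    for (off = 4; off + 4 <= len; off += 4) {
        if (rd32(buf + off) == (uint32_t)off) {
            if (n < max_out)
                out[n] = (uint32_t)off;
            n++;
        }
    }
    return n;
}
```

Each offset returned still has to be interpreted as one of the
header types described in this note.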

Outlook appears to allocate fairly large blocks for each type
of data.  A *.dbx file always has a file header region.
As soon as there are one or more messages it will also have
the three other regions described above.  I don't see a pattern
to the order things are allocated, sometimes the Message data
is the next available location, 0x2ad4, and sometimes it will be
the Descriptive data.  However one can determine some of this
from key offsets in the File header region.  When one of the
previous allocated blocks fills, an additional block is allocated
at the end of the file allowing the system to grow.

File Header:
I've identified the following offsets at the beginning of the 
file that provide data about the rest of the file:
Offset                Description

0x0 1st 16 bytes probably a file GUID flag, see shortcut.pdf 
    Same bytes for all message files, but not for all *.dbx.

0x24 points to current block of what I'm calling Descriptive Data, 
    TOC data and the entry with "flag" 0x48 (the string at 0x88
    below) are appended in this block
    Typically 0xC000 bytes reserved for TOC data on first whack

0x28 is allocation length for each Descriptive Data block = 0xC000

0x30 points to TOC current header list block, header with flag == 0
    may be 0 if no message, ie no TOC
    Each flag = 0 block 0x27c bytes long, 1 block contains 25 of these
    May be more data after this, see inbox.dbx, and offset 0x7c

0x34 is allocation length for each TOC pointer list block
    (see offset 0x30) = 0x3e1c bytes

0x3C points to active Message Data block header.  
    Typically DBX_HEAD.flags == 0x200
    This is where one adds next message
    Allocation seems to be 0xF780.  TOC list records use a 3 byte
    offset, ie I don't see how to exceed an offset of 0xFFFFFF.
    This allows files up to 16.77 Megs, then what happens?

0x40 is allocation length for Message Data size = 0xF780

0xC4 active messages in file <= value at offset 0x5C

0x5C looks like total messages in file (including deleted)

0x7C points to what might be next available block?
    typically near EOF, but zeros in this area in all files
    that I've seen.

0x88 points to string describing flags in file
    all *.dbx with messages seem to have one of these...
    it's a string per below in all cases where it exists,
    ie if 0x88 != 0, there are messages in file?
    Note that DBX_HEAD.flags value 0x48 is its length!

0xE4 points to master TOC pointer list block, header with flag == 0
    may be 0 if no messages, ie no TOC
    may be equal to value at offset 0x30 if only one such block
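
The offsets listed above can be gathered in one pass.  The
following C sketch assumes the first 0x100 bytes of the file are
in memory and that all values are little-endian longs.  The type
DBX_INFO, its field names and read_dbx_info are invented for
illustration; only the offsets come from the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical container; field names are made up for this note. */
typedef struct {
    uint32_t desc_block;   /* 0x24 current Descriptive Data block */
    uint32_t desc_alloc;   /* 0x28 Descriptive Data block length  */
    uint32_t toc_current;  /* 0x30 current TOC pointer list block */
    uint32_t toc_alloc;    /* 0x34 TOC pointer list block length  */
    uint32_t msg_block;    /* 0x3C active Message Data block      */
    uint32_t msg_alloc;    /* 0x40 Message Data block length      */
    uint32_t total_msgs;   /* 0x5C total messages (apparently)    */
    uint32_t flags_str;    /* 0x88 header of the flags string     */
    uint32_t active_msgs;  /* 0xC4 active messages                */
    uint32_t toc_master;   /* 0xE4 master TOC pointer list block  */
} DBX_INFO;

/* Fill *fi from the start of the file; returns -1 if too short. */
static int read_dbx_info(const uint8_t *hdr, size_t len, DBX_INFO *fi)
{
    if (len < 0x100)
        return -1;
    fi->desc_block  = rd32(hdr + 0x24);
    fi->desc_alloc  = rd32(hdr + 0x28);
    fi->toc_current = rd32(hdr + 0x30);
    fi->toc_alloc   = rd32(hdr + 0x34);
    fi->msg_block   = rd32(hdr + 0x3C);
    fi->msg_alloc   = rd32(hdr + 0x40);
    fi->total_msgs  = rd32(hdr + 0x5C);
    fi->flags_str   = rd32(hdr + 0x88);
    fi->active_msgs = rd32(hdr + 0xC4);
    fi->toc_master  = rd32(hdr + 0xE4);
    return 0;
}
```

A caller would typically treat toc_current == 0 or
toc_master == 0 as "no messages", per the notes for offsets
0x30 and 0xE4.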


Message Data:
One should enter the first record in this linked list via
the data following a TOC_ENT in the Descriptive Data which is
in turn pointed to by the data following TOC_HEAD entries.

typedef struct _dbx_head {
DWORD lpos, // current position
    flags,  // identifies type of data
    length, // of section
    next;   // link to next section
} DBX_HEAD;

This is a linked list structure.  I've only seen the flag 0x200
in this block.  A series of links are tied together through
the next offset.  When next==0 you're at the end of the data.
Length is normally, but not always, equal to 0x200.
I think longer blocks may incorporate deleted messages, but
not sure.  If you enter the linked list data through a TOC
entry you always seem to get a series of nodes of length 0x200.
The last node in the list will have a next == 0x0 and a length
<= 0x200 (the last message record is rarely 0x200 bytes long).
However if one scans ahead to find the next message header,
you have to skip 0x200 bytes after the previous one to get to
the next so message headers seem to be located at 0x210 byte
intervals in the Message Data blocks.
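
A rough C sketch of following one message through the list,
assuming the file is in memory and that the length field counts
the message bytes stored right after the 16 byte DBX_HEAD (an
assumption based on the 0x210 spacing noted above).  The names
rd32 and trace_message are invented for this note.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * trace_message (hypothetical helper): follow DBX_HEAD.next from
 * the node at 'start' until next == 0 and return the sum of the
 * length fields, or -1 if a node looks invalid.
 * DBX_HEAD layout: lpos +0, flags +4, length +8, next +12.
 */
static int64_t trace_message(const uint8_t *buf, size_t len,
                             uint32_t start)
{
    uint32_t off = start;
    int64_t total = 0;
    size_t hops = 0;

    while (off != 0) {
        uint32_t length;

        if (len < 16 || off > len - 16)   /* header must fit   */
            return -1;
        if (rd32(buf + off) != off)       /* lpos must match   */
            return -1;
        length = rd32(buf + off + 8);
        if (length > len - 16 - off)      /* data must fit     */
            return -1;
        if (++hops > len / 16)            /* guard vs. cycles  */
            return -1;
        total += length;
        off = rd32(buf + off + 12);       /* DBX_HEAD.next     */
    }
    return total;
}
```

The flags value is deliberately not checked here, since 0x200 is
only what has been observed so far.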

TOC Pointer List:
I'm not sure about this one, it's something like this, but
I don't have big enough files to be sure.  

// TOC pointer header
typedef struct _toc_head{
DWORD lpos,     // offset equal to position in file
      flags,    // always 0?
      prev_ptr, // back pointer for doubly linked list
      next,     // next pointer
      count,   // see shift required!
      unknown; // used as filler in dbx.c during read
} TOC_HEAD;

It's easy to spot because the flags above = 0x0.

Per above, believed to be a doubly linked list.
File header offset 0x30 points to the TOC pointer currently
in use, ie the one with the most recent entries.
File header offset 0xE4 points to a master entry which may
be the same as the value at offset 0x30 (if there is only
one pointer block).  If TOC_HEAD.prev_ptr !=0 there are
additional pointer blocks which precede the master entries.

If I'm right about this it could be a true doubly linked list
if the current block was linked in, but in my examples it isn't.
You get to the current entry from file header offset 0x30.  This 
header has a forward pointer, next, to the master entry which is 
also pointed to by file header offset 0xE4.

The master entry may have no back pointer, prev_ptr = 0.  If it's
not 0, then the entry pointed to will in turn point back via its 
next.  I assume but don't have a file that proves that if there 
are 4 or more entries in the TOC pointer data (TOC_HEAD.flags == 0)
that they will be connected via the prev_ptr and next fields.

The current logic assumes one reads the TOC_HEAD structure.
This gives one the number of data entries that follow.
CAREFUL, to get the count right shift by 8 bits; the low byte
of this long has been zero in all the files I've seen.  It may
really be a flag as in the data following DESC_HEAD below.
Each data entry has 3 longs.  Of the three, only the 
function of the first is known.  It's the offset of a TOC_ENT
record.  In the only example I have, the last data entry in the 
master block pointed to by file header offset 0xE4 has non-zero
data entries at the 2nd and 3rd location which are respectively
the offset of the current TOC_HEAD block and the number of entries
it contains.  Hard to say all this in text, see dbx.c.
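
As a hedged illustration, here is a C sketch that lists the
first long of each data entry in one TOC pointer block.  It
assumes the data entries start immediately after the 24 byte
TOC_HEAD and that the count sits above the zero low byte of the
count long, so that byte is discarded with raw >> 8.  The names
rd32 and toc_entries are invented for this note; this is not the
code in dbx.c.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_HEAD_SIZE 24u   /* six DWORDs, as in TOC_HEAD   */
#define TOC_DATA_SIZE 12u   /* three longs per data entry   */

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * toc_entries (hypothetical helper): read the TOC_HEAD at
 * 'block' and copy the first long of each data entry (the
 * offset of a TOC_ENT record) into out.  Returns the number of
 * offsets stored, at most max_out, or -1 if the block looks
 * invalid.
 */
static long toc_entries(const uint8_t *buf, size_t len,
                        uint32_t block, uint32_t *out,
                        size_t max_out)
{
    uint32_t count, i;
    size_t pos;

    if (len < TOC_HEAD_SIZE || block > len - TOC_HEAD_SIZE)
        return -1;
    if (rd32(buf + block) != block)       /* lpos must match */
        return -1;
    if (rd32(buf + block + 4) != 0)       /* flags == 0x0    */
        return -1;
    count = rd32(buf + block + 16) >> 8;  /* drop low byte   */
    pos = (size_t)block + TOC_HEAD_SIZE;
    if (count > (len - pos) / TOC_DATA_SIZE)
        return -1;
    for (i = 0; i < count && i < max_out; i++)
        out[i] = rd32(buf + pos + (size_t)i * TOC_DATA_SIZE);
    return (long)i;
}
```

To visit every block one would start at the value from file
header offset 0xE4 and also follow prev_ptr and next, which this
sketch leaves out because their exact use is still uncertain.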


Descriptive Data:
I know what two types of this data are which is enough to parse
a message file.  All data in these blocks start with a DESC_HEAD.
Note rather than a flags value as in the DBX_HEAD, the second
long is the length.  You get to an individual entry via the
offset in the TOC pointer list described above.

typedef struct _desc_head {
DWORD lpos,
      length;
} DESC_HEAD;

The simplest entry is a string describing some flags for the file
(no idea what it means).  It's pointed to by file header at offset 0x88. 
It's a single NUL terminated string of length = DESC_HEAD.length-1.
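
A small C sketch of fetching this string, assuming the file is
in memory and that DESC_HEAD.length counts the bytes that follow
the 8 byte header, terminating NUL included.  The names rd32 and
flags_string are invented for this note.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * flags_string (hypothetical helper): 'off' is the value found
 * at file header offset 0x88.  Returns a pointer to the string
 * inside buf and stores its length (without the NUL) in *slen,
 * or returns NULL if the entry does not look valid.
 */
static const char *flags_string(const uint8_t *buf, size_t len,
                                uint32_t off, size_t *slen)
{
    uint32_t length;

    if (off == 0 || len < 8 || off > len - 8)
        return NULL;
    if (rd32(buf + off) != off)          /* DESC_HEAD.lpos   */
        return NULL;
    length = rd32(buf + off + 4);        /* DESC_HEAD.length */
    if (length == 0 || length > len - 8 - off)
        return NULL;
    if (buf[off + 8 + length - 1] != 0)  /* must end in NUL  */
        return NULL;
    *slen = length - 1;
    return (const char *)(buf + off + 8);
}
```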

The important entry is the table of contents, TOC, entry data.
This needs a lot more work, but I see enough to get one from
here to the Message data.

#define TOC_TYPE 0x1f  // mask for TOC entries to get type #
#pragma pack(1)  // an alternate view of longs in structure
typedef struct _toc_ent {
BYTE flag;  // some sort of bitmap
WORD data;  // if flag == 0x84 offset to message, often in 7th long
            // its in 3 bytes, ie up to 16.77 mb file
BYTE extra; // high byte of data? 
} TOC_ENT;
#pragma pack()

The data following the DESC_HEAD is variable.  After examining a
number of records I found that the first byte in the longs following
the header is a flag which indicates the function of the remaining
bytes in the long.  I don't know the meaning of the high order bits,
but masking this byte with 0x1F produces monotonically increasing
values in the range 0x0 to 0x1C.  I only recognize two of them.
Several always seem to be present {0x1,0x2,0x4-0x6,0xC-0xD,
0x10-0x14,0x1A-0x1C} and some I've never seen {0x3,0x9,0xA,0xB,
0xF,0x15}.
flag & TOC_TYPE              description
  0x1C        last in list of longs, the high order byte(s?)
              contain the length of the following string
              data, ie length to skip to next region.

  0x04        the high order three bytes represent the file offset
              of the first header, DBX_HEAD, for this message in 
              the Message Data block.  Note I have some concern
              about this, what happens if the file is longer than
              the 16.77 megabytes offset one can store in 3 byte
              binary location?
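
The two recognized types can be picked out with a short loop.
This C sketch assumes the flagged longs begin right after the
8 byte DESC_HEAD and are little-endian, so the flag is the low
byte and the remaining three bytes are value >> 8.  TOC_SCAN,
scan_toc_longs and rd32 are names invented for this note.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_TYPE       0x1f  /* mask for the type number      */
#define TYPE_MSG_OFF   0x04  /* long holds the message offset */
#define TYPE_LAST      0x1c  /* last long, holds a length     */
#define DESC_HEAD_SIZE 8u

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical result record for this sketch. */
typedef struct {
    int      have_msg;    /* nonzero if a type 0x04 long seen */
    uint32_t msg_offset;  /* file offset of first DBX_HEAD    */
    uint32_t str_len;     /* length from the type 0x1C long   */
} TOC_SCAN;

/*
 * scan_toc_longs (hypothetical helper): 'rec' points at a
 * DESC_HEAD, 'len' is the number of bytes available from there.
 * Returns the offset within rec of the byte after the type 0x1C
 * long, or -1 if no such long was found.
 */
static long scan_toc_longs(const uint8_t *rec, size_t len,
                           TOC_SCAN *ts)
{
    size_t pos;

    ts->have_msg = 0;
    ts->msg_offset = 0;
    ts->str_len = 0;
    for (pos = DESC_HEAD_SIZE; pos + 4 <= len; pos += 4) {
        uint32_t v = rd32(rec + pos);
        uint32_t type = v & TOC_TYPE;  /* flag is the low byte */

        if (type == TYPE_MSG_OFF) {
            ts->have_msg = 1;
            ts->msg_offset = v >> 8;   /* high three bytes     */
        } else if (type == TYPE_LAST) {
            ts->str_len = v >> 8;
            return (long)(pos + 4);
        }
    }
    return -1;
}
```

Since the offset comes from three bytes, msg_offset can never
exceed 0xFFFFFF, which is the limit questioned above.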

After finding a flag with type 0x1C the table of contents data
extracted from the message is stored in the following order:

8 bytes, a quad word, Win32 FILETIME = time message received
NUL terminated string, contains subject data
8 bytes, a quad word, Win32 FILETIME = time message last accessed?
NUL terminated string, contains subject data (again)
NUL terminated string, contains mail server name
NUL terminated string, contains "From:" name
NUL terminated string, contains "To:" name
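
A C sketch of walking these fields in the order listed,
assuming p points at the first byte after the type 0x1C long.
TOC_STRINGS, parse_toc_strings, take_str, rd64 and
filetime_to_unix are names invented for this note.  The FILETIME
conversion uses the standard Win32 definition (100 ns ticks
since 1601-01-01 UTC).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; pointers refer into the caller's buffer. */
typedef struct {
    uint64_t received;   /* raw FILETIME                    */
    uint64_t accessed;   /* raw FILETIME, meaning uncertain */
    const char *subject, *subject2, *server, *from, *to;
} TOC_STRINGS;

/* Read a 64-bit little-endian value from a byte buffer. */
static uint64_t rd64(const uint8_t *p)
{
    uint64_t v = 0;
    int i;

    for (i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Convert a FILETIME to seconds since 1970-01-01 UTC. */
static int64_t filetime_to_unix(uint64_t ft)
{
    return (int64_t)(ft / 10000000ULL) - 11644473600LL;
}

/* Take one NUL terminated string at *pos and advance *pos past
 * its NUL; returns NULL if no NUL is found before len. */
static const char *take_str(const uint8_t *p, size_t len, size_t *pos)
{
    const uint8_t *s, *nul;

    if (*pos > len)
        return NULL;
    s = p + *pos;
    nul = memchr(s, 0, len - *pos);
    if (nul == NULL)
        return NULL;
    *pos = (size_t)(nul - p) + 1;
    return (const char *)s;
}

/* Returns the number of bytes consumed, or -1 if data runs out. */
static long parse_toc_strings(const uint8_t *p, size_t len,
                              TOC_STRINGS *ts)
{
    size_t pos = 8;

    if (len < 8)
        return -1;
    ts->received = rd64(p);
    if ((ts->subject = take_str(p, len, &pos)) == NULL)
        return -1;
    if (len - pos < 8)
        return -1;
    ts->accessed = rd64(p + pos);
    pos += 8;
    if ((ts->subject2 = take_str(p, len, &pos)) == NULL ||
        (ts->server = take_str(p, len, &pos)) == NULL ||
        (ts->from = take_str(p, len, &pos)) == NULL ||
        (ts->to = take_str(p, len, &pos)) == NULL)
        return -1;
    return (long)pos;
}
```

The value returned can be compared with the length taken from
the type 0x1C long as a sanity check on a record.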


Sample Code:
I wrote a test program dbx.c to validate these concepts.
A self-expanding LHarc archive is available at
   http://www.conknet.com/~w_kranz/wdbx.com
It contains the following:

Listing of archive : 'WDBX.COM'

  Name          Original    Packed  Ratio   Date     Time   Attr Type  CRC
--------------  --------  -------- ------ -------- -------- ---- ----- ----
  DBX-FMT.TXT      17040      7320  43.0%  0-09-10 11:03:40 a--w -lh1- E5B1
  DBX.C            22946      7473  32.6%  0-06-22 15:18:32 a--w -lh1- A275
  DBX.EXE          34934     16104  46.1%  0-06-22 15:00:06 a--w -lh1- 8BA7
--------------  --------  -------- ------ -------- --------
     3 files       74920     30897  41.2%  0-09-10 11:03:44

In the table of contents string data listed under Descriptive
Data, the terminating 0 byte is always there, but some of these
strings may be empty.  The length associated with flag type 0x1C
is the number of bytes to skip to get to the location immediately
following the NUL for the final "To:" string.  More binary data
follows whose purpose is unknown.

dbx is an MS-DOS mode 16 bit program.
It really just demonstrates what I'm talking about and lets one
validate the format of a file.  The command line arguments
are displayed as indicated below if it is executed with no arguments:

usage: dbx <filename> [-b] [-c#] [-d] [-f] [-m[#]] [-t]
optional args mutually exculsive, default dumps all headers
-b display known block allocations in file
-c# dump contents TOC list block at hex offset = # points to
-d just displays gross file composition
-f# scan for headers with a specific flag value, # in hex
-m scan for message blocks flag = 0x200 and count messages
-m# display message data from block starting at hex #
-t to test for TOC, lists all block entries

All program output goes to standard output, you must redirect it
to a file if you want to capture it.  The default option displays
all headers, ie 4 byte long locations in file that are equal to
the file offset at that location.  They are displayed as if they
were all DBX_HEAD, if you are in the Descriptive Data section
the flags is really the length, and the other two entries should be
ignored.

-f# is similar to the default mode, but only searches for
   the specified flag value.  Nice to find where flags=0x0
   are located.

-m is similar to -f200, but has some extra logic to detect if
   message blocks are extra long, or not contiguous.  Finds
   all DBX_HEAD.flags == 0x200.

-m# traces a single message starting at the offset given and
   continuing via the DBX_HEAD.next pointer until its 0.
   Just the DBX_HEAD data is displayed to trace a single message.

-b simplistically shows how blocks were allocated.  It does not
   check the file header for allocation sizes, but assumes the
   fixed sizes I've seen in the past.

-d looks at and displays information from the fixed offsets in
   the File header block.

-t test TOC pointer list.  Assumes doubly linked list format
   described above and displays offsets to all DESC_HEAD entries
   found in the pointer lists.

-c# displays DESC_HEAD data for one message.  It must be
   given one of the offsets from the -t option above and
   parses the descriptive strings from the appropriate location.
   The file offset of the starting DBX_HEAD in the Message 
   Data block is also shown.


Unknown:
A lot.  The above is a step toward the entire format, but there
are big holes.  Using the -m option one finds message chains
that jump around, sometimes going backwards in file, then
forward again.  Pretty clearly the system can regain space lost
in deleted messages.  There must be an available block list somewhere.
I suspect it's near or in the TOC pointer list and the DBX_HEAD.flag ==
0x0 area.

I'm not sure I've got the organization of the TOC pointer list
right.  I'd love to have someone with some larger files check this,
or send me a copy of their region which I call the TOC pointer
list where DBX_HEAD.flag == 0x0.