Outlook Express Version 5.0 file format
by Will Kranz
email: w_kranz@conknet.com
web page: http://www.conknet.com/~w_kranz/

The following was done for the fun of solving a puzzle, or at least
part of one. It has been done by inspection; I have no association
with (nor love of) Microsoft. I also find they tend to radically
change the format of their files each time they come out with a new
version. Therefore I think the concepts below will only be compatible
with Version 5.0 of Outlook Express, and if there isn't a new version
out now, you can assume one will be soon!

Disclaimer:
There is no warranty, I'm not liable if this is wrong...
Use this information at your own risk.

Prior Work:
I went to http://www.wotsit.org when I became interested in Outlook
Express 5.0 files. In the Windows section for the DBX file type I
found dbx.zip. A great starting point. I might never have gotten
anywhere without the work of Simon Craythorn, who was later assisted
by Jason Miller. I wrote Simon and he sent me an updated file with
Jason's fix(es). Quoting part of dbx.txt, which is part of the Wotsit
distribution, Simon describes his algorithm as follows:

"We do however know what a message header structure looks like. We
also know that this is always on a 4 byte boundary. Looking at the
file I can also see that a message ALWAYS starts with an RFC822
Message Header. So, by looking through the file, 4 bytes at a time we
can locate the start of messages. Chain them together and export the
result. It's slow, it's crude, but it works! "

Simon's sample code, 0E5.cpp, scans the file looking for the start of
a message header by looking for the key RFC822 strings. Then he
traverses the message, which he determined has a linked list
structure. There is some code in the routine oe5__import_message()
which frankly I don't understand. Simon says it works, and since I
never actually got around to the final step of dumping messages I
guess I believe him!
The code in question advances through the file in steps of 65536
while header.lSectionSize > 512. I think this is not required if you
get to the start of the message via the table of contents as outlined
below.

File Format Overview:
First let me say that I'm really only dealing with the *.dbx files
that contain messages. There are others in the Outlook directory with
the *.dbx extension which don't follow this format, ie folders.dbx
and pop3uidl.dbx. What I describe here is a subset of the entire
format, and there are many areas where I have NO idea as to what is
going on. I have some understanding of the following four sections:

File Header:
   Appears to extend to 0x2ad4, although what I understand occurs in
   the first 0x100 bytes.

Descriptive Data:
   Contains various string data, see DESC_HEAD. To date I've only
   seen these allocated in 0xC000 byte block lengths.

Message Data:
   Contains the messages as a series of linked lists. See DBX_HEAD,
   this is the structure identified by Simon Craythorn (thanks).
   All message data has flags == 0x200. To date I've only seen these
   allocated in 0xF780 byte block lengths.

TOC Pointer list:
   Contains pointers to the table of contents data, ie offsets to
   DESC_HEAD locations. I believe this is some form of doubly linked
   list, but don't have a big enough file to be sure. See TOC_HEAD,
   which is a superset of DBX_HEAD and appears to have flags == 0x0.
   To date I've only seen these allocated in 0x3e1c byte block
   lengths. Typically this will leave space for 25 entries of length
   0x27C, which is the apparent size of an individual TOC_HEAD entry
   with its associated data.

The extremely useful tip Simon gives is that a valid header only
occurs on a modulo 4 byte boundary. The unsigned long that begins
such a header is equal to the offset in the file of this header.
Hence one searches for headers by looking for longs whose value
equals the file offset at which they are stored. Outlook appears to
allocate fairly large blocks for each type of data.
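
The header search described above can be sketched in C as follows.
This is only an illustration under stated assumptions, not code taken
from dbx.c or from Simon's program: the names rd32 and find_headers
are hypothetical, the file is assumed to be already loaded into
memory, and longs are assumed to be stored in Intel (little-endian)
byte order. A match is only a candidate header; data that happens to
equal its own offset would also be reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit long stored low byte first (assumed byte order). */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Scan buf from offset start (rounded up to a 4 byte boundary) to len,
 * looking for longs whose value equals their own offset.
 * Stores at most max candidate offsets in out[] and returns the total
 * number of candidates seen, which may be larger than max.
 */
static size_t find_headers(const unsigned char *buf, size_t len,
                           size_t start, uint32_t *out, size_t max)
{
    size_t n = 0;
    size_t off;

    assert(buf != NULL);
    for (off = (start + 3) & ~(size_t)3; off + 4 <= len; off += 4) {
        if (rd32(buf + off) == (uint32_t)off) {
            if (n < max)
                out[n] = (uint32_t)off;   /* remember candidate header */
            n++;
        }
    }
    return n;
}
```

One would typically pass a start offset beyond the file header region
so that the fixed header fields are not scanned as if they were block
headers.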
A *.dbx file will always have a file header region. As soon as there
are one or more messages it will also have the three other regions
described above. I don't see a pattern to the order in which things
are allocated: sometimes the Message Data is at the next available
location, 0x2ad4, and sometimes it will be the Descriptive Data.
However one can determine some of this from key offsets in the File
Header region. When one of the previously allocated blocks fills, an
additional block is allocated at the end of the file, allowing the
system to grow.

File Header:
I've identified the following offsets at the beginning of the file
that provide data about the rest of the file:

Offset   Description
0x0      1st 16 bytes probably a file GUID flag, see shortcut.pdf
         Same bytes for all message files, but not for all *.dbx.
0x24     points to current block of what I'm calling Descriptive
         Data, TOC data and flag 0x48 above
         entries appended in this block
         Typically 0xC000 bytes reserved for TOC data on first whack
0x28     is allocation length for each Descriptive Data block = 0xC000
0x30     points to current TOC header list block, header with
         flag == 0
         may be 0 if no message, ie no TOC
         Each flag = 0 block 0x27c bytes long, 1 block contains 25
         of these
         May be more data after this, see inbox.dbx, and offset 0x7c
0x34     is allocation length for each TOC pointer list block
         = 0x3e1c bytes
0x3C     points to active Message Data block header.
         Typically DBX_HEAD.flags == 0x200
         This is where one adds next message
         Allocation seems to be 0xF780. TOC list records use a
         3 byte offset, ie don't see how to exceed offset of
         0xFFFFFF, but this allows files up to 16.77 Megs, then
         what happens?
0x40     is allocation length for Message Data size = 0xF780
0xC4     active messages in file <= value at offset 0x5C
0x5C     looks like total messages in file (including deleted)
0x7C     points to what might be next available block?
         typically near EOF, but zeros in this area in all files
         that I've seen.
0x88     points to string describing flags in file
         all *.dbx with messages seem to have one of these...
         it's a string per below in all cases where it exists,
         ie if 0x88 != 0, there are messages in file?
         defines some flags for the file. Note that the
         DBX_HEAD.flags value 0x48 is its length!
0xE4     points to master TOC pointer list block, header with
         flag == 0
         may be 0 if no messages, ie no TOC
         may be equal to value at offset 0x30 if only one such block

Message Data:
One should enter the first record in this linked list via the data
following a TOC_ENT in the Descriptive Data, which is in turn pointed
to by the data following TOC_HEAD entries.

typedef struct _dbx_head {
DWORD lpos,     // current position
      flags,    // identifies type of data
      length,   // of section
      next;     // link to next section
} DBX_HEAD;

This is a linked list structure. I've only seen the flag 0x200 in
this block. A series of links are tied together through the next
offset. When next == 0 you're at the end of the data. Length is
normally, but not always, equal to 0x200. I think longer blocks may
incorporate deleted messages, but I'm not sure. If you enter the
linked list data through a TOC entry you always seem to get a series
of nodes of length 0x200. The last node in the list will have
next == 0x0 and a length <= 0x200 (the last message record is rarely
0x200 bytes long). However if one scans ahead to find the next
message header, you have to skip 0x200 bytes after the previous one
to get to the next, so message headers seem to be located at 0x210
byte intervals in the Message Data blocks.

TOC Pointer List:
I'm not sure about this one. It's something like this, but I don't
have big enough files to be sure.

// TOC pointer header
typedef struct _toc_head {
DWORD lpos,     // offset equal to position in file
      flags,    // always 0?
      prev_ptr, // back pointer for doubly linked list
      next,     // next pointer
      count,    // see shift required!
      unknown;  // used as filler in dbx.c during read
} TOC_HEAD;

It's easy to spot because the flags above = 0x0.
Per above, believed to be a doubly linked list. File header offset
0x30 points to the TOC pointer block currently in use, ie the one
with the most recent entries. File header offset 0xE4 points to a
master entry, which may be the same as the value at offset 0x30 (if
there is only one pointer block). If TOC_HEAD.prev_ptr != 0 there are
additional pointer blocks which precede the master entries. If I'm
right about this it could be a true doubly linked list if the current
block was linked in, but in my examples it isn't. You get to the
current entry from file header offset 0x30. This header has a forward
pointer, next, to the master entry, which is also pointed to by file
header offset 0xE4. The master entry may have no back pointer,
prev_ptr = 0. If it's not 0, then the entry pointed to will in turn
point back via its next. I assume, but don't have a file that proves,
that if there are 4 or more entries in the TOC pointer data
(TOC_HEAD.flags == 0) they will be connected via the prev_ptr and
next fields.

The current logic assumes one reads the TOC_HEAD structure. This
gives one the number of data entries that follow. CAREFUL, to get the
count shift right by 8 bits; the low byte of this long has been zero
in all the files I've seen. It may really be a flag, as in the data
following DESC_HEAD below. Each data entry has 3 longs. Of the three,
only the function of the first is known: it's the offset of a TOC_ENT
record. In the only example I have, the last data entry in the master
block pointed to by file header offset 0xE4 has non-zero data at the
2nd and 3rd locations, which are respectively the offset of the
current TOC_HEAD block and the number of entries it contains. Hard to
say all this in text, see dbx.c.

Descriptive Data:
I know what two types of this data are, which is enough to parse a
message file. All data in these blocks start with a DESC_HEAD. Note
that rather than a flags value as in the DBX_HEAD, the second long is
the length.
You get to an individual entry via the offset in the TOC pointer list
described above.

typedef struct _desc_head {
DWORD lpos,
      length;
} DESC_HEAD;

The simplest entry is a string describing some flags for the file (no
idea what it means). It's pointed to by the file header at offset
0x88. It's a single NUL terminated string of
length = DESC_HEAD.length-1.

The important entry is the table of contents, TOC, entry data. This
needs a lot more work, but I see enough to get one from here to the
Message Data.

#define TOC_TYPE 0x1f  // mask for TOC entries to get type #

#pragma pack(1)
// an alternate view of longs in structure
typedef struct _toc_ent {
BYTE flag;   // some sort of bitmap
WORD data;   // if flag == 0x84 offset to message, often in 7th long
             // its in 3 bytes, ie up to 16.77 mb file
BYTE extra;  // high byte of data?
} TOC_ENT;
#pragma pack()

The data following the DESC_HEAD is variable. After examining a
number of records I found that the first byte in each of the longs
following the header is a flag which indicates the function of the
remaining bytes in the long. I don't know the meaning of the high
order bits, but masking this byte with 0x1F produces monotonically
increasing values in the range 0x0 to 0x1C. I only recognize two of
them, several always seem to be present {0x1,0x2,0x4-0x6,0xC-0xD,
0x10-0x14,0x1A-0x1C} and some I've never seen {0x3,0x9,0xA,0xB,
0xF,0x15,0x1c}.

flag & TOC_TYPE   description
0x1C              last in list of longs, the high order byte(s?)
                  contain the length of the following string data,
                  ie length to skip to next region.
0x04              the high order three bytes represent the file
                  offset of the first header, DBX_HEAD, for this
                  message in the Message Data block.

Note I have some concern about this: what happens if the file is
longer than the 16.77 megabyte offset one can store in a 3 byte
binary location?
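
The two recognized flag types can be decoded with a loop along the
following lines. This is a minimal sketch, not the logic from dbx.c:
the names toc_info, parse_toc_longs, TOC_T_MSGOFF and TOC_T_LAST are
hypothetical, the input pointer is assumed to address the first long
after the 8 byte DESC_HEAD, and longs are assumed to be stored low
byte first. For type 0x1C the text above is unsure how many high
order bytes hold the length; this sketch simply takes all three,
which is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_TYPE     0x1f  /* mask from the text above */
#define TOC_T_MSGOFF 0x04  /* hypothetical name: offset of first DBX_HEAD */
#define TOC_T_LAST   0x1c  /* hypothetical name: last long in the list */

typedef struct {
    uint32_t msg_offset;   /* from the type 0x04 long, 0 if none seen */
    uint32_t str_length;   /* value carried by the type 0x1C long */
    size_t   str_start;    /* index in p[] just past the 0x1C long */
} toc_info;

/*
 * Walk the longs following a DESC_HEAD. The low byte of each long is
 * the flag, the upper three bytes are its value. Unrecognized types
 * are skipped. Returns 0 once the type 0x1C long is found, or -1 if
 * len is exhausted first.
 */
static int parse_toc_longs(const unsigned char *p, size_t len,
                           toc_info *out)
{
    size_t i;

    assert(p != NULL && out != NULL);
    out->msg_offset = 0;
    out->str_length = 0;
    out->str_start = 0;
    for (i = 0; i + 4 <= len; i += 4) {
        unsigned type = p[i] & TOC_TYPE;
        uint32_t value = (uint32_t)p[i + 1] |
                         ((uint32_t)p[i + 2] << 8) |
                         ((uint32_t)p[i + 3] << 16);

        if (type == TOC_T_MSGOFF) {
            out->msg_offset = value;
        } else if (type == TOC_T_LAST) {
            out->str_length = value;
            out->str_start = i + 4;
            return 0;
        }
    }
    return -1;
}
```

For example, a long stored as the bytes 84 45 23 01 has flag 0x84,
which masks to type 0x04, and yields a message offset of 0x012345.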
After finding a flag with type 0x1C the table of contents data
extracted from the message is stored in the following order:

   8 bytes, a quad word, Win32 FILETIME = time message received
   NUL terminated string, contains subject data
   8 bytes, a quad word, Win32 FILETIME = time message last accessed?
   NUL terminated string, contains subject data (again)
   NUL terminated string, contains mail server name
   NUL terminated string, contains "From:" name
   NUL terminated string, contains "To:" name

The terminating 0 byte is always there, but some of these strings may
be empty. The length associated with flag type 0x1C is the number of
bytes to skip to get to the location immediately following the NUL
for the final "To:" string above. More binary data follows whose
purpose is unknown.

Sample Code:
I wrote a test program, dbx.c, to validate these concepts. A
self-expanding LHarc archive is available at
http://www.conknet.com/~w_kranz/wdbx.com
It contains the following:

Listing of archive : 'WDBX.COM'

Name           Original   Packed  Ratio   Date     Time    Attr Type  CRC
-------------- -------- -------- ------ -------- -------- ---- ----- ----
DBX-FMT.TXT       17040     7320  43.0%  0-09-10 11:03:40 a--w -lh1- E5B1
DBX.C             22946     7473  32.6%  0-06-22 15:18:32 a--w -lh1- A275
DBX.EXE           34934    16104  46.1%  0-06-22 15:00:06 a--w -lh1- 8BA7
-------------- -------- -------- ------ -------- --------
     3 files      74920    30897  41.2%  0-09-10 11:03:44

It's an MSDOS mode 16 bit program. It really just demonstrates what
I'm talking about and lets one validate the format of a file. The
command line arguments are displayed as indicated below if it is
executed with no arguments:
usage: dbx [-b] [-c#] [-d] [-f] [-m[#]] [-t]
optional args mutually exculsive, default dumps all headers
   -b   display known block allocations in file
   -c#  dump contents TOC list block at hex offset = # points to
   -d   just displays gross file composition
   -f#  scan for headers with a specific flag value, # in hex
   -m   scan for message blocks flag = 0x200 and count messages
   -m#  display message data from block starting at hex #
   -t   to test for TOC, lists all block entries

All program output goes to standard output; you must redirect it to a
file if you want to capture it.

The default option displays all headers, ie 4 byte long locations in
the file that are equal to the file offset at that location. They are
displayed as if they were all DBX_HEAD. If you are in the Descriptive
Data section the flags is really the length, and the other two
entries should be ignored.

-f# is similar to the default mode, but only searches for the
specified flag value. Nice to find where flags=0x0 are located.

-m is similar to -f200, but has some extra logic to detect if message
blocks are extra long, or not contiguous. Finds all
DBX_HEAD.flags == 0x200.

-m# traces a single message starting at the offset given and
continuing via the DBX_HEAD.next pointer until it's 0. Just the
DBX_HEAD data is displayed to trace a single message.

-b simplistically shows how blocks were allocated. It does not check
the file header for allocation sizes, but assumes the fixed sizes
I've seen in the past.

-d looks at and displays information from the fixed offsets in the
File Header block.

-t tests the TOC pointer list. Assumes the doubly linked list format
described above and displays offsets to all DESC_HEAD entries found
in the pointer lists.

-c# displays DESC_HEAD data for one message. It must be given one of
the offsets from the -t option above and parses the descriptive
strings from the appropriate location. The file offset of the
starting DBX_HEAD in the Message Data block is also shown.

Unknown:
A lot.
The above is a step toward the entire format, but there are big
holes. Using the -m option one finds message chains that jump around,
sometimes going backwards in the file, then forward again. Pretty
clearly the system can regain space lost in deleted messages. There
must be an available block list somewhere. I suspect it's near or in
the TOC pointer list and the DBX_HEAD.flags == 0x0 area.

I'm not sure I've got the organization of the TOC pointer list right.
I'd love to have someone with some larger files check this, or send
me a copy of their region which I call the TOC pointer list, where
DBX_HEAD.flags == 0x0.