1. Header Block
2. Big Block Depot
3. Big Data Blocks
3.1 Small Block Depot
3.2 Property Set Storage
3.3 Property Storage
3.4 Where Are The Small Data Blocks?
3.5 Property Sets
4. Trash Blocks
Table 1: LAOLA Header
Table 2: Property Storage
When looking at the document binaries I soon got confused. At first glance
the document file format seemed to differ very much from that of files stored
with Word 2. On closer inspection it became clear that some very familiar
binary pieces were stored within that new-looking data. In fact a Word 6
document is more or less a Word 2 format document stored among additional data.
This additional data makes up a kind of file system.
As far as I know there is no publicly available information source on how
this file system (document format) works. In general I hold that the producing
industry has a duty to show what ingredients are in its products. In the case
of authoring systems I think that people ought to know what kind of
information is in their (possibly publicly distributed) documents: for
example, whether documents contain information about the creation date,
printer(s), directory structure or serial numbers, or, even worse, whether
documents contain or might contain other private data.
To summarize this topic for the file format described below: it always
stores the last modification date of objects. Because of a bad
implementation for Windows 3.x and older distributions of 32 bit Windows
systems it still always contains some "data trash" sections. These sections
might contain personal data. Of course, depending on the authoring program,
other private data might be stored invisibly, too.
This text does *not* explain how a Microsoft Word file is structured.
This text does explain how the file system works that more recent Windows
programs like Microsoft Word use to store their documents. So actually it
should be called the OLE file system, as the philosophy behind this file
system is Microsoft's OLE / COM technology. But for lack of any binary level
technical specification on this topic, my explanation might differ in
some cases or even be wrong. In those cases I would certainly not be
explaining the OLE file system, but something similar. So I decided to
choose a similar name, too. The name is LAOLA.
Copying.
This file and the source code referenced here are distributed under the
terms of Version 2 of the GNU General Public License from June 1991. If you
have no copy you should find one here.
This publication is made without regard to any possible patent
protection. Trade names are used without any guarantee that they may be
used freely.
Actually this work could have been done well by adopting one of the
popular archive formats, like the whole "zip" family. But Microsoft went
their own way. I think this was not only because of their market philosophy,
but also because the intention to develop a file system seems to have been
directed by their "OLE philosophy", which in a way demands hierarchical
file structures.
Unfortunately Microsoft did not include mechanisms to assure the
well-being of a document. So if, e.g., a Laola document gets corrupted, this
normally goes undetected. If somebody tampers with the document, normally
nobody will notice. If a document contains much unused space, it will
normally stay that way. Microsoft's strategy involved no compensation for the
disadvantages of a new file system.
What is a Laola file?
In short, a Laola file is an archive. The archive can hold files and
directories. Each archive entry has a 0x80 byte long info block. To store the
files the archive maintains a list of big data blocks and a list of
small data blocks. Files with a size of less than 0x1000 bytes are stored
in the small data blocks, the other files in big data blocks.
Data types.
Laola files have three basic data types:
Blocks.
In a first step each Laola file is divided into 0x200 (512) byte "big
blocks", so each Laola file's size is a multiple of 0x200 bytes. Each
block is assigned a number, starting with -1 for the first 0x200 bytes,
then counting upwards. So the file is made up of the set of blocks:
Example:
How to read a block chain.
Start at position 1. So the first block belonging to root is block 1. The
value of the big block depot's entry at position 1 is 0x00000005. So the
next block belonging to root is block 5. The value of the big block depot's
entry at position 5 is 0xfffffffe (-2). That means: here is the end of
the chain. So "root" finally consists of the blocks {1, 5}.
From the header the value of the variable $sbd_startblock is also known. Try
to find its value, then try to get the corresponding block chain! (If you
want to see the solution, look at the end of this document.)
Note: when reading in a chain, only the values "0 .. $maxblock" and "-2"
are ok. If other values occur in a chain, an error has occurred.
Note: The small block depot may be absent. In that case $sbd_startblock
is 0xfffffffe (-2).
Summary: with the help of header block and big block depot the values of
the big block lists: @root_list and @sbd_list are known.
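The chain walk described above can be sketched in a few lines of Python. Python is used here only for illustration (the tools mentioned in this text are Perl scripts), and the names parse_depot and get_chain are hypothetical. The depot bytes are copied from the big block depot example dump; the value 11 for $maxblock is an assumption taken from the highest block number occurring in that example, and the loop guard is an addition that the text does not ask for.

```python
import struct

END_OF_CHAIN = -2

def parse_depot(raw):
    """Turn depot bytes into a list of signed 4 byte entries."""
    return list(struct.unpack("<%dl" % (len(raw) // 4), raw))

def get_chain(depot, start, max_block):
    """Follow a block chain from 'start' until the end marker -2."""
    chain = []
    block = start
    while block != END_OF_CHAIN:
        valid = 0 <= block <= max_block and block < len(depot)
        if not valid or block in chain:  # 'in chain' guards against loops
            raise ValueError("bad chain entry: %d" % block)
        chain.append(block)
        block = depot[block]
    return chain

# Big block depot of the example (file offset 0x200), padded with "unused".
raw_depot = bytes.fromhex(
    "fdffffff05000000feffffff04000000"
    "06000000feffffff0700000008000000"
    "090000000a0000000b000000feffffff"
).ljust(0x200, b"\xff")
depot = parse_depot(raw_depot)

root_list = get_chain(depot, 1, 11)  # $root_startblock is 1
print(root_list)                     # [1, 5]
```

A start block of -2 yields an empty list, which matches the case of an absent small block depot.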
Each pps can have a successor and a predecessor. Each pps can also be a
directory (or "storage"). $pps_prev, $pps_next and $pps_dir refer to the
ascending numbers of the 0x80 byte blocks as mentioned above. So in the
example the pps starting at 0x400 gets the handle (number) 0, the pps starting
at 0x480 gets the handle 1, the pps starting at 0x500 gets the handle 2,
and so on. When these links are followed carefully, the result is an ordered
listing of pps (look at function get_pps_chain in "laola.pl").
Property types.
Each pps has one of these three types:
If $pps_size is not zero, $pps_sb points to the starting block of the
corresponding property. The starting block refers to the big block depot
if $pps_size is greater than or equal to 0x1000 (4096) bytes. If the
property's size is smaller, $pps_sb refers to the small block depot. There
is one exception: $pps_sb of the Root entry (always pps 0) always refers
to the big block depot.
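The rule for choosing the depot can be written down as a tiny Python function. This is only a sketch: the function name is hypothetical, and the type value 5 for the root entry is taken from the pps description in section 3.3.

```python
PPS_ROOT = 5  # value of $pps_type for the root entry

def uses_big_block_depot(pps_type, pps_size):
    """True if $pps_sb has to be looked up in the big block depot."""
    return pps_type == PPS_ROOT or pps_size >= 0x1000

print(uses_big_block_depot(2, 0x0fff))  # False: small block depot
print(uses_big_block_depot(2, 0x1000))  # True
print(uses_big_block_depot(5, 0x0f00))  # True: root entry exception
```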
It is now easy to read the "files" in: the big or small block list has
to be fetched (as was done before with root_list and sbd_list) from the big
or small block depot, the blocks referred to in this way have to be read,
and as the last step the data might have to be truncated to fit $pps_size.
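For a property stored in big blocks, the three steps (take the chain, read the blocks, truncate) might look like the following Python sketch. The function name and the synthetic test image are made up for illustration; the only facts taken from the text are the block size of 0x200 and the numbering that starts with -1 for the header.

```python
BIG_BLOCK_SIZE = 0x200

def read_big_blocks(data, chain, size):
    """Concatenate the big blocks of 'chain' and cut the result to 'size'."""
    parts = []
    for block in chain:
        offset = (block + 1) * BIG_BLOCK_SIZE  # block -1 is the header
        parts.append(data[offset:offset + BIG_BLOCK_SIZE])
    return b"".join(parts)[:size]

# Synthetic image: header block plus blocks 0, 1, 2 filled with A, B, C.
image = b"H" * 0x200 + b"A" * 0x200 + b"B" * 0x200 + b"C" * 0x200
content = read_big_blocks(image, [2, 0], 0x250)
print(len(content))  # 592
```

The bytes cut off at the end are exactly the part of the last block that does not belong to the property.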
If the type of a pps is root or storage, at least the variables $pps_ts2d
and $pps_ts2s get initialized. Together these variables form a 64 bit
integer that represents time and date. This value counts in units of
10^-7 seconds, starting at 01/01/1601 00:00.
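As a sketch, the conversion could be done like this in Python. It assumes that $pps_ts2d is the upper and $pps_ts2s the lower half of the 64 bit value, and it says nothing about the time zone, which this text does not specify. The function name is hypothetical; the two input values are read from offsets 0x6c and 0x70 of the Root Entry example dump.

```python
from datetime import datetime, timedelta

def laola_time(ts_s, ts_d):
    """Combine the two longs and convert 10^-7 second ticks to a datetime."""
    ticks = (ts_d << 32) | ts_s
    # timedelta has microsecond resolution, so the last digit is dropped
    return datetime(1601, 1, 1) + timedelta(microseconds=ticks // 10)

stamp = laola_time(0x1ff62986, 0x01bb57ad)
print(stamp.isoformat())
```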
If the type is root, $pps_sb is pointing to the first big block of the
small block list @sb_list. See just below:
Summary and further information are
- still to be done ! -
Some blocks consist only partially of trash; they could be called
stinky blocks. This is because the size of a property fits exactly
to the 0x200 (0x40 for small blocks) byte boundary only by chance. So the
rest of the last block of a chain nearly always contains some
bytes of rubbish.
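How many bytes of such rubbish a property drags along follows directly from its size. A small Python sketch (the function name is hypothetical):

```python
def slack_bytes(size, small=False):
    """Unused bytes in the last block of a property of the given size."""
    block_size = 0x40 if small else 0x200
    return -size % block_size

print(hex(slack_bytes(0xf00)))         # 0x100
print(slack_bytes(0x40, small=True))   # 0
print(slack_bytes(1, small=True))      # 63
```

The first call uses the size 0xf00 of the Root Entry example, whose last big block is therefore only half used.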
Like all trash, data trash is troublesome. In some cases it is simply
annoying because it blows up the file's size. In every case it is relevant
to data security. Because you cannot know what is in there, you lack
control over your own data. It has even been reported on Usenet that
private mail and encrypted passwords ended up in a Word file.
The good thing is that, unlike nuclear trash, this data trash is removable.
Just look at the demonstration program "lclean" in the source code section of
this document. As for data trash, I have heard that Microsoft knows
about this OLE bug and provides a fix for 32 bit Windows systems. However,
if you use Windows 3.1 you will probably have to rely on lclean.
Comments appreciated!
Martin
- The End -
A - Preface
One day I started writing a program that should have access to documents
created with Microsoft Word for Windows 6. I wanted to keep it portable, so it
was necessary not to use methods specific to operating systems. So I decided
to learn to understand the binary structure of the documents.
B - General
Digital documents tend to consist of more than just one file. When storing
or exchanging such documents it used to be a problem to bind all those
files together and to be sure about the same hierarchy conventions. The LAOLA
file format allows files to be stored in hierarchical order within a single
file. This file may contain one or more directories, each of which may
contain one or more files and directories.
C - Basics
No guarantee!
The things described here are assumptions, based mainly on speculation and
experiment and only a little on documentation. Although I'm sure things are
pretty ok, there might be errors!
1. 4 byte integers ("long") 0x12345678 -> 0x78 0x56 0x34 0x12
2. 2 byte integers ("word") 0x1234 0x5678 -> 0x34 0x12 0x78 0x56
3. 1 byte integers ("char") 0x12 0x34 0x56 0x78 -> 0x12 0x34 0x56 0x78
Integers are stored in "little endian" ("VAX", "x86") mode. That means they
are stored byte by byte, the least significant byte first. Char streams are
stored first in, first out.
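The byte order can be checked with a short Python sketch using the standard struct module. Python is chosen here only for illustration; the byte sequences are the ones from the list of data types above.

```python
import struct

# 4 byte integer ("long"), least significant byte first
(long_value,) = struct.unpack("<L", bytes([0x78, 0x56, 0x34, 0x12]))

# two 2 byte integers ("word")
word_a, word_b = struct.unpack("<HH", bytes([0x34, 0x12, 0x78, 0x56]))

# signed reading of a long, as used for markers like -2 (0xfffffffe)
(marker,) = struct.unpack("<l", bytes([0xfe, 0xff, 0xff, 0xff]))

print(hex(long_value), hex(word_a), hex(word_b), marker)
# 0x12345678 0x1234 0x5678 -2
```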
file <=> union of big blocks {-1, 0, 1 .. $maxblock},
($maxblock = (sizeof(file)-1) / 0x200 - 1), $maxblock in {1, 2, ..}
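A minimal Python sketch of this block arithmetic, assuming integer division in the formula for $maxblock. The function names are hypothetical, and the file size 0x1a00 is just an example value (header plus blocks 0 to 11).

```python
BIG_BLOCK_SIZE = 0x200

def max_block(file_size):
    """Highest big block number of a file with the given size."""
    return (file_size - 1) // BIG_BLOCK_SIZE - 1

def block_offset(block_number):
    """File offset of a big block; block -1 (the header) starts at 0."""
    return (block_number + 1) * BIG_BLOCK_SIZE

print(max_block(0x1a00))      # 11
print(hex(block_offset(-1)))  # 0x0
print(hex(block_offset(0)))   # 0x200
print(hex(block_offset(2)))   # 0x600
```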
Basic parts.
The big blocks divide the file into the four basic parts:
1. Header Block
The header consists of the first block (block -1, running from file offset
0x00 to 0x1ff). The header starts with the eight byte long hex string
{d0 cf 11 e0 a1 b1 1a e1}. The header contains some basic structure
information. Its function is explained later in this document.
00000: d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00 00 00
00010: 00 00 00 00 00 00 00 00 3b 00 03 00 fe ff 09 00
00020: 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
00030: 01 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00
00040: 01 00 00 00 fe ff ff ff 00 00 00 00 00 00 00 00
00050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff *
00: stream $laola_id identifier {d0 cf 11 e0 a1 b1 1a e1}
2c: long $num_of_bbd_blocks Number of big block depot blocks
30: long $root_startblock Root chain's first big block
3c: long $sbd_startblock small block depot's first big block
4c[]: long $bbd_list[i] array of $num_of_bbd_blocks big block numbers
(for detailed info look at: Table 1)
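Reading these header fields could look like the following Python sketch. The function name and the dictionary keys are hypothetical (they mirror the variable names used in this text); the test data is the example dump from above, padded with 0xff to a full block.

```python
import struct

LAOLA_ID = bytes.fromhex("d0cf11e0a1b11ae1")

def parse_header(block):
    """Extract the known fields from the 0x200 byte header block."""
    if block[:8] != LAOLA_ID:
        raise ValueError("not a Laola file")
    (num_of_bbd_blocks,) = struct.unpack_from("<l", block, 0x2c)
    (root_startblock,) = struct.unpack_from("<l", block, 0x30)
    (sbd_startblock,) = struct.unpack_from("<l", block, 0x3c)
    bbd_list = list(
        struct.unpack_from("<%dl" % num_of_bbd_blocks, block, 0x4c))
    return {
        "num_of_bbd_blocks": num_of_bbd_blocks,
        "root_startblock": root_startblock,
        "sbd_startblock": sbd_startblock,
        "bbd_list": bbd_list,
    }

example_header = bytes.fromhex(
    "d0cf11e0a1b11ae10000000000000000"
    "00000000000000003b000300feff0900"
    "06000000000000000000000001000000"
    "01000000000000000010000002000000"
    "01000000feffffff0000000000000000"
).ljust(0x200, b"\xff")

header = parse_header(example_header)
print(header["root_startblock"], header["bbd_list"])  # 1 [0]
```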
2. Big Block Depot
The big block depot manages the big blocks. Big blocks have a length of
exactly 0x200 (512) bytes. Often the big block depot will consist of exactly
one big block.
big block depot <=> union of big blocks {bbd_list[i]},
bbd_list consists of $num_of_bbd_blocks elements, stored from
position <header:4c> on.
Example:
00200: fd ff ff ff 05 00 00 00 fe ff ff ff 04 00 00 00
00210: 06 00 00 00 fe ff ff ff 07 00 00 00 08 00 00 00
00220: 09 00 00 00 0a 00 00 00 0b 00 00 00 fe ff ff ff
00230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff *
The big block depot represents a table of block numbers whose index
starts at zero. Entry 0, in this example with the value 0xfffffffd (-3),
refers to block 0. Entry 1, with the value 0x00000005, refers to block 1.
Entry 2 refers to block 2 ... and so on. Each entry may have one of these
values:
0xfffffffd (-3) : this block is a special block
0xfffffffe (-2) : end of chain
0xffffffff (-1) : unused
0 .. $maxblock : next element of chain (a big block number)
$maxblock+1 .. : not defined
In the header the variable $root_startblock has been initialized; the
example gives it the value 1. This value tells which block is the first
in a chain of blocks belonging to the "root". In the example it would be
read out as follows:
3. Big Data Blocks
3.1 Small Block Depot
The small block depot manages the small blocks. Small blocks have a length
of exactly 0x40 (64) bytes. Often the small block depot will consist of
exactly one big block. Some documents do not have a small block depot at all.
small block depot <=> union of big blocks {sbd_list[i]},
sbd_list consists of (number of chain_elements(sbd_list))
elements. The list is read out from the big block depot,
the list's start is $sbd_startblock (-> Section 2)
Example:
00600: 01 00 00 00 fe ff ff ff ff ff ff ff ff ff ff ff
00610: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
The entries of the small block depot do *not* refer to absolute positions
in the file, as the entries of the big block depot do. They refer to
positions in the "file" made up of the big blocks listed in the big
block list @sb_list. This list is given by the root entry of the property
storage. So the property storage has to be explained first.
3.2 Property Set Storage
From section 2 the values of the big block list @root_list are known.
These blocks contain the property set storage (ppss) blocks.
property set storage blocks <=> union of big blocks {root_list[i]},
root_list consists of (number of chain_elements(root_list))
elements. The list is read out from the big block depot,
the list's start is $root_startblock (-> Section 2)
3.3 Property Storage
The ppss blocks (-> section 3.2) are split into 0x80 byte blocks. Each of
these 0x80 byte blocks forms a property storage (pps). Every pps gets a
number, starting with zero. A pps refers to a "file" or represents a
"directory".
Example:
00400: 52 00 6f 00 6f 00 74 00 20 00 45 00 6e 00 74 00 R o o t E n t
00410: 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00 r y
00420: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00430: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00440: 16 00 05 00 ff ff ff ff ff ff ff ff 03 00 00 00
00450: 00 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46
00460: 00 00 00 00 00 00 00 00 00 00 00 00 86 29 f6 1f
00470: ad 57 bb 01 03 00 00 00 00 0f 00 00 00 00 00 00
40: word $pps_sizeofname size of $pps_rawname
42: byte $pps_type type of pps (1=storage|2=stream|5=root)
44: long $pps_prev previous pps
48: long $pps_next next pps
4c: long $pps_dir directory pps
74: long $pps_sb starting block of property
78: long $pps_size size of property
(for detailed info look at: Table 2)
The first 0x40 bytes are reserved for the name of the pps. The length of the
name is stored in $pps_sizeofname. The name can be converted to an ASCII string
$pps_name just by removing every second char. In this example the length
of the name is 0x16, and in the end $pps_name is "Root Entry\00". The
C-style trailing zero should be removed. If $pps_sizeofname
is zero, then this 0x80 byte block is no pps and has to be ignored.
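Decoding one 0x80 byte pps along these lines might look like this in Python. It is only a sketch: parse_pps and the dictionary keys are hypothetical names, the name is converted exactly as described above (every second byte is dropped), and the input is the Root Entry example dump.

```python
import struct

def parse_pps(raw):
    """Decode one 0x80 byte pps; returns None if it has to be ignored."""
    (sizeofname,) = struct.unpack_from("<H", raw, 0x40)
    if sizeofname == 0:
        return None
    # keep every second byte of the raw name, then drop the C-style zero
    name = raw[0:sizeofname:2].decode("latin-1").rstrip("\x00")
    pps_prev, pps_next, pps_dir = struct.unpack_from("<lll", raw, 0x44)
    pps_sb, pps_size = struct.unpack_from("<ll", raw, 0x74)
    return {"name": name, "type": raw[0x42], "prev": pps_prev,
            "next": pps_next, "dir": pps_dir, "sb": pps_sb,
            "size": pps_size}

root_raw = bytes.fromhex(
    "52006f006f007400200045006e007400"
    "72007900000000000000000000000000"
    + "00" * 32 +
    "16000500ffffffffffffffff03000000"
    "0009020000000000c000000000000046"
    "0000000000000000000000008629f61f"
    "ad57bb0103000000000f000000000000"
)
root = parse_pps(root_raw)
print(root["name"], root["type"], root["sb"], hex(root["size"]))
# Root Entry 5 3 0xf00
```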
3.4 Where Are The Small Data Blocks?
The root entry is an exception to the 0x1000 bytes rule of section 3.3.
The size in the example is 0xf00, so it actually should belong to the small
block depot. In fact the "file" of the root entry always refers to the big
block depot. The start block here is 3. When examining this (copy from above):
00200: fd ff ff ff 05 00 00 00 fe ff ff ff 04 00 00 00
00210: 06 00 00 00 fe ff ff ff 07 00 00 00 08 00 00 00
00220: 09 00 00 00 0a 00 00 00 0b 00 00 00 fe ff ff ff
00230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff *
the result is the chain {3, 4, 6, 7, 8, 9, a, b}. The chain obtained this way
gives the blocks that provide the space for the small blocks. So small
block references refer to positions in the "small block file", which in
this example is built from the big block chain starting with block 3.
small data blocks <=> union of big blocks {sb_list[i]},
sb_list consists of (number of chain_elements(sb_list))
elements. The list is read out from the big block depot,
the list's start is
$root_startblock -> property storage 0 -> pps_sb
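To find the absolute file position of a small block, its position in the "small block file" has to be mapped onto sb_list. The following Python sketch assumes that small blocks are numbered from 0 (unlike the big blocks, which start at -1); this numbering is an assumption and not stated explicitly in this text. The function name is hypothetical.

```python
BIG_BLOCK_SIZE = 0x200
SMALL_BLOCK_SIZE = 0x40

def small_block_offset(sb_list, small_block):
    """Absolute file offset of a small block (numbering from 0 assumed)."""
    pos = small_block * SMALL_BLOCK_SIZE          # position in small block file
    big_block = sb_list[pos // BIG_BLOCK_SIZE]    # big block holding that position
    return (big_block + 1) * BIG_BLOCK_SIZE + pos % BIG_BLOCK_SIZE

sb_list = [3, 4, 6, 7, 8, 9, 0xa, 0xb]            # chain of the example
print(hex(small_block_offset(sb_list, 0)))        # 0x800
print(hex(small_block_offset(sb_list, 8)))        # 0xa00
print(hex(small_block_offset(sb_list, 17)))       # 0xe40
```

Each big block holds eight small blocks, so small block 17 lies in the third element of sb_list (big block 6), at offset 0x40 within that block.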
With this in mind, everything needed to pull a file out of a Laola
archive is available. You could test this with
"lls -s". But there is still more to it.
3.5 Property Sets
Apart from just storing plain files it is possible to store specially
structured "database" files. These structured files are called property sets.
There is a good article about how they are built on Microsoft's web service,
"OLE Property Sets Exposed" by Charlie Kindel.
It claims that accurate information about property sets is provided with
the Win32 SDK, too.
4. Trash Blocks
The last of the four chapters is "trash data blocks". These trash blocks are
blocks stored in the document without being referred to by the Laola system.
There should be no such trash, but sometimes there is. Pretty notorious
for this is the Microsoft Word option "fast saving" (switch this off, if
you haven't yet). Documents stored this way usually consist about half of
garbage. Another example would be Star Writer 3.1, which as a matter of
principle produces 2 big blocks of trash.
TABLES
Table 1: Block -1 (laola header)
offset  type    value               const?  function
00:     stream  $laola_id           !       identifier {d0 cf 11 e0 a1 b1 1a e1}
08:     long    0                   .       ?
0c:     long    0                   .       ?
10:     long    0                   .       ?
14:     long    0                   .       ?
18:     word    3b                  .       ? revision ?
1a:     word    3                   .       ? version ?
1c:     word    -2                  .       ?
1e:     byte    9                   .       ?
1f:     byte    0                   .       ?
20:     long    6                   .       ?
24:     long    0                   .       ?
28:     long    0                   .       ?
2c:     long    $num_of_bbd_blocks  !       Number of big block depot blocks
30:     long    $root_startblock    !       Root chain 1st block
34:     long    0                   .       ?
38:     long    1000                .       ?
3c:     long    $sbd_startblock     !       small block depot 1st block
40:     long    1                   .       ?
44:     long    -2                  .       ?
48:     long    0                   .       ?
4c[]:   long    $bbd_list[i]        !       array of $num_of_bbd_blocks big block numbers
The rest of block -1 should be:
long -1 .
Table 2: Property Storage
offset  type    value             const?  function
00:     stream  $pps_rawname      !       name of the pps
40:     word    $pps_sizeofname   !       size of $pps_rawname
42:     byte    $pps_type         !       type of pps (1=storage|2=stream|5=root)
43:     byte    $pps_uk0          !       ?
44:     long    $pps_prev         !       previous pps
48:     long    $pps_next         !       next pps
4c:     long    $pps_dir          !       directory pps
50:     stream  00 09 02 00       .       ?
54:     long    0                 .       ?
58:     long    c0                .       ?
5c:     stream  00 00 00 46       .       ?
60:     long    0                 .       ?
64:     long    $pps_ts1s         !       timestamp 1 : "seconds"
68:     long    $pps_ts1d         !       timestamp 1 : "days"
6c:     long    $pps_ts2s         !       timestamp 2 : "seconds"
70:     long    $pps_ts2d         !       timestamp 2 : "days"
74:     long    $pps_sb           !       starting block of property
78:     long    $pps_size         !       size of property
7c:     long                      .       ?
Solutions: