.WRI Write File Format

This topic describes the binary file format used by Microsoft Write. A Write binary file contains information about file content, text and pictures (including object-linking-and-embedding, or OLE, objects), and formatting.

(Some stuff seems to be missing, so I've added it. Comments to sean@mess.org please.)

Write-File Header

The Write-file header describes the content of the file. It contains data, pointers to subdivisions of the formatting section, and information about the length of the file. The file header has the following form:

Word    Name      Description
0       wIdent    Must be 0137061 octal (or 0137062 octal if the file contains OLE objects)
1       dty       Must be zero
2       wTool     Must be 0125400 octal
3                 Reserved; must be zero
4                 Reserved; must be zero
5                 Reserved; must be zero
6                 Reserved; must be zero
7-8     fcMac     Number of bytes of actual text plus 128, the bytes in one sector (low-order word first)
9       pnPara    Page number for start of paragraph information
10      pnFntb    Page number of footnote table (FNTB) or pnSep, if none
11      pnSep     Page number of section property (SEP) or pnSetb, if none
12      pnSetb    Page number of section table (SETB) or pnPgtb, if none
13      pnPgtb    Page number of page table (PGTB) or pnFfntb, if none
14      pnFfntb   Page number of font face-name table (FFNTB) or pnMac, if none
15-47   szSsht    Reserved for Microsoft Word compatibility
48      pnMac     Count of pages in whole file (last page number plus 1)

In the preceding list, a "page number" means an offset in 128-byte blocks from the start of the file. For example, if pnPara equals 10, the paragraph information is at offset 10*128 = 1280 in the file.

The starting page number of character information (pnChar) is not stored but is computable, as follows:

    pnChar = (fcMac + 127) / 128

Examining the value of word 48 of the header is a good way to distinguish Write files from Microsoft Word files. If pnMac equals zero, the file originated in Word. Any other value identifies a Write file.
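
The header table above can be read with a few lines of Python. This is only a sketch, not a reference implementation: it assumes the header words are 16-bit little-endian values (the table only says "low-order word first" for fcMac), and the names parse_write_header and page_offset are made up for this example.

```python
import struct

PAGE_SIZE = 128            # one "page" (sector) of the file
IDENT_PLAIN = 0o137061     # wIdent for a file without OLE objects
IDENT_OLE = 0o137062       # wIdent for a file with OLE objects
WTOOL = 0o125400           # required value of wTool

# Header words 9-14, in table order.
PN_FIELDS = ("pnPara", "pnFntb", "pnSep", "pnSetb", "pnPgtb", "pnFfntb")


def parse_write_header(data: bytes) -> dict:
    """Decode header words 0-48 (assumption: little-endian 16-bit words)."""
    if len(data) < 98:
        raise ValueError("need at least 49 words (98 bytes) of header")
    words = struct.unpack_from("<49H", data, 0)
    if words[0] not in (IDENT_PLAIN, IDENT_OLE) or words[2] != WTOOL:
        raise ValueError("wIdent/wTool do not match the Write header")
    fc_mac = words[7] | (words[8] << 16)      # low-order word first
    header = {name: words[9 + i] for i, name in enumerate(PN_FIELDS)}
    header.update(
        has_ole=words[0] == IDENT_OLE,
        fcMac=fc_mac,
        pnChar=(fc_mac + 127) // PAGE_SIZE,   # computed, not stored
        pnMac=words[48],
        from_word=words[48] == 0,             # pnMac == 0 means a Word file
    )
    return header


def page_offset(pn: int) -> int:
    """Convert a page number into a byte offset from the start of the file."""
    return pn * PAGE_SIZE
```

Typical use would be parse_write_header(open(path, "rb").read(128)), followed by page_offset(header["pnPara"]) to locate the paragraph information.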
Text and Pictures

After the header comes information about text and pictures. This information constitutes a separate section of the file.

Text

The text of the Write file starts at word 64 (page 1). Write uses the Windows character set (except for the pictures in the file) as well as the following special characters:

o  ASCII character codes 13, 10 (carriage return, linefeed) for paragraph ends. No other occurrences of these two characters are allowed.
o  ASCII character code 12 for explicit page breaks.
o  ASCII character code 9 (normal) for tab characters.
o  ASCII character code 31 for the soft hyphen.

Other line-break or wordwrap information is not stored.

Pictures

Pictures (including OLE objects) are stored as a sequence of bytes in the text stream. These bytes can be identified as picture information by examining their paragraph formatting. One picture is exactly one paragraph. Paragraphs that are pictures have a special bit set in their paragraph property (PAP) structure. For more information on the PAP structure, see Section 8.3, "Formatting."

(note: The Write that comes with Windows 3.0 uses the picture format below and does not support OLE; the Write that comes with Windows 3.1 always uses OLE, but can read the picture format below. Proof of this is that if you paste a picture into Write 3.1 (and thus it is OLE), you get an extra option in Save As: the possibility to save the file for Write 3.0. If you choose this, Write says that all OLE objects will be removed from the file. Also, I have been unable to paste pictures with colour into Write 3.0; it always seems to convert them to monochrome. As a result, bmPlanes and bmBitsPixel are always 1.)

Each picture consists of a descriptive header followed by the data that makes up the picture. The header for OLE objects is different from the one used for pictures.
The picture header has the following form:

Byte    Name        Description
0-7     mfp         Windows METAFILEPICT structure (hMF member undefined)
8-9     dxaOffset   Offset of picture from left margin, in twips (1/1440 inch)
10-11   dxaSize     Horizontal size, in twips
12-13   dyaSize     Vertical size, in twips
14-15   cbOldSize   Number of following bytes (actual metafile or bitmap bits); set to zero
16-29   bm          Additional information for bitmaps only
30-31   cbHeader    Number of bytes in this header
32-35   cbSize      Number of following bytes (actual metafile or bitmap bits), replacing cbOldSize for new files
36-37   mx          Scaling factor (x)
38-39   my          Scaling factor (y)
40-?    cbHeader    Picture contents, through cbHeader+cbSize-1

The mm member (bytes 0-1) of the METAFILEPICT structure specifies the mapping mode used to draw the picture. The last set of bytes will be bitmap bits if the value of the mm member is 0xE3. This is a special value used only in Write. Otherwise, the bytes will be metafile contents.

If the picture has never been rescaled with the Size Picture command in Write, the scaling factors in each direction will be 1000 (decimal). If the picture has been resized, the scaling factor gives the current size relative to the original size, where 1000 means 100 per cent.

For information about the METAFILEPICT structure and bitmaps, see the Microsoft Windows Guide to Programming and the Microsoft Windows Programmer's Reference, Volumes 1 and 3.

(added note:) The METAFILEPICT structure looks like:

Word    Name    Description
0       mm      0xE3 for bitmap, metafile otherwise
1       xExt    Horizontal size; Word uses this instead of dxaSize
2       yExt    Vertical size; Word uses this instead of dyaSize
3       hMF     Handle to metafile, not used in Write
If the content is a bitmap, the bm member is a BITMAP structure, which looks like:

Byte    Name           Description
0-1     bmType         "BM" for bitmaps, not used in Write
2-3     bmWidth        Width in pixels
4-5     bmHeight       Height in pixels
6-7     bmWidthBytes   Width in bytes, rounded up to a two-byte boundary
8       bmPlanes       Number of bit planes
9       bmBitsPixel    Number of bits per pixel
10-13   bmBits         A void FAR* pointer to the data, not used in Write

If the mm member has value 0x88, the picture is a metafile (.wmf file). The bm member is empty, but the other members have values like normal. Colour wmf files exist. (end of added note)

The descriptive header for OLE objects is similar to the one used for pictures. The OLE object header has the following form:

Byte    Name         Description
0-1     mm           Must be 0xE4
2-5                  Not used
6-7     objectType   Type: 1=static, 2=embedded, 3=link
8-9     dxaOffset    Offset of picture from left margin, in twips (1/1440 inch)
10-11   dxaSize      Horizontal size, in twips
12-13   dyaSize      Vertical size, in twips
14-15                Not used
16-19   dwDataSize   Number of bytes in the object data that follows the header
20-23                Not used
24-27   dwObjNum     Hexadecimal number that, when converted to an 8-digit string, represents the object's unique name
28-29                Not used
30-31   cbHeader     Number of bytes in this header
32-35                Not used
36-37   mx           Scaling factor (x)
38-39   my           Scaling factor (y)
40-?    cbHeader     Object contents, through cbHeader+dwDataSize-1

The scaling factors for OLE objects work the same way as they do with pictures.

(added note:) I couldn't find any information on the OLE objects. There is a libole2, which only works for OLE2 as far as I can see. OLE2 is an entire file-system, while OLE1 (as used here) is only one object. The following is entirely reverse-engineered, and therefore might not be correct.

The OLE object always starts with a DWORD with value 0x501, followed by another DWORD that holds the objectType as above, only with the values reversed: 3 = static, 2 = embedded, 1 = link.
Next comes a DWORD which gives the length of the typename, which is immediately followed by that typename. It is a zero-terminated ASCII string, and the length includes the 0 at the end.

Static OLE Object

Note that a static OLE object isn't really an OLE object; it is simply a picture which is rendered by Write itself. See: http://support.microsoft.com/support/kb/articles/Q88/1/16.ASP

If the objectType is static, the typename has one of the following values:

    DIB
    METAFILEPICT
    BITMAP

As usual, the data following that is not the stuff you would expect. The headers are garbled.

DIB

A DIB (Device Independent Bitmap, a bmp file) usually has the following structure:

    BITMAPFILEHEADER bmfh;
    BITMAPINFOHEADER bmih;
    RGBQUAD          aColors[];
    BYTE             aBitmapBits[];

In the DIB which is stored in Write, the BITMAPFILEHEADER is missing. After the string "DIB" (and the 0 terminator) come the following bytes:

    0xb2 0x18 0x00 0x00 0x29 0xec 0xff 0xff

followed by a DWORD which is the size of the DIB _without_ the BITMAPFILEHEADER. After that the BITMAPINFOHEADER follows. You must fill in the members of the BITMAPFILEHEADER yourself; you can use the ColorsUsed member (biClrUsed) to calculate the OffsetBits member (bfOffBits). (However, I have one instance of a Write file where this member is 0, although it is a 4-bit image. Maybe BitCount (biBitCount) is a better member to use.)

BITMAP

This is the Device Dependent Bitmap (DDB), which is an insane format IMHO as the palette information is not stored. If the image is monochrome, the colours are of course black and white; if it is 4-bit, use the Windows colours; if it is 8-bit, the first 8 and last 8 colours in the palette are Windows colours, but the other colours depend on what the palette contained at that moment. The data is stored in the BITMAP structure just as above (for Write 3.0 images).
After the "BITMAP" string (with the 0 terminator) come the following bytes:

    0xb4 0x18 0x00 0x00 0x28 0xec 0xff 0xff

followed by the size in a DWORD; next comes the BITMAP structure with the bmType and bmBits members undefined, followed by the uncompressed bits.

METAFILEPICT

This is a Windows metafile (wmf). For reasons unknown Write (or Windows?) converts some images to metafiles. I have no idea how this is stored. It seems to be followed by these bytes:

    0x4f 0x03 0x00 0x00 0xb1 0xfc 0xff 0xff

Then comes the size of the metafile in a DWORD; next comes the METAFILEPICT structure (defined above), again with the hMF and mm members undefined. After that the metafile bits follow, but without a header.

Embedded OLE Object

The typename is the name of the executable, without the exe extension. For Paintbrush it is "Pbrush", for example.

The typename is followed by the filename. First there is a DWORD with the length (including the 0 at the end of the string), and then the string itself. If the length is 0, there is no string (so not even a 0 for an empty string).

After that comes a parameter, for example the size of a picture in a string: "0 0 320 240". I don't know what use this has but it's there. Just like with the filename, first there is a DWORD with the length of the string, and then the string itself (if the length is non-zero).

Last comes a DWORD with the offset to the next part of the OLE Object, followed by the data of the file itself. That offset should be enough information on the length of the file, but the data seems to be padded with crap; I have no idea how to acquire the length of the file without looking at the file itself (note that this depends on the type of file).

The data itself is really the file. For example, for Paintbrush this would simply be a .bmp file, so it would start with "BM".
Also note that some files cannot be read; if you use Paint Shop Pro for embedded objects, the file cannot be read into Paint Shop Pro when you extract it manually (so all of this is application specific).

After the file (add the offset to the byte after the DWORD where the offset is stored) comes the next part. Again this works like the whole OLE stream all over again, but with a difference: if the objectType is 0, there is nothing more. If it is 5, it probably means "alternative display," like the Sound Recorder icon if the file was a .wav file.

Link OLE Object

This type is supposed to be the type where the actual data is somewhere else; the filename points to the data of the file. It works very much like the embedded OLE Object type.

Suppose you have a Paintbrush OLE Object, type link. The filename is "C:\WINDOWS\WINLOGO.BMP". The first part is stored as with embedded objects, but after the parameter (which would be "0 0 320 240" in this case), there are 12 bytes of padding and then the next OLE object. This could very well be the actual picture again as an embedded OLE object. However if a link is stored as a link OLE Object, the next OLE object will be the Sound Recorder icon.

Formatting

Write files contain both character and paragraph formatting information. There can be no gaps in either; each must begin with the first text character (byte 128) and continue through the last. The format descriptors (FODs) for the first and last paragraph must, therefore, have the value of fcLim equal to the value of fcMac, as defined in the header section. (note: Write 3.0 sometimes saves an fcLim > fcMac; you have to check for this!)

There is a difference between paragraph and character FODs. A character FOD may describe any number of consecutive characters with the same formatting. However, there must be exactly one paragraph FOD for each text paragraph.
In either case, it is advisable to have multiple FODs point to the same formatting properties (FPROPs) on a given page because it saves space in the file. No FOD may point off its page.

Characters and Paragraphs

Both the character and paragraph sections are structured as a set of pages. Each page contains an array of FODs and a group of FPROPs, both of which are described later in this section. Following is the format of a page:

Byte      Name       Description
0-3       fcFirst    Byte number of first character covered by this page of formatting information; equals 128 for first character in the text (low-order byte first)
4-n       rgfod      Array of FODs
n+1-126   grpfprop   Group of FPROPs
127       cfod       Number of FODs on this page

An FOD is fixed in size. It contains the byte offset to the corresponding FPROP. Following is the structure of an FOD:

Word    Name     Description
0-1     fcLim    Byte number after last character covered by this FOD
2       bfprop   Byte offset from beginning of FOD array to corresponding FPROP for these characters or this paragraph

(note: sometimes bfprop is 0xffff; this seems to mean that the CHP or PAP has the default values.)

An FPROP is variable in size. It contains the prefix for a character property (CHP) or paragraph property (PAP), both of which are described later in this section.
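
As an illustration of the page and FOD layout above, the following Python sketch walks one 128-byte formatting page and returns the text ranges it covers. It is a sketch under assumptions, not a definitive reader: values are taken to be little-endian, each FPROP is read as a length byte (cch) followed by that many property bytes as described in the FPROP structure, and a bfprop of 0xFFFF is treated as "use the defaults" as the note suggests. The function name parse_format_page is hypothetical.

```python
import struct

PAGE_SIZE = 128
FOD_SIZE = 6          # fcLim (2 words) + bfprop (1 word)
FOD_ARRAY_START = 4   # the FOD array follows the 4-byte fcFirst


def parse_format_page(page: bytes) -> list:
    """Return (start, fcLim, fprop_bytes) for every FOD on one page."""
    if len(page) != PAGE_SIZE:
        raise ValueError("a formatting page is exactly 128 bytes")
    (fc_first,) = struct.unpack_from("<I", page, 0)
    cfod = page[127]                      # number of FODs on this page
    runs = []
    start = fc_first
    for i in range(cfod):
        fc_lim, bfprop = struct.unpack_from(
            "<IH", page, FOD_ARRAY_START + FOD_SIZE * i)
        if bfprop == 0xFFFF:
            fprop = b""                   # assumption: default CHP/PAP
        else:
            # bfprop counts from the beginning of the FOD array.
            off = FOD_ARRAY_START + bfprop
            cch = page[off]               # bytes in this FPROP
            fprop = page[off + 1:off + 1 + cch]
        runs.append((start, fc_lim, fprop))
        start = fc_lim                    # the next run begins here
    return runs
```

Each returned tuple covers the bytes from start up to (but not including) fcLim; the property bytes are a prefix of a CHP or a PAP depending on which section the page came from.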
Following is the structure of an FPROP:

Byte    Name       Description
0       cch        Number of bytes in this FPROP
1-n     rgchProp   Prefix for a CHP (for characters) or a PAP (for paragraphs) sufficient to include all bits that differ from the default CHP or PAP

Following is the format of a CHP:

Byte    Bit    Name       Description
0                         Reserved; ignored by Write
1       0      fBold      Bold characters
        1      fItalic    Italic characters
        2-7    ftc        Font code (low bits); index into the FFNTB
2              hps        Size of font, in half points (standard is 24)
3       0      fUline     Underlined characters
        1      fStrike    Reserved; ignored by Write
        2      fDline     Reserved; ignored by Write
        3      fOverset   Reserved; ignored by Write
        4-5    csm        Reserved; ignored by Write
        6      fSpecial   Set for "(page)" only
        7                 Reserved; ignored by Write
4       0-2    ftcXtra    Font code (high-order bits, concatenated with ftc)
        3      fOutline   Reserved; ignored by Write
        4      fShadow    Reserved; ignored by Write
        5-7               Reserved; ignored by Write
5              hpsPos     Position: 0=normal, 1-127=superscript, 128-255=subscript

If the user doesn't select any special character properties, the CHP is filled with the following default values:

Byte    Value
0       1
2       24
3-5     0

Each character FPROP must, therefore, have a count of characters (cch) greater than or equal to 1.

Each PAP can contain up to 14 tab descriptors (TBDs), which are described later in this section.
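
The CHP layout and its defaults can be combined as in the following Python sketch, which overlays an FPROP prefix on the default CHP and picks out the fields Write actually uses. Several things here are assumptions rather than facts from the tables: bit 0 is taken to be the least significant bit of each byte, byte 1 of the default CHP (not listed in the defaults table) is taken to be 0, and the name decode_chp is made up for the example.

```python
# Default CHP: byte 0 = 1, byte 2 = 24, bytes 3-5 = 0.
# Byte 1 is not listed in the defaults table; 0 is an assumption.
CHP_DEFAULT = bytes([1, 0, 24, 0, 0, 0])


def decode_chp(prefix: bytes) -> dict:
    """Overlay an FPROP prefix on the default CHP and decode it."""
    chp = bytearray(CHP_DEFAULT)
    n = min(len(prefix), len(chp))
    chp[:n] = prefix[:n]          # bytes past the prefix keep defaults
    ftc = chp[1] >> 2             # bits 2-7 of byte 1 (low font bits)
    ftc_xtra = chp[4] & 0x07      # bits 0-2 of byte 4 (high font bits)
    return {
        "bold": bool(chp[1] & 0x01),
        "italic": bool(chp[1] & 0x02),
        "font_code": ftc | (ftc_xtra << 6),   # index into the FFNTB
        "size_pt": chp[2] / 2,                # hps is in half points
        "underline": bool(chp[3] & 0x01),
        "page_number": bool(chp[3] & 0x40),   # fSpecial, "(page)" only
        "position": chp[5],                   # hpsPos, raw value
    }
```

An empty prefix yields the default formatting (12 point, font code 0, no attributes), which is also a reasonable way to handle a bfprop of 0xFFFF if that value really means "defaults".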
Following is the structure of a PAP:

Byte    Bit    Name        Description
0                          Reserved; must be zero
1       0-1    jc          Justification: 0=left, 1=center, 2=right, 3=both
        2-7                Reserved; must be zero
2                          Reserved; must be zero
3                          Reserved; must be zero
4-5            dxaRight    Right indent, in 20ths of a point
6-7            dxaLeft     Left indent, in 20ths of a point
8-9            dxaLeft1    First-line left indent (relative to dxaLeft)
10-11          dyaLine     Interline spacing (standard is 240)
12-13          dyaBefore   Reserved; ignored by Write (standard is zero)
14-15          dyaAfter    Reserved; ignored by Write (standard is zero)
16      0      rhcPage     0=header, 1=footer
        1-2                Reserved; 0=normal paragraph, nonzero=header or footer paragraph
        3      rhcFirst    Start of printing: 1=print on first page, 0=do not print on first page
        4      fGraphics   Paragraph type: 1=picture, 0=text
        5-7                Reserved; must be zero
17-21                      Reserved; must be zero
22-78                      Tab descriptors (up to 14)

Following is the format of a TBD:

Byte    Bit    Name      Description
0-1            dxa       Indent from left margin of tab stop, in 20ths of a point
2       0-2    jcTab     Tab type: 0=normal tabs, 3=decimal tabs
        3-5    tlc       Reserved; ignored by Write
        6-7              Reserved; must be zero
3              chAlign   Reserved; ignored by Write

If the user doesn't select any special paragraph properties, the PAP is filled with the following default values:

Byte    Value
0       61
2       30
10-11   240 (word)
12-78   0

Each paragraph FPROP must have a count of characters (cch) greater than or equal to 1.

Footnotes

Write documents do not have footnote tables (FNTBs), so pnFntb is always equal to pnSep. In fact, all their header and footer paragraphs appear at the beginning of the document before any normal paragraphs. When reading files created by Word, Write recognizes only those headers and footers that appear at the beginning of the document; it treats all others as normal text.

Sections

A Write document has only one section. If the section properties of a Write document differ from the defaults, the document contains a section property (SEP) section and a section table (SETB) section.
If not, then neither section is present and pnSep and pnSetb are both equal to pnPgtb.

Following is the format of an SEP:

Byte    Name      Description
0       cch       Count of bytes used, excluding this byte (all properties at byte positions greater than cch are set to their default values)
1-2               Reserved; must be zero
3-4     yaMac     Page length, in 20ths of a point (default is 11*1440=15840)
5-6     xaMac     Page width, in 20ths of a point (default is 8.5*1440=12240)
7-8               Reserved; must be 0xFFFF
9-10    yaTop     Top margin, in 20ths of a point (default is 1440)
11-12   dyaText   Height of text, in 20ths of a point (default is 9*1440=12960)
13-14   xaLeft    Left margin, in 20ths of a point (default is 1.25*1440=1800)
15-16   dxaText   Width of text area, in 20ths of a point (default is 6*1440=8640)

(added note: this table is incomplete)

Byte    Name       Description
1-2                Start page numbers at # if not 0xFFFF
19-20   yaHeader   Distance from top to header (default is 0.75*1440=1080)
21-22   yaFooter   Distance from top to footer (default is yaMac-0.75*1440=14760)

(end of added note)

The page length (yaMac) is equal to yaTop+dyaText+(bottom margin, not stored); with the defaults above, 1440+12960 leaves 1440 for the bottom margin. The page width (xaMac) is equal to xaLeft+dxaText+(right margin, not stored).

If all the above properties are set to their defaults, no SEP or SETB is needed. Otherwise, the count of characters (cch) is greater than or equal to 1 and less than or equal to 16.

The SETB section contains an array of section descriptors (SEDs), described later in this section. Following is the structure of an SETB:

Word    Name      Description
0       csed      Number of sections (always 2 for Write documents)
1       csedMax   Undefined
2-n     rgsed     Array of SEDs plus zero-padding to fill the sector

Following is the structure of an SED:

Word    Name    Description
0-1     cp      Byte address of first character following section
2       fn      Undefined
3-4     fcSep   Byte address of associated SEP

A Write document always has exactly two SED entries. The cp value of the first entry indicates that it affects all the characters in the document.
The fcSep value of the first entry points to the one SEP in the file. The second SED entry is a dummy with fcSep set to 0xFFFFFFFF.

The PGTB section (optional) is on the page immediately after the SEP section. (added note: AFAICS these are not used in Write.)

Note: The term "page" used in the rest of this section refers to printed pages of a Write document, not 128-byte "pages" of a disk file.

The page table (PGTB) contains an array of page descriptors (PGDs), which are described later in this section. Following is the structure of a PGTB:

Word    Name      Description
0       cpgd      Number of PGDs (1 or more)
1       cpgdMac   Undefined
2-n     rgpgd     Array of PGDs plus zero padding to fill the sector

Following is the structure of a PGD:

Word    Name    Description
0       pgn     Page number in printed Word documents
1-2     cpMin   Byte address of first character on printed page

Font Table

The font face-name table (FFNTB) contains the number of font face names (FFNs) and a list of FFNs. Following is the structure of an FFNTB:

Byte    Name     Description
0-1     cffn     Number of FFNs
2-n     grpffn   List of FFNs

Following is the structure of an FFN:

Byte          Name    Description
0-1           cbFfn   Number of bytes following in this FFN (not including these 2 bytes)
2             ffid    Font family identifier (see below)
3-(cbFfn+2)   szFfn   Font name (variable length; null-terminated)

A cbFfn value of 0xFFFF means that the next FFN entry will be found at the start of the next 128-byte page. A cbFfn value of zero means that there are no more FFN entries in the table.

Possible values for ffid are FF_DONTCARE, FF_ROMAN, FF_SWISS, FF_MODERN, FF_SCRIPT, and FF_DECORATIVE. These constants are defined in WINDOWS.H. Additional values may be added to the list in future versions of Windows.

(added note) These are the definitions taken from WINDOWS.H:

    #define FF_DONTCARE    0x00  /* Don't care or don't know. */
    #define FF_ROMAN       0x10  /* Variable stroke width, serifed. */
    #define FF_SWISS       0x20  /* Variable stroke width, sans-serifed. */
    #define FF_MODERN      0x30  /* Constant stroke width, serifed or sans-serifed. */
    #define FF_SCRIPT      0x40  /* Cursive, etc. */
    #define FF_DECORATIVE  0x50  /* Old English, etc. */
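
To show how the FFNTB rules fit together (the cffn count, the 0xFFFF "continue on next page" marker and the zero terminator), here is a small Python sketch. It is not a definitive reader. It assumes little-endian words, that data starts at the first byte of the table (file offset pnFfntb*128, so page boundaries line up), that an entry on a continuation page begins directly at the page start, and that font names are in the Windows code page 1252. The names parse_ffntb and FAMILY_NAMES are hypothetical.

```python
import struct

# Family identifiers as defined in WINDOWS.H.
FAMILY_NAMES = {
    0x00: "FF_DONTCARE", 0x10: "FF_ROMAN", 0x20: "FF_SWISS",
    0x30: "FF_MODERN", 0x40: "FF_SCRIPT", 0x50: "FF_DECORATIVE",
}


def parse_ffntb(data: bytes, page_size: int = 128) -> list:
    """Return (ffid, family, name) for each FFN in the table."""
    (cffn,) = struct.unpack_from("<H", data, 0)
    fonts = []
    pos = 2                                    # grpffn starts after cffn
    while len(fonts) < cffn and pos + 2 <= len(data):
        (cb_ffn,) = struct.unpack_from("<H", data, pos)
        if cb_ffn == 0:                        # no more entries
            break
        if cb_ffn == 0xFFFF:                   # continue on next page
            pos = (pos // page_size + 1) * page_size
            continue
        ffid = data[pos + 2]
        raw = data[pos + 3:pos + 2 + cb_ffn]   # name incl. terminator
        name = raw.split(b"\0", 1)[0].decode("cp1252", "replace")
        fonts.append((ffid, FAMILY_NAMES.get(ffid, "unknown"), name))
        pos += 2 + cb_ffn                      # skip count + entry
    return fonts
```

The position of an entry in the returned list is the font code that the ftc and ftcXtra bits of a CHP refer to.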