/*
This is chapter 16 from The File Formats Handbook, Copyright © 1995 Günter Born. It is copied from the Dr. Dobb's Essential Books on File Format CD-ROM. The pictures are the same bad quality as on the CD. Converted to valid html 3.2 by Sean Young.

The actual text is exported and the pictures grabbed by screen captures. The reference to the non-existing figure 16.2 and two figures with number 16.3 is [sic].
*/

CHAPTER 16: MS-WORD FORMAT

MS-Word for DOS was one of the most popular word processing programs. This chapter describes the file format for Word 4.0/5.0 files.

In versions 3.0, 4.0 and 5.0 of Word from Microsoft, a mixed ASCII/binary format is used for the text files. These files are divided into three parts (header, text area and format area):

Figure 16.1: Structure of an MS-Word file

The data is stored in the file in 128-byte blocks. Some internal pointers take the form of block numbers from which the offset to the first byte of the relevant block can be calculated, using the following formula:

Offset = Block number * 80H
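
The conversion between block numbers and file offsets can be sketched in a few lines of Python. The function names are chosen for this illustration only and do not come from Word.

```python
BLOCK_SIZE = 0x80  # every block is 128 bytes long


def block_to_offset(block_number: int) -> int:
    """Return the file offset of the first byte of a block."""
    return block_number * BLOCK_SIZE


def offset_to_block(offset: int) -> int:
    """Return the number of the block that contains a file offset."""
    return offset // BLOCK_SIZE
```

For example, block 3 starts at offset 180H, and any offset between 100H and 17FH lies in block 2.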

Figure 16.2 shows a sample text produced using Word:

Figure 16.2: Text samples using MS-Word 4.0

Soft hyphenation     Soft hy-phen-ation             Alt -
Hyphenation 2        Hyphen-ation                   Ctrl -
Paragraph block text
Bold on              Bold text                      Alt B
Normal on            Normal text                    Alt Spacebar
Italics on           Italic text                    Alt I
Underline on         Underlined text                Alt U
Underline double     Double-underlined text         Alt D
Small Capitals       Small Capitals                 Alt K
Strike out           Strike out text                Alt D
Superscript          Superscript text               Alt H
Subscript            Subscript text                 Alt T
Hidden               Hidden text                    Alt G
Centered                 Centered text              Alt Z
Left aligned         Left-aligned text              Alt L
Right aligned                 Right-aligned text    Alt R
Indent left 1.5 cm     Indented text                Alt M
Indent right         Indented text                  Alt V
Indent variable                                     Alt O
Standard paragraph
Indent for 1st line negative; all other lines of a
   paragraph are indented left.
Double line spacing                                 Alt 2
Capital              CAPITALS                       Alt +

If a block is not completely filled with data, the remainder of the block is undefined. The first block (block 0) of an MS-Word file contains a 128-byte header defining certain items of control information. This is followed by the blocks containing the actual text. With very few exceptions, these do not contain control codes. If there is no text, these blocks are omitted. The trailer consists of a number of blocks containing the format information for the text. A hex-dump of the corresponding Word file is shown in Figure 16.2.

Figure 16.3: Hex-dump of an MS-Word 4.0 file

Figure 16.3: Hex-dump of an MS-Word 4.0 file

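Various types of pointer are used in the file: 4-byte file pointers, which are interpreted as offsets from the start of the file; 2-byte block pointers, which are interpreted as block numbers; and 2-byte format pointers, which are interpreted as offsets within a format block.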

The structure of the three sections of the Word file (header, text, formats) is described below.

16.1 Word headers (versions 3.0, 4.0, 5.0)
16.2 The Word text area
16.3 Format area in Word
16.4 Winword file format (1.0-6.0)

16.1 Word headers (versions 3.0, 4.0, 5.0)

As shown in Figure 16.3, Word contains the header information in the first 128 bytes (block 0). The first 4 bytes of the file always contain the hex codes 31H BEH 00H 00H. It is assumed that Word uses these bytes as a signature for formatted files. Table 16.1 shows a detailed breakdown of the header.

Table 16.1: Format of a Word 4.0/5.0 header


Offset     Bytes     Field description
----------------------------------------------------------
00H        4         Word signature 31H BEH 00H 00H
04H       10         Reserved (00H ABH, followed by
                     00H bytes up to offset 0DH)
0EH        4         Pointer to End-of-text
                     (1st character after text)
12H        2         Block pointer to the block containing
                     the paragraph format
14H        2         Block pointer to the block containing
                     the footnote table
16H        2         Block pointer to the block containing
                     the section formats
18H        2         Block pointer to the block containing
                     the section table
1AH        2         Block pointer to the block containing
                     the table of page breaks
1CH        2         Block pointer to the block containing
                     file manager information
                     (author, date, and so on)
1EH       66         File name of print format, ASCII string
60H        2         Flag (reserved for Windows Write)
62H        8         Name of the printer driver,
                     ASCII string
6AH        2         Number of blocks used in the file
6CH        2         Bit field for corrected text areas
6EH       18         Reserved (in version 4.0 always 00H);
                     after version 5.0 used for unknown
                     code

The Bytes column gives the field lengths in decimal. At offset 0EH, there is a 4-byte pointer (file pointer) to the first unused character after the text. The value is interpreted as an offset from the start of the file to the relevant byte. The number of characters in the text can be calculated by subtracting 80H (the length of block 0) from this value. The following 6 words contain 2-byte pointers which are interpreted as block numbers. Information on the format of the text is contained in the specified blocks. A file pointer to the first byte of a block can be calculated by multiplying the block number by 80H. (The structure of the format blocks is described below.)
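
As an illustration, the header fields of Table 16.1 can be read with Python's struct module. This is only a sketch: the dictionary keys are names invented for this example, and little-endian byte order is assumed because MS-DOS stores the lowest byte first.

```python
import struct

SIGNATURE = bytes([0x31, 0xBE, 0x00, 0x00])
# Hypothetical names for the six block pointers at offsets 12H-1CH.
BLOCK_POINTER_NAMES = (
    "paragraph_block", "footnote_block", "section_format_block",
    "section_table_block", "page_break_block", "file_info_block",
)


def _asciiz(raw: bytes) -> str:
    """Decode an ASCIIZ field, stopping at the first 00H byte."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


def parse_word_header(block0: bytes) -> dict:
    """Decode block 0 of a Word 4.0/5.0 file into a dictionary."""
    if len(block0) < 0x80:
        raise ValueError("block 0 must contain 128 bytes")
    if block0[:4] != SIGNATURE:
        raise ValueError("Word signature 31H BEH 00H 00H not found")
    (text_end,) = struct.unpack_from("<I", block0, 0x0E)
    header = {"text_end": text_end, "text_length": text_end - 0x80}
    pointers = struct.unpack_from("<6H", block0, 0x12)
    header.update(zip(BLOCK_POINTER_NAMES, pointers))
    header["print_format_file"] = _asciiz(block0[0x1E:0x60])
    header["printer_driver"] = _asciiz(block0[0x62:0x6A])
    header["blocks_used"], header["correction_bits"] = struct.unpack_from(
        "<2H", block0, 0x6A)
    return header
```

A caller would typically read the first 128 bytes of the file and pass them to parse_word_header; multiplying any of the block pointers by 80H then gives the file offset of the corresponding region.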

The path, including the drive and file name, for the print format template is stored at offset 1EH. This text is an ASCIIZ string, that is, the last character is 00H. In MS-DOS, the path is limited to 65 characters, which explains why 66 characters are reserved in the header. Unused bytes must be set to 00H. At offset 62H, there is an 8-byte field containing the name of the printer driver. If the name is shorter than 8 characters, the remaining bytes must be set to 00H. The drive, the path and the driver extension are not specified.

The field at offset 6AH indicates the number of blocks containing useful information. A Word file may contain additional blocks, but these are usually filled with null bytes and are ignored.

The word at offset 6CH is interpreted as a 16-bit field. It is used to store the format coding for corrected areas of text and generally contains the value 00H 00H, but when a text is modified using the command FORMAT/correction, Word stores the selected settings in the individual bits shown in Table 16.2.

Up to version 4.0, bits 6 to 15 are unused. From Word 5.0 onwards, bits 6 and 7 are used, but their exact meaning is not known. The remaining 18 bytes in the header are reserved and contain the value 00H in version 4.0. From version 5.0, a number of pointers are found in this position, but the significance of these is unknown.

Table 16.2: Coding for FORMAT/ Corrections in Word 4.0


Bit      Field description
----------------------------------------------------
0        Format bar: 1 = Yes, 0 = No
3-1      Inserted text
         000 = Underline
         001 = Large capitals
         010 = Normal
         011 = Bold
         100 = --
         101 = --
         110 = Underline double
         111 = --
5-4      Position of correction bar
         00 = No bar
         01 = Left
         10 = Right
         11 = Alternate (left, right)
15-6     Reserved (until version 4.0, unknown in 5.0)
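
The bit field of Table 16.2 can be unpacked as in the following sketch. The function and key names are assumptions made for this example; only the bit positions and meanings are taken from the table.

```python
INSERTED_TEXT = {
    0b000: "underline",
    0b001: "large capitals",
    0b010: "normal",
    0b011: "bold",
    0b110: "underline double",
}
BAR_POSITION = ("no bar", "left", "right", "alternate")


def decode_correction_field(value: int) -> dict:
    """Split the 16-bit word at header offset 6CH into its fields."""
    return {
        "format_bar": bool(value & 0x01),                          # bit 0
        "inserted_text": INSERTED_TEXT.get((value >> 1) & 0x07,
                                           "undefined"),           # bits 3-1
        "bar_position": BAR_POSITION[(value >> 4) & 0x03],         # bits 5-4
        "reserved": value >> 6,                                    # bits 15-6
    }
```

The value 0017H, for instance, decodes to a format bar, bold inserted text and a correction bar on the left.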

16.2 The Word text area

In versions 3.0 to 5.0 of Word, the text is stored in ASCII format, and its first byte is located at offset 80H. This text may extend over several blocks. In the last text block, the area from the last valid text character to the end of the block is undefined. The end of the text is indicated in the header (offset 0EH). If Word stores a blank text window, the text block is omitted, and the format information immediately follows the header.

The text contains only a small number of control characters. Table 16.3 lists some of these codes.

Codes 1-5 are used to mark text blocks created by Word. A footnote is only marked in the text if the user does not specify a footnote marker. If a footnote is allocated an automatic administration number, Word will store this footnote as normal text, and information on formatting the footnotes is stored in a separate block in the trailer.

The characters CR/LF (carriage return/line feed) indicate the end of a paragraph in Word. It is therefore possible to import ASCII files into Word, because many editors place a CR/LF after every line. However, all Word paragraph commands will then be applied to individual lines, because Word will interpret them as paragraphs. It may therefore be necessary to remove the CR/LF characters at the end of individual lines.

Word uses the ASCII code 31 (1FH) to mark possible hyphenation points. The value 196 (C4H) marks a protected hyphen, and the value 255 (FFH) marks a protected space, that is, a space between two words that must not be separated at the end of a line.

Table 16.3: Interpretation of control codes in Word 4.0/5.0


Code       Field description
---------------------------------------------
01H        Text block page
02H        Text block print date
03H        Text block print time
04H        Reserved
05H        Footnote without a footnote marker
09H        Tabulator
0BH        Line feed
0CH        Form feed
0DH,0AH    CR/LF as paragraph end
1FH        Hyphenation conditional
C4H        Hyphenation protected
FFH        Space protected
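
The following sketch shows one way of turning the text area into a plain string using the codes of Table 16.3. The placeholder strings for codes 01H to 05H are invented for this example, and decoding the remaining bytes with code page 437 is an assumption (the chapter only speaks of ASCII).

```python
TEXT_START = 0x80  # the text area begins directly after block 0

# Replacement strings for the control codes of Table 16.3.
REPLACEMENTS = {
    0x01: "<page>", 0x02: "<date>", 0x03: "<time>",
    0x05: "<footnote>",
    0x0B: "\n",   # line feed within a paragraph
    0x0C: "\f",   # form feed
    0x1F: "",     # conditional hyphenation point: drop it
    0xC4: "-",    # protected hyphen
    0xFF: " ",    # protected space
}


def extract_text(data: bytes, text_end: int) -> str:
    """Return the text between offset 80H and the end-of-text pointer."""
    raw = data[TEXT_START:text_end].replace(b"\r\n", b"\n")
    return "".join(
        REPLACEMENTS.get(b, bytes([b]).decode("cp437")) for b in raw
    )
```

Here data is the complete file content and text_end is the pointer stored at header offset 0EH. Paragraph ends (CR/LF) become single newline characters.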

16.3 Format area in Word

The last text block is followed by an area in which Word stores text formatting information. A number of distinct regions can be distinguished, as shown in Figure 16.4. Each of these regions may extend over several 128-byte blocks. The numbers of the first block in each region except the first are stored in the header, starting at offset 12H.

Figure 16.4: Pointers to format regions

16.3.1 Character formats
16.3.2 Paragraph format block
16.3.3 Format of the footnote block
16.3.4 Format of the section table block
16.3.5 Format of the section format block
16.3.6 Format of a page-break block
16.3.7 File manager information block

16.3.1 Character formats

The first block after the text contains character formats. Word does not define a specific pointer to this block in the header, because its position can be determined by means of the text pointer at offset 0EH. If there is no text area, the description of character formats begins in block 1. Word uses a very sophisticated technique for storing the character formats. The number of possible combinations for formatting a line (bold, italic, and so on) is predetermined from the start and Word stores these format specifications in a table. Then all that is required is to note how the individual sections of text are to be formatted, as shown in Figure 16.5.
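
A short sketch of how the position of the first character format block might be derived from the text pointer. It assumes that the text pointer is simply rounded up to the next block boundary, which is consistent with the statement that the formats begin in block 1 when there is no text; the function names are hypothetical.

```python
BLOCK_SIZE = 0x80


def first_character_format_block(text_end: int) -> int:
    """Round the end-of-text pointer (header offset 0EH) up to a block."""
    return (text_end + BLOCK_SIZE - 1) // BLOCK_SIZE


def character_format_blocks(text_end: int, paragraph_block: int) -> range:
    """Blocks from the end of the text up to the paragraph format block.

    paragraph_block is the block pointer stored at header offset 12H.
    """
    return range(first_character_format_block(text_end), paragraph_block)
```

With a text pointer of 80H (no text) the result is block 1. A text pointer of 90H gives block 2, because the text occupies part of block 1.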

The text shown in Figure 16.5 is to be given the character formats normal, bold and italic as shown in the format table. Whenever bold appears in the text, the program merely refers to the appropriate entry in this table. A pointer marks the start of the bold text. The next format specification cancels the bold. This process is used in Versions 3.0, 4.0 and 5.0. The details shown below relate to Word 4.0 but, to a great extent, they also apply to version 5.0.

The block containing the character formats is structured as shown in Table 16.4. The first four bytes contain an offset pointer to the first character in the text to which the format applies. Since this character is always located in block 1, the pointer has the value 00000080H; because MS-DOS stores the lowest byte first, it appears in the file as 80H 00H 00H 00H. At offset 04H, there is a pointer table containing two pointers for each format area: a 4-byte text pointer and a 2-byte format pointer.

The 4-byte text pointer specifies the offset address of the first character to which the format indicated by the second pointer no longer applies. This text pointer also acts as a start pointer for the new format specification. The next word in the data structure is the pointer to the format definition in the format table at the end of the block. This value is interpreted as an offset from the first text pointer (offset 04H) to the format entry in the format table (Figure 16.6).

Figure 16.5: Text formats

Table 16.4: Structure of a character format block


Offset   Bytes     Field description
----------------------------------------------------------
00H      4         Pointer to the 1st character in the
                   1st format
Beginning of table containing text and format pointers:
04H      4         Pointer to 1st char in 2nd format
08H      2         Pointer to format table for 1st format
0AH      4         Pointer to 1st char in different format
                   than 2nd format
0EH      2         Pointer to format table for 2nd format
...      ..        ....
Beginning of format table
...      ...       ....
7EH      ...       Last format entry
7FH      1         Number of text areas to be formatted

Since the number of sections to be formatted varies during word processing, Word begins structuring the format table from the end of the block (that is, the entry that ends at offset 7EH is the first format in the table). Word stores each new format definition before the previous entry. The number of text areas to be formatted, and thus also the number of valid text pointers (excluding the start pointer), is stored at the end of the block (offset 7FH). The structure of the format table is described in more detail below.

Figure 16.6: Position of the table containing the format descriptions

With longer texts, the number of text and format pointers may exceed the space available in the pointer table, which will cause the table to overflow. Word then creates a new block for character formats and stores information on the existence of this additional block in the last text pointer -- if the value of this pointer is the same as the start address of the next block, an additional block is involved. As soon as an additional block is required, Word copies the contents of the current (last) block into memory and sets the number of entries (last byte) to zero. The start pointer is set to the value of the last valid text pointer in the preceding block. Thus the copy contains all the information from the preceding block, and Word fills up the new table with text and format pointers as required.

In the pointer table, a 2-byte format pointer is allocated to every 4-byte text pointer. This indicates the offset from the first text pointer to the relevant format definition at the end of the block. If 4 is added to this value, the result is the offset from the start of the block. If the format pointer contains FFFFH, the text is to be displayed in standard format. For example, the format pointer after the last valid text pointer may contain this value in order to switch back to standard format. If the formatted text exceeds a block, the value FFFFH is stored in the following block. Table 16.5 shows the structure of each entry in the format table.

Table 16.5: Structure of a format table entry


Offset     Bytes     Field description
-------------------------------------------------------
00H        1         Number of following bytes for
                     this entry
01H        1         Coding print template:
                     Bit 0 = 1: char formatted with
                     a template,
                     Bits 1-7 define the modes
                     (see Table 16.6)
02H        1         Format code (see Figure 16.8)
03H        1         Font size in 1/2 points
04H        1         Character attribute (see Table 16.7)
05H        1         Reserved
06H        1         Character position
                     (Superscript, subscript, and so on)
07H-0AH    4         Reserved

A format generally consists of several bytes. The first byte indicates the number of following bytes in the definition. The minimum length of a format definition is 2 bytes (1 length byte, 1 format byte). However, if only one of the later bytes (for example, the character position) is required, all the intervening fields must also be stored, even though they are not used.
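
The layout of a format block (start pointer, pointer table, format table, counter at offset 7FH) can be walked through as in the sketch below. It is an illustration under the rules given in this section, not a tested reader for real files; the function name and the shape of the result are invented for the example.

```python
import struct


def parse_format_block(block: bytes):
    """Return (start_pointer, runs) for one 128-byte format block.

    Each run is a tuple (end_pointer, entry): the format applies up to,
    but not including, the character at end_pointer. entry holds the raw
    bytes of the format table entry (length byte included), or None when
    the format pointer is FFFFH (standard format).
    """
    if len(block) != 0x80:
        raise ValueError("a format block is 128 bytes long")
    (start,) = struct.unpack_from("<I", block, 0x00)
    count = block[0x7F]                      # number of valid text pointers
    runs = []
    for i in range(count):
        end, fmt = struct.unpack_from("<IH", block, 0x04 + 6 * i)
        if fmt == 0xFFFF:
            entry = None
        else:
            pos = fmt + 4                    # offset from the start of the block
            length = block[pos]              # number of following bytes
            entry = block[pos:pos + 1 + length]
        runs.append((end, entry))
    return start, runs
```

The same routine applies to paragraph format blocks, which share this layout; only the interpretation of the entry bytes differs.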

The second byte of the character format specifies the appropriate variant of the print format template, which describes how the text characters are to be formatted. Figure 16.7 shows the coding of the second byte.

Figure 16.7: Definition of a (format) template

If the lowest bit (bit 0) is set, the remaining bits will contain the variant of the print format template required. Table 16.6 shows some of the templates given in the Word manual:

Table 16.6: Various (format) templates


Code     Field description
------------------------------
0        Standard character
1-12     Template number 1-12
13       Footnote reference
14-18    Template number 13-17
19       Number of pages
20-27    Template number 18-25
28       Short information
29       Line numbers
30-64    Unused

Additional information on this subject can be found in the standard Word documentation. Word stores information on the format structure (bold, italic, font number) in the third byte (if present). The coding is shown in Figure 16.8.

Figure 16.8: Font format coding

Bits 0 and 1 determine the typeface style (bold, italic), while the remaining bits are used for the font number. The allocation of font and font number depends on the printer driver.

The fourth byte specifies the font size in 1/2 points. The remaining character attributes are stored in the fifth byte. The coding is shown in Table 16.7.

So far, the byte at offset 05H has remained reserved. The same applies to the bytes at offsets 07H-0AH. The byte at offset 06H indicates whether a character is to be raised (superscript) or lowered (subscript).

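If the byte at offset 06H (the seventh byte of the entry) is not equal to 0, its bit 7 determines whether the character is raised or lowered: the values 01H-7FH denote superscript characters, and the values 80H-FFH denote subscript characters.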

Table 16.7: Character format attributes


Bits     Field description
---------------------------------------------------
0        1 = Underline
1        1 = Strike out
2        1 = Strike out double
3        1 = Insert character in correction mode
4-5      Character size
         00: Normal
         01: Large capitals
         10: --
         11: Capitals
6        Special characters (page, date, and so on)
7        Characters hidden

Table 16.7 (continued): Character position


Byte 7     Description
---------------------------------
00H        Character normal
01-7FH     Superscript characters
80-FFH     Subscript characters
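
A character format entry can be decoded along the lines of Tables 16.5 to 16.7, as in the following sketch. Two points are assumptions: fields that are not stored are treated as 0, and the two style bits of the format code are returned undecoded, because their assignment to bold and italic is given only in Figure 16.8. All key names are invented for this example.

```python
CASE_NAMES = ("normal", "large capitals", "undefined", "capitals")


def decode_character_format(entry: bytes) -> dict:
    """Decode one format table entry (length byte included)."""
    length = entry[0]
    # Fields beyond the stored length are treated as 0 (an assumption).
    body = entry[1:1 + length].ljust(6, b"\x00")
    template, code, size, attr, _reserved, position = body[:6]
    if position == 0:
        placement = "normal"
    elif position < 0x80:
        placement = "superscript"
    else:
        placement = "subscript"
    return {
        "uses_template": bool(template & 0x01),
        "template": template >> 1,            # see Table 16.6
        "style_bits": code & 0x03,            # bold/italic, see Figure 16.8
        "font_number": code >> 2,
        "size_points": size / 2,              # stored in 1/2 points
        "underline": bool(attr & 0x01),
        "strike_out": bool(attr & 0x02),
        "strike_out_double": bool(attr & 0x04),
        "inserted_in_correction_mode": bool(attr & 0x08),
        "case": CASE_NAMES[(attr >> 4) & 0x03],
        "special_character": bool(attr & 0x40),
        "hidden": bool(attr & 0x80),
        "position": placement,
    }
```

An entry consisting of the bytes 06H 00H 01H 18H 81H 00H 01H, for example, describes 12-point text that is underlined, hidden and raised.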

16.3.2 Paragraph format block

In the header, at offset 12H, there is a pointer to the block containing the paragraph formatting details. The structure of this block is the same as that of the character format block. The start pointer (4 bytes) specifies the first character of the first paragraph, which is generally the start of the text. The pointer table then begins with a text pointer (4 bytes) and the offset (2 bytes) indicating the relevant format information. The text pointer points to the next paragraph; the format information offset relates to the first text pointer (the underlying structure is shown in Table 16.4). The last byte in the block indicates the number of valid entries (text pointers). If the last valid text pointer is the same as the start address of the next block, another block containing additional paragraph formats follows.

However, the structure defining the paragraph formats is somewhat different from the character format structure. The number of following bytes is stored in the first byte. Table 16.8 gives the structure of a paragraph format definition.

Table 16.8: Paragraph format in Word 4.0/5.0


Offset     Bytes     Field description
------------------------------------------------------------
00H        1         Number of following bytes for
                     this entry
01H        1         Coding format template:
                     Bit 0 = 1:
                     Format template is used
                     to format this paragraph
                     Bits 1-7 define the template
                     number (see Table 16.9)
02H        1         Paragraph attribute (see Table 16.10)
03H        1         Number of standard paragraph format
                     (usually code 30, see Table 16.9)
04H        1         Heading level and representation
                     (see Figure 16.9)
05H        2         Right indent in 1/20 point
07H        2         Left indent in 1/20 point
09H        2         Left indent of first line in 1/20 point
0BH        2         Line spacing in 1/20 point
0DH        2         Heading space in 1/20 point
0FH        2         End space in 1/20 point
11H        1         Header/footer and frame details
12H        1         Position of lines round header/footer
13H        4         Reserved (00H)
17H       80         Table of tab descriptions

Table 16.9: Format templates


Code     Field description (codes in bits 1-7)
----------------------------------------
30       Standard format paragraph
31-38    Paragraph format templates 1-8
39       Paragraph footnote text
40-87    Paragraph format templates 9-56
88-94    Paragraph heading levels 1-7
95-98    Paragraph index levels 1-4
99-102   Paragraph table levels 1-4
103      Paragraph header/footer

The byte at offset 01H specifies the variant of the print format template. As in Figure 16.7, the value 1 in bit 0 indicates that the paragraph is to be formatted with a print format template. If the paragraph is subsequently formatted directly, this bit is cleared, while the remaining bits containing the variant code are retained. The code in bits 1 to 7 indicates the variant of the print format template for paragraph formatting as shown in Table 16.9.

The next byte at offset 02H defines the attribute relating to the alignment of the paragraph (left, right, and so on). Table 16.10 shows the coding for these attributes.

Table 16.10: Coding of paragraph attributes


Bit     Field description
-------------------------------------
0-1     Paragraph align
        00 = Left
        01 = Centered
        10 = Right
        11 = Block
2       Paragraph on same page
3       Next paragraph to same page
4       Use two columns for paragraph
5-7     Reserved

The standard format is initially used for every paragraph. If a particular paragraph is subsequently formatted directly, Word stores the information on the paragraph print format in the byte at offset 03H (see Table 16.9).

The byte at offset 04H specifies the classification level of the paragraph and whether the paragraph is to be hidden. The coding of this byte is shown in Figure 16.9:

Figure 16.9: Coding heading levels

The next 6 words (12 bytes, offsets 05H to 10H) indicate the settings for indent, line spacing, and so on in 1/20 point units (see Table 16.8). At offset 11H, header/footer and frame information is stored. The coding of this byte is shown in Table 16.11.

If bits 4 and 5 contain the value 10, the sides of the frame will be displayed as single lines. The byte at offset 12H specifies the position of these lines (Figure 16.10).

Table 16.11: Coding of frame attributes


Bit     Field description
---------------------------------------
0       0 = Header
        1 = Footer
1       1 = Header/Footer on odd pages
2       1 = Header/Footer on even pages
3       1 = Header/Footer on 1st page
4-5     Frame type
        00 = No frame
        01 = Frame
        10 = Define frame with lines
        11 = --
6-7     Frame lines
        00 = Single frame
        01 = Double frame
        10 = Single frame bold
        11 = --

Figure 16.10: Coding of a frame composed of lines

The last part of a paragraph format definition (at offset 17H) contains any references to tabulators in the text. Four bytes are provided for each entry, and the format of these entries is shown in Table 16.12.

The last entry in the tabulator table is not necessarily 4 bytes long; it may contain between 2 and 4 bytes, because the number of directly formatted tabs can be calculated from the length byte at offset 00H.

Table 16.12: Coding of tab format


Offset     Field description
-------------------------------------------------
00H        Indent in 1/20 points from left margin
02H        Tab attributes
           Bits 0-2: Alignment
                         000 = Left
                         001 = Centered
                         010 = Right
                         011 = ?
                         100 = ?
                         101 = ?
                         111 = ?
           Bits 3-5: Fill characters
                         000 = Space
                         001 = .
                         010 = -
                         011 = _
           Bits 6-7: Reserved
03H        Reserved (00H)
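
The paragraph format of Tables 16.8, 16.10 and 16.12 can be decoded as in the following sketch. Several points are assumptions: fields that are not stored are read as 0, the six measurements are treated as signed 16-bit values (the first-line indent may be negative), and a shortened last tab entry is read with missing attributes set to 0. The key names are invented for this example; all measurements are converted from 1/20 point to points.

```python
import struct

ALIGNMENT = ("left", "centered", "right", "block")
TAB_TABLE = 0x17  # offset of the first tab description
MEASURES = ("right_indent", "left_indent", "first_line_indent",
            "line_spacing", "heading_space", "end_space")


def decode_paragraph_format(entry: bytes) -> dict:
    """Decode one paragraph format entry (length byte included)."""
    end = min(1 + entry[0], len(entry))       # first byte past the entry
    # Pad the fixed part so that fields that are not stored read as 0.
    fixed = entry[:min(end, TAB_TABLE)].ljust(TAB_TABLE, b"\x00")
    result = {
        "uses_template": bool(fixed[0x01] & 0x01),
        "template": fixed[0x01] >> 1,         # see Table 16.9
        "alignment": ALIGNMENT[fixed[0x02] & 0x03],
        "on_same_page": bool(fixed[0x02] & 0x04),
        "next_on_same_page": bool(fixed[0x02] & 0x08),
        "two_columns": bool(fixed[0x02] & 0x10),
    }
    words = struct.unpack_from("<6h", fixed, 0x05)
    result.update({name: w / 20 for name, w in zip(MEASURES, words)})
    tabs = []
    for pos in range(TAB_TABLE, end - 1, 4):  # last entry may be 2-4 bytes
        (indent,) = struct.unpack_from("<H", entry, pos)
        attrs = entry[pos + 2] if pos + 2 < end else 0
        tabs.append({"position": indent / 20,
                     "alignment": attrs & 0x07,
                     "fill": (attrs >> 3) & 0x07})
    result["tabs"] = tabs
    return result
```

The tab alignment and fill codes are returned as numbers, because Table 16.12 leaves several of them undefined.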

16.3.3 Format of the footnote block

Word stores footnotes and the associated references as normal ASCII strings in the text area. To facilitate the management of footnote numbering in the printout, the program creates a separate block for format information; the block number is stored in the pointer at offset 14H in the header. This footnote block does not always exist. If the value of the pointer in the header is the same as that of the pointer to the section format information (offset 16H), there is no footnote information. Otherwise, the block contains a table in which all the footnotes are described. Table 16.13 shows the structure.

The current number of footnotes + 1 present in the text is stored in the first word. The following word contains the maximum number of footnotes ever used in the text (that is, it includes any that have been deleted). Word uses this information to determine how much of the footnote description table (starting at offset 04H) has already been used. This is important, for example, if more than one block is used. For each footnote, a 4-byte text pointer to the position of the footnote reference and a pointer to the actual text of the footnote are stored. The first pair of pointers contains the start and end addresses of the last footnote text -- which explains why the table indicates the number of footnotes + 1. Word uses the first two entries to determine the length of the last footnote text.

Table 16.13: Structure of a footnote block


Offset     Bytes     Field description
----------------------------------------------------
00H        2         Number of footnotes in text + 1
02H        2         Maximum number of footnotes + 1
                     (includes deleted footnotes)
Beginning of table containing footnote descriptions
04H        4         Offset of footnote reference
                     (from beginning of text)
08H        4         Offset of footnote text
                     (from beginning of text)
...       ...        .....
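
The footnote table can be read as in the sketch below. The offsets in the table are counted from the beginning of the text, so the sketch adds 80H to obtain file offsets. The function name is hypothetical, and the input is assumed to be the complete footnote region, which may consist of more than one block.

```python
import struct

TEXT_START = 0x80  # the text area begins at file offset 80H


def parse_footnote_table(region: bytes) -> list:
    """Return (reference_offset, text_offset) pairs as file offsets."""
    count_plus_one, _maximum = struct.unpack_from("<2H", region, 0x00)
    entries = []
    for i in range(count_plus_one):
        reference, text = struct.unpack_from("<2I", region, 0x04 + 8 * i)
        entries.append((TEXT_START + reference, TEXT_START + text))
    return entries
```

Because the first word holds the number of footnotes + 1, the returned list contains one pair more than there are footnotes in the text.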

16.3.4 Format of the section table block

In Word, a document can be divided into several sections. As soon as the user defines these sections, Word will create a block containing the section table and a block containing the section formats. The pointer in the header at offset 16H is the number of the block containing the section formats, while the number of the block containing the section table is stored at offset 18H. If the block numbers are the same as the block number in the pointer to the page break table (at offset 1AH), the section tables and format blocks do not exist. Otherwise, Word stores the relevant information for each section in these two blocks. The structure of the section table is shown in Table 16.14.

Table 16.14: Structure of a block with a section table


Offset     Bytes     Field description
-------------------------------------------------------------
00H        2         Number of sections
02H        2         Maximum number of sections
Beginning of table containing the section and format pointers
04H        4         Offset of 1st character after
                     this section
08H        2         Reserved
0AH        2         Offset to format description
                     in the section format block
...       ...        .....

The first word contains the total number of sections present; the following word indicates the maximum number of sections created so far. In this way, Word can determine the extent to which this table has already been structured. The actual section table begins at offset 04H. This table contains three entries for each section. The first pointer marks the end of a section, and the last entry is interpreted as a pointer to the associated format description, stored as the offset from the start of the section format block to the format description. The middle (second) entry is presumably not used in Word 4.0.
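
A minimal sketch of reading the section table of Table 16.14. The function name and the shape of the result are chosen for this example; the section end pointers are returned exactly as stored, since the text does not state whether they are counted from the start of the file or of the text.

```python
import struct


def parse_section_table(region: bytes) -> list:
    """Return (section_end, format_offset) pairs from a section table."""
    count, _maximum = struct.unpack_from("<2H", region, 0x00)
    sections = []
    for i in range(count):
        # 4-byte end pointer, 2 reserved bytes, 2-byte format offset
        end, _reserved, fmt = struct.unpack_from("<IHH", region, 0x04 + 8 * i)
        sections.append((end, fmt))
    return sections
```

The second value of each pair is the offset of the format description within the section format block (header offset 16H).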

16.3.5 Format of the section format block

The number of the block containing the section formats is stored in the header, at offset 16H. Each section format has the following structure:

Table 16.15: Structure of section format


Offset     Bytes     Field description
--------------------------------------------------------------
00H        1         Number of following bytes in this entry
01H        1         Coding format template
                     Bit 0 = 1: a format template is used to
                     format this section;
                     Bits 1-7 define the template
                     (see Table 16.16)
02H        1         Attribute section (see Table 16.17)
03H        2         Page length in 1/20 point
05H        2         Page width in 1/20 point
07H        2         1st page number, or FFFFH for
                     continuous page numbering
09H        2         Upper border in 1/20 point
0BH        2         Length of text field in 1/20 point
0DH        2         Left border in 1/20 point
0FH        2         Text field width in 1/20 point
11H        1         Format section
                     (line number and footnotes)
12H        1         Columns in section
13H        2         Distance of header from top in 1/20 point
15H        2         Distance of footer from top in 1/20 point
17H        2         Distance between columns in 1/20 point
19H        2         Gutter width in 1/20 point
1BH        2         Distance of page numbers from top
                     border in 1/20 point
1DH        2         Distance of page numbers from left
                     border in 1/20 point
1FH        2         Distance of line numbers from left
                     border in 1/20 point
21H        2         Line numbers interval

The coding of the print format template for a section is as follows: if bit 0 = 1, a print format template will be used. In this case, bits 1 to 7 contain the variant of the print format required as shown in Table 16.16.

Table 16.16: Variants of print format templates for sections


Code     Field description
--------------------------------------
105      Standard format for a section
106-126  Section format templates 1-21

Table 16.17: The coding of section attributes


Bit     Field description
--------------------------------------
0-2     Section change
        000 = Continuous
        001 = Column
        010 = Page
        011 = Even
        100 = Odd
3-5     Page number
        000 = Arabic numbers
        001 = Large Roman capitals
        010 = Small Roman capitals
        011 = Large capitals
        100 = Small capitals
6-7     Line numbers
        00 = From beginning of page
        01 = From beginning of section
        10 = Continuous

Information such as the format of line numbers and so on is stored in an attribute byte, at offset 02H, coded as shown in Table 16.17.
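
The first fields of a section format entry can be decoded as in the sketch below. Since the length byte allows an entry to stop early, fields that are not stored are simply left out of the result; this handling, like the key names, is an assumption made for the example. Measurements are converted from 1/20 point to points.

```python
import struct

SECTION_CHANGE = ("continuous", "column", "page", "even", "odd")
PAGE_NUMBER = ("arabic numbers", "large roman capitals",
               "small roman capitals", "large capitals", "small capitals")
LINE_NUMBERS = ("from beginning of page", "from beginning of section",
                "continuous")
WORD_FIELDS = {"page_length": 0x03, "page_width": 0x05,
               "upper_border": 0x09, "text_length": 0x0B,
               "left_border": 0x0D, "text_width": 0x0F}


def _name(names, index):
    return names[index] if index < len(names) else "undefined"


def decode_section_format(entry: bytes) -> dict:
    """Decode the start of a section format entry (length byte included)."""
    end = min(1 + entry[0], len(entry))       # first byte past the entry
    result = {}
    if end > 0x01:
        result["uses_template"] = bool(entry[0x01] & 0x01)
        result["template"] = entry[0x01] >> 1
    if end > 0x02:
        attr = entry[0x02]
        result["section_change"] = _name(SECTION_CHANGE, attr & 0x07)
        result["page_number"] = _name(PAGE_NUMBER, (attr >> 3) & 0x07)
        result["line_numbers"] = _name(LINE_NUMBERS, attr >> 6)
    for name, offset in WORD_FIELDS.items():
        if offset + 2 <= end:                 # only fields that are stored
            (value,) = struct.unpack_from("<H", entry, offset)
            result[name] = value / 20
    return result
```

The remaining fields of the entry can be added to WORD_FIELDS in the same way.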

At offset 11H, there is another byte dealing with footnotes and line numbering. The relevant coding is shown in Figure 16.11.

Figure 16.11: The coding for line numbering
Figure 16.11

16.3.6 Format of a page-break block

The number of the block containing details of page breaks is stored in the Word header, at offset 1AH. This block is not present if the entry is the same as the block numbers for other regions (offsets 16H, 18H, 1CH). Table 16.18 shows the format for page breaks.

The first word contains the number of page breaks. The table containing the locations of the page breaks begins at offset 04H.

Table 16.18: Block containing details of page breaks


Offset     Bytes     Field description
-----------------------------------------------------
00H        2         Number of section with breaks
02H        2         Maximum number of page breaks
Beginning of table containing page-break descriptions
04H        4         Offset of 1st page break
08H        4         Offset of 2nd page break
...       ...        .....
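
A page-break block laid out as in Table 16.18 could be read as sketched below. The sketch assumes Intel (little-endian) byte order and, following the text above, treats the first word as the number of entries in the table. The function names are hypothetical. The block_offset helper applies the block-number formula given at the start of this chapter.

```python
import struct

BLOCK_SIZE = 0x80  # Word stores data in 128-byte blocks


def block_offset(block_number):
    """File offset of the first byte of a block (Offset = Block number * 80H)."""
    return block_number * BLOCK_SIZE


def parse_page_break_block(data):
    """Return (count, maximum, offsets) from the raw bytes of a page-break block.

    Assumptions: little-endian values; the word at 00H counts the entries.
    """
    if len(data) < 4:
        raise ValueError("page-break block is shorter than 4 bytes")
    count, maximum = struct.unpack_from("<HH", data, 0)
    if len(data) < 4 + 4 * count:
        raise ValueError("page-break table is truncated")
    # The table of 4-byte page-break offsets begins at offset 04H.
    offsets = list(struct.unpack_from("<%dI" % count, data, 4))
    return count, maximum, offsets
```

The length checks guard against a count that points beyond the data actually read from the file.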

16.3.7 File manager information block

Up to version 5.0 of Word, the number of the block containing file manager information is stored at offset 1CH of the header. Word uses this information, for example, when searching for a particular document. The structure of the block is shown in Table 16.19.

Dates are stored as ASCII text in the form month/day/year (for example, 01/23/90) and terminated with a null byte.

This information need not be present, and the fields can remain unused. In Word 5.0, unused entries in the block are overwritten with DCH.

Information on the internal memory structure has not been published by Microsoft. It is therefore possible that some of the details described in the above sections are not supported in all versions of Word.

Table 16.19: Structure of file manager information block


Offset     Bytes     Field description
------------------------------------------------------
00H        2         Contains 12H 00H
...       ...
Beginning of file manager information
12H       40         Document name
                     (ASCIIZ string, maximum 40 chars)
3AH       12         Author's name
                     (ASCIIZ string, maximum 12 chars)
46H       11         Reviser's name
                     (ASCIIZ string, maximum 11 chars)
51H       14         Keyword (ASCIIZ string, maximum
                     14 chars)
5FH       10         Comment (ASCIIZ string, maximum
                     10 chars)
69H        9         Version number
                     (ASCIIZ string, max. 9 chars)
72H        8         Date of last change (MM/DD/YY)
                     (ASCIIZ string)
79H        1         00H
7AH        8         Creation date (MM/DD/YY)
                     (ASCIIZ string)
81H        1         00H
82H        4         Text size
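
The text fields of Table 16.19 could be extracted as sketched below. This is only an illustration: it covers the name, keyword, comment and version fields and leaves out the dates and the text size. It assumes that a field ends at the first null byte, that a field consisting only of DCH bytes is unused, and that the text uses the DOS code page 437. None of these assumptions is confirmed by the description above, and all names are hypothetical.

```python
# (field name, offset, length in bytes) as listed in Table 16.19
FILE_INFO_FIELDS = [
    ("document_name", 0x12, 40),
    ("author", 0x3A, 12),
    ("reviser", 0x46, 11),
    ("keyword", 0x51, 14),
    ("comment", 0x5F, 10),
    ("version", 0x69, 9),
]


def read_asciiz(data, offset, size):
    """Read a null-terminated string from a fixed-size field."""
    raw = bytes(data[offset:offset + size]).split(b"\x00", 1)[0]
    if raw.strip(b"\xdc") == b"":
        return ""                      # empty or filled with DCH: unused
    return raw.decode("cp437", errors="replace")  # code page is an assumption


def parse_file_info(data):
    """Return the text fields of a file manager information block."""
    return {name: read_asciiz(data, off, size)
            for name, off, size in FILE_INFO_FIELDS}
```

Fields that are absent or overwritten with the filler byte come back as empty strings, so a caller does not need to distinguish the two cases.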

16.4 Winword file format (1.0-6.0)

Winword 1.0, 2.0 and 6.0 use a format similar to that of Word for DOS to store text. Each DOC file consists of three sections (header, text, format) as described in Figure 16.1. The header and the internal format structure depend on the Winword version. Each successive version remains backward compatible with its predecessors. The header of a Winword file occupies 384 bytes (offsets 00H to 17FH) and is followed by the text area, which stores the text in ANSI characters. The structure of a Winword header is shown in Table 16.20.

The complete structure of the Winword format is confidential and may not be published here. The information given in this section is public and easy to identify. For further information about the Winword file format, contact Microsoft; a copy of the specification is available after signing a licence agreement.

Table 16.20: Structure of a Winword header


Offset     Bytes     Remarks
---------------------------------------------------------
00H        2         Signature
                        9BH A5H (Winword 1.0)
                        DBH A5H (Winword 2.0)
                        D0H CFH (Winword 6.0)
02H        2         Version (major)
04H        2         Version (minor)
06H        2         Language stamp
08H        2         Next page number
0AH        1         Flag
0BH        1         Encryption (1 = Yes)
0CH        6         Internal use
12H        1         Platform
                            0: Windows
                            1: Mac
13H        1         Reserved
14H        2         Character set
                           0: ANSI
                        100H: Mac
16H        2         Internal character set
18H        4         Offset to 1st character in text area
1CH        4         Offset to text area end +1
20H        4         Offset to file end
...                  Other file pointers
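
The signature bytes and the three file pointers of Table 16.20 could be read as in the following sketch. It assumes Intel (little-endian) byte order for the 4-byte pointers and applies the table exactly as given. Whether the pointer fields sit at the same offsets in every Winword version is not stated here, so the result should be treated with caution. All names are hypothetical.

```python
import struct

# Signature bytes at offset 00H, in file order, as listed in Table 16.20
WINWORD_SIGNATURES = {
    b"\x9b\xa5": "Winword 1.0",
    b"\xdb\xa5": "Winword 2.0",
    b"\xd0\xcf": "Winword 6.0",
}


def identify_winword(header):
    """Return the version name for the signature, or None if it is unknown."""
    return WINWORD_SIGNATURES.get(bytes(header[:2]))


def text_area_pointers(header):
    """Return (first_char, text_end_plus_1, file_end) from offsets 18H-23H.

    Assumption: the three pointers are little-endian 32-bit values.
    """
    if len(header) < 0x24:
        raise ValueError("header is too short to hold the file pointers")
    return struct.unpack_from("<III", header, 0x18)
```

With these two values, the length of the text area would be text_end_plus_1 minus first_char, provided the pointers are interpreted as described in the table.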