The actual text is exported and the pictures grabbed by screen captures. The
reference to the non-existing figure 16.2 and two figures with number 16.3
is [sic].
*/
In versions 3.0, 4.0 and 5.0 of Word from Microsoft, a mixed ASCII/binary format is used for the text files. These files are divided into three parts:
Figure 16.1: Structure of an MS-Word file
The data is stored in the file in 128-byte blocks. Some internal pointers take the form of block numbers from which the offset to the first byte of the relevant block can be calculated, using the following formula:
Offset = Block number * 80H
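The formula above can be sketched in Python as follows. This is a minimal illustration only; the names BLOCK_SIZE, block_offset and read_block are hypothetical helpers and not part of any Word tooling.

```python
BLOCK_SIZE = 0x80  # 128 bytes per block, as stated in the text


def block_offset(block_number):
    """Return the file offset of the first byte of the given block."""
    return block_number * BLOCK_SIZE


def read_block(data, block_number):
    """Slice one 128-byte block out of the raw file contents."""
    start = block_offset(block_number)
    return data[start:start + BLOCK_SIZE]
```

For example, block 3 starts at offset 180H, because 3 * 80H = 180H.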
Figure 16.2 shows a sample text produced using Word:
Figure 16.2: Text samples using MS-Word 4.0
Soft hyphenation      Soft hy-phen-ation        Alt -
Hyphenation 2         Hypen-ation               Ctrl -
Paragraph block text
Bold on               Bold text                 Alt B
Normal on             Normal text               Alt Spacebar
Italics on            Italic text               Alt I
Underline on          Underlined text           Alt U
Underline double      Double-underlined text    Alt D
Small Capitals        Small Capitals            Alt K
Strike out            Strike out text           Alt D
Superscript           Superscript text          Alt H
Subscript             Subscript text            Alt T
Hidden                Hidden text               Alt G
Centered              Centered text             Alt Z
Left aligned          Left-aligned text         Alt L
Right aligned         Right-aligned text        Alt R
Indent left 1.5 cm    Indented text             Alt M
Indent right          Indented text             Alt V
Indent variable                                 Alt O
Standard paragraph    Indent for 1st line negative; all other lines of a paragraph are indented left.
Double line spacing                             Alt 2
Capital               CAPITALS                  Alt +
If a block is not completely filled with data, the remainder of the block is undefined. The first block (block 0) of an MS-Word file contains a 128-byte header defining certain items of control information. This is followed by the blocks containing the actual text. With very few exceptions, these do not contain control codes. If there is no text, these blocks are omitted. The trailer consists of a number of blocks containing the format information for the text. A hex-dump of the corresponding Word file is shown in Figure 16.2.
Figure 16.3: Hex-dump of an MS-Word 4.0 file
Figure 16.3: Hex-dump of an MS-Word 4.0 file
Various types of pointer are used in the file:
16.1 Word headers (versions 3.0, 4.0, 5.0)
16.2 The Word text area
16.3 Format area in Word
16.4 Winword file format (1.0-6.0)
Table 16.1: Format of a Word 4.0/5.0 header
Offset  Bytes  Field description
----------------------------------------------------------
00H     4      Word signature 31H BEH 00H 00H
04H     8      Reserved (00H ABH 00H 00H 00H 00H 00H 00H)
0EH     4      Pointer to End-of-text (1st character after text)
12H     2      Block pointer to the block containing the paragraph format
14H     2      Block pointer to the block containing the footnote table
16H     2      Block pointer to the block containing the section formats
18H     2      Block pointer to the block containing the nation table
1AH     2      Block pointer to the block containing the table of page breaks
1CH     2      Block pointer to the block containing file manager information (author, date, and so on)
1EH     66     File name of print format, ASCII string
60H     2      Flag (reserved for Windows Write)
62H     8      Name of the printer driver, ASCII string
6AH     2      Number of blocks used in the file
6CH     2      Bit field for corrected text areas
6EH     18     Reserved (in version 4.0 always 00H); after version 5.0 used for unknown code
The Bytes column is in decimal. At offset 0EH, there is a 4-byte pointer (file pointer) to the first unused character after the text. The value is interpreted as an offset from the start of the file to the relevant byte. The number of characters in the text can be calculated by subtracting 80H (the length of block 0) from this value. The following 6 words contain 2-byte pointers which are interpreted as block numbers. Information on the format of the text is contained in the specified blocks. A file pointer to the first byte of the block can be calculated by multiplying the block number by 80H. (The structure of format blocks is described below.)
The path, including the drive and file name, for the print format template is stored at offset 1EH. This text is an ASCIIZ string, that is, the last character is 00H. In MS-DOS, the path is limited to 65 characters, which explains why 66 characters are reserved in the header. Unused bytes must be set to 00H. At offset 62H, there is an 8-byte field containing the name of the printer driver. If the name is shorter than 8 characters, the remaining bytes must be set to 00H. The drive, the path and the driver extension are not specified.
The field at offset 6AH indicates the number of blocks containing useful information. A Word file may contain additional blocks, but these are usually filled with null bytes and are ignored.
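The header fields described so far could be read roughly as in the following Python sketch. It assumes little-endian byte order (lowest byte first, as on MS-DOS) and uses the offsets of Table 16.1 directly; the function names and dictionary keys are hypothetical labels chosen for this illustration, and the pointer at 18H is given a neutral name.

```python
import struct

WORD_SIGNATURE = bytes([0x31, 0xBE, 0x00, 0x00])
BLOCK_SIZE = 0x80

# Illustrative labels for the six block pointers at offsets 12H to 1CH.
POINTER_NAMES = (
    "paragraph_format", "footnote_table", "section_formats",
    "table_at_18h", "page_breaks", "file_manager_info",
)


def cstring(raw):
    """Decode a 00H-padded ASCII field, stopping at the first null byte."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


def parse_header(block0):
    """Extract the documented fields from block 0 of a Word file."""
    if len(block0) < BLOCK_SIZE:
        raise ValueError("header block must be 128 bytes long")
    if block0[:4] != WORD_SIGNATURE:
        raise ValueError("unexpected signature")
    (end_of_text,) = struct.unpack_from("<I", block0, 0x0E)
    pointers = struct.unpack_from("<6H", block0, 0x12)
    blocks_used, correction_bits = struct.unpack_from("<2H", block0, 0x6A)
    return {
        # Number of text characters: end-of-text pointer minus block 0.
        "text_length": end_of_text - BLOCK_SIZE,
        # Block numbers converted to file offsets (block number * 80H).
        "block_offsets": {
            name: ptr * BLOCK_SIZE for name, ptr in zip(POINTER_NAMES, pointers)
        },
        "print_format_file": cstring(block0[0x1E:0x60]),
        "printer_driver": cstring(block0[0x62:0x6A]),
        "blocks_used": blocks_used,
        "correction_bits": correction_bits,
    }
```

Reading each field at its explicit offset, rather than with one packed format string, keeps the sketch independent of the reserved area that follows the signature.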
The word at offset 6CH is interpreted as a 16-bit field. It is used to store the format coding for corrected areas of text and generally contains the value 00H 00H, but when a text is modified using the command FORMAT/correction, Word stores the selected settings in the individual bits shown in Table 16.2.
Up to version 4.0, bits 6 to 15 are unused. From Word 5.0 onwards, bits 6 and 7 are used, but their exact meaning is not known. The remaining 18 bytes in the header are reserved and contain the value 00H in version 4.0. From version 5.0, a number of pointers are found in this position, but the significance of these is unknown.
Table 16.2: Coding for FORMAT/ Corrections in Word 4.0
Bit    Field description
----------------------------------------------------
0      Format bar: 1 = Yes, 0 = No
3-1    Inserted text
         000 = Underline
         001 = Large capitals
         010 = Normal
         011 = Bold
         100 = --
         101 = --
         110 = Underline double
         111 = --
5-4    Position of correction bar
         00 = No bar
         01 = Left
         10 = Right
         11 = Alternate (left, right)
15-6   Reserved (until version 4.0, unknown in 5.0)
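The bit field of Table 16.2 could be decoded as in this Python sketch. It assumes that a range such as "3-1" lists the most significant bit first; the function and key names are hypothetical.

```python
INSERTED_TEXT = {
    0b000: "underline",
    0b001: "large capitals",
    0b010: "normal",
    0b011: "bold",
    0b110: "underline double",
}
BAR_POSITION = ("no bar", "left", "right", "alternate")


def decode_corrections(word):
    """Split the 16-bit value at header offset 6CH into its documented parts."""
    return {
        "format_bar": bool(word & 0x01),                      # bit 0
        "inserted_text": INSERTED_TEXT.get((word >> 1) & 0b111, "undefined"),
        "bar_position": BAR_POSITION[(word >> 4) & 0b11],     # bits 5-4
        "reserved": word >> 6,                                # bits 15-6
    }
```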
In versions 3.0 to 5.0 of Word, the first byte containing text stored in ASCII format begins at offset 80H. This text may extend over several blocks. In the last text block, the area from the last valid text character to the end of the block is undefined. The end of the text is indicated in the header (offset 0EH). If Word stores a blank text window, the text block is omitted, and the format information immediately follows the header.
The text contains only a small number of control characters. Table 16.3 lists some of these codes.
Codes 1-5 are used to mark text blocks created by Word. Footnotes are only marked in the text if the user does not indicate footnote markers. If a footnote is allocated an automatic administration number, Word will store this footnote as normal text and information on formatting the footnotes is stored in a separate block in the trailer.
The characters CR/LF (carriage return/line feed) indicate the end of a paragraph in Word. It is therefore possible to import ASCII files into Word, because many editors place a CR/LF after every line. However, all Word paragraph commands will then be applied to individual lines, because Word will interpret them as paragraphs. It may therefore be necessary to remove the CR/LF characters at the end of individual lines.
Word uses the ASCII code 31 (1FH) to mark possible hyphenation points. The value 255 (FFH) marks a protected space, that is, a space between two words at which the line must not be broken.
Table 16.3: Interpretation of control codes in Word 4.0/5.0
Code     Field description
---------------------------------------------
01H      Text block page
02H      Text block print date
03H      Text block print time
04H      Reserved
05H      Footnote without a footnote marker
09H      Tabulator
0BH      Line feed
0CH      Form feed
0D,0AH   CR/LF as paragraph end
1FH      Hyphenation conditional
C4H      Hyphenation protected
FFH      Space protected
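As an illustration of how these control codes might be handled when extracting plain text, the following Python sketch splits the text area into paragraphs. Mapping the protected hyphen to "-" and the protected space to an ordinary space is a choice made for this example, not something Word prescribes; plain_paragraphs is a hypothetical name.

```python
def plain_paragraphs(text):
    """Return the paragraphs of a Word text area as a list of byte strings."""
    text = text.replace(b"\x1f", b"")    # drop conditional hyphenation marks
    text = text.replace(b"\xc4", b"-")   # protected hyphen -> ordinary hyphen
    text = text.replace(b"\xff", b" ")   # protected space -> ordinary space
    return text.split(b"\r\n")           # CR/LF ends a paragraph
```

The sketch works on bytes because characters above 7FH depend on the code page in use, which is not specified here.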
Figure 16.4: Pointers to format regions
16.3.1 Character formats
16.3.2 Paragraph format block
16.3.3 Format of the footnote block
16.3.4 Format of the section table block
16.3.5 Format of the section format block
16.3.6 Format of a page-break block
16.3.7 File manager information block
The text shown in Figure 16.5 is to be given the character formats normal, bold and italic as shown in the format table. Whenever bold appears in the text, the program merely refers to the appropriate entry in this table. A pointer marks the start of the bold text. The next format specification cancels the bold. This process is used in Versions 3.0, 4.0 and 5.0. The details shown below relate to Word 4.0 but, to a great extent, they also apply to version 5.0.
The block containing the character formats is structured as shown in Table 16.4. In the first four bytes, there is an offset pointer to the first character in the text to which the format applies. Since this character is always located in block 1, the pointer has the value 00H 00H 00H 80H, but in MS-DOS, the lowest byte is stored first (80H 00H 00H 00H). At offset 04H, there is a pointer table containing two pointers for each format area:
Figure 16.5: Text formats
Table 16.4: Structure of a character format block
Offset  Bytes  Field description
----------------------------------------------------------
00H     4      Pointer to the 1st character in the 1st format
               Beginning of table containing text and format pointers:
04H     4      Pointer to 1st char in 2nd format
08H     2      Pointer to format table for 1st format
0AH     4      Pointer to 1st char in different format than 2nd format
0EH     2      Pointer to format table for 2nd format
...     ..     ....
               Beginning of format table
...     ...    ....
7EH     ...    Last format entry
7FH     1      Number of text areas to be formatted
Since the number of sections to be formatted varies during word processing, Word begins structuring the format table from the end of the block (that is, the last entry at offset 07EH is the first format in the table). Word stores each new format definition before the previous entry. The number of text areas to be formatted, and thus also the number of valid text pointers (excluding the start pointer), is stored at the end of the block (offset 7FH). The structure of the format table is described in more detail below.
Figure 16.6: Position of the table containing the format descriptions
With longer texts, the number of text and format pointers may exceed the space available in the pointer table, which will cause the table to overflow. Word then creates a new block for character formats and stores information on the existence of this additional block in the last text pointer -- if the value of this pointer is the same as the start address of the next block, an additional block is involved. As soon as an additional block is required, Word copies the contents of the current (last) block into memory and sets the number of entries (last byte) to zero. The start pointer is set to the value of the last valid text pointer in the preceding block. Thus the copy contains all the information from the preceding block, and Word fills up the new table with text and format pointers as required.
In the pointer table, a 2-byte format pointer is allocated to every 4-byte text pointer. This indicates the offset from the first text pointer to the relevant format definition at the end of the block. If 4 is added to this value, the result is the offset from the start of the block. If the format pointer contains FFFFH, the text is to be displayed in standard format. For example, the format pointer after the last valid text pointer may contain this value in order to switch back to standard format. If the formatted text exceeds a block, the value FFFFH is stored in the following block. Table 16.5 shows the structure of each entry in the format table.
Table 16.5: Structure of a format table entry
Offset   Bytes  Field description
-------------------------------------------------------
00H      1      Number of following bytes for this entry
01H      1      Coding print template:
                  Bit 0 = 1: char formatted with a template,
                  Bits 1-7 define the modes (see Table 16.6)
02H      1      Format code (see Figure 16.8)
03H      1      Font size in 1/2 points
04H      1      Character attribute (see Table 16.7)
05H      1      Reserved
06H      1      Character position (Superscript, subscript, and so on)
07H-0AH  4      Reserved
A format generally consists of several bytes. The first byte indicates the number of following bytes in the definition. The minimum length of a format definition is 2 bytes (1 length byte, 1 format byte). However, if only one of the later bytes (for example, the character position) is required, all the intervening fields must also be stored, even though they are not used.
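The interplay of text pointers, format pointers and length-prefixed format entries can be sketched in Python as follows. The sketch assumes a single 128-byte character format block laid out as described above (start pointer, then pairs of a 4-byte text pointer and a 2-byte format pointer, entry count in the last byte) and little-endian values; character_runs is a hypothetical name.

```python
import struct


def character_runs(block):
    """Return (start, end, format_bytes) tuples for one character format block.

    format_bytes holds the bytes that follow the length byte of the format
    entry, or None when the format pointer is FFFFH (standard format).
    """
    count = block[0x7F]                        # number of formatted text areas
    (start,) = struct.unpack_from("<I", block, 0)
    runs = []
    for i in range(count):
        # Each pointer pair occupies 6 bytes, beginning at offset 04H.
        end, fmt_ptr = struct.unpack_from("<IH", block, 4 + 6 * i)
        if fmt_ptr == 0xFFFF:
            fmt = None
        else:
            pos = fmt_ptr + 4                  # offset from the start of the block
            length = block[pos]                # number of following bytes
            fmt = block[pos + 1:pos + 1 + length]
        runs.append((start, end, fmt))
        start = end                            # next area begins where this one ends
    return runs
```

Overflow into an additional block is not handled here; a caller would have to compare the last text pointer with the start address of the next block, as described earlier.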
The second byte of the character format specifies the appropriate variant of the print format template, which describes how the text characters are to be formatted. Figure 16.7 shows the coding of the second byte.
Figure 16.7: Definition of a (format) template
If the lowest bit (bit 0) is set, the remaining bits will contain the variant of the print format template required. Table 16.6 shows some of the templates given in the Word manual:
Table 16.6: Various (format) templates
Code    Field description
------------------------------
0       Standard character
1-12    Template number 1-12
13      Footnote reference
14-18   Template number 13-17
19      Number of pages
20-27   Template number 18-25
28      Short information
29      Line numbers
30-64   Unused
Additional information on this subject can be found in the standard Word documentation. Word stores information on the format structure (bold, italic, font number) in the third byte (if present). The coding is shown in Figure 16.8.
Figure 16.8: Font format coding
Bits 0 and 1 determine the typeface style (bold, italic), while the remaining bits are used for the font number. The allocation of font and font number depends on the printer driver.
The fourth byte specifies the font size in 1/2 points. The remaining character attributes are stored in the fifth byte. The coding is shown in Table 16.7.
So far, the byte at offset 05H has remained reserved. The same applies to the bytes at offsets 07H-0AH. The byte at offset 06H indicates whether a character is to be raised (superscript) or lowered (subscript).
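If the byte at offset 06H (byte 7 of the entry) is not equal to 0, its bit 7 defines how the character is positioned: values 01-7FH (bit 7 clear) denote superscript characters, and values 80-FFH (bit 7 set) denote subscript characters.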
Table 16.7: Character format attributes
Bits   Field description
---------------------------------------------------
0      1 = Underline
1      1 = Strike out
2      1 = Strike out double
3      1 = Insert character in correction mode
4-5    Character size
         00: Normal
         01: Large capitals
         10: --
         11: Capitals
6      Special characters (page, date, and so on)
7      Characters hidden
Table 16.7: Character format attributes
Byte 7   Description
---------------------------------
00H      Character normal
01-7FH   Superscript characters
80-FFH   Subscript characters
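The attribute byte (offset 04H) and the position byte (offset 06H) of a character format entry could be decoded as in this Python sketch. It assumes that in the two-bit size field bit 5 is the more significant bit; all names are hypothetical.

```python
CHAR_SIZES = {0b00: "normal", 0b01: "large capitals", 0b11: "capitals"}


def decode_char_attributes(attr):
    """Decode the character attribute byte according to the bit table above."""
    return {
        "underline": bool(attr & 0x01),
        "strike_out": bool(attr & 0x02),
        "strike_out_double": bool(attr & 0x04),
        "inserted_in_correction_mode": bool(attr & 0x08),
        "size": CHAR_SIZES.get((attr >> 4) & 0b11, "undefined"),
        "special_character": bool(attr & 0x40),
        "hidden": bool(attr & 0x80),
    }


def decode_position(pos):
    """Classify the character position byte (byte 7 of the entry)."""
    if pos == 0:
        return "normal"
    return "subscript" if pos & 0x80 else "superscript"
```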
The structure defining the paragraph formats differs somewhat from the character format structure. As before, the number of following bytes is stored in the first byte. Table 16.8 gives the structure of a paragraph format definition.
Table 16.8: Paragraph format in Word 4.0/5.0
Offset  Bytes  Field description
------------------------------------------------------------
00H     1      Number of following bytes for this entry
01H     1      Coding format template:
                 Bit 0 = 1: Format template is used to format this paragraph
                 Bits 1-7 define the template number (see Table 16.9)
02H     1      Paragraph attribute (see Table 16.10)
03H     1      Number of standard paragraph format (usually code 30, see Table 16.9)
04H     1      Heading level and representation (see Figure 16.9)
05H     2      Right indent in 1/20 point
07H     2      Left indent in 1/20 point
09H     2      Left indent of first line in 1/20 point
0BH     2      Line spacing in 1/20 point
0DH     2      Heading space in 1/20 point
0FH     2      End space in 1/20 point
11H     1      Header/footer and frame details
12H     1      Position of lines round header/footer
13H     4      Reserved (00H)
17H     80     Table of tab descriptions
Table 16.9: Format templates
Code     Field description codes in bits 1-7
----------------------------------------
30       Standard format paragraph
31-38    Paragraph format templates 1-8
39       Paragraph footnote text
40-87    Paragraph format templates 9-56
88-94    Paragraph heading levels 1-7
95-98    Paragraph index levels 1-7
99-102   Paragraph table levels 1-7
103      Paragraph header/footer
The byte at offset 01H specifies the variant of the print format template. As in Figure 16.7, the value 1 in bit 0 indicates that the paragraph is to be formatted with a print format template. In case of retrospective direct formatting, this bit is zeroed, while the remaining bits containing the variant code are retained. The code in bits 1 to 7 indicates the variant of the print format template for paragraph formatting as shown in Table 16.9.
The next byte at offset 02H defines the attribute relating to the alignment of the paragraph (left, right, and so on). Table 16.10 shows the coding for these attributes.
Table 16.10: Coding of paragraph attributes
Bit   Field description
-------------------------------------
0-1   Paragraph align
        00 = Left
        01 = Centered
        10 = Right
        11 = Block
2     Paragraph on same page
3     Next paragraph to same page
4     Use two columns for paragraph
5-7   Reserved
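The template byte (offset 01H) and the attribute byte (offset 02H) of a paragraph format could be taken apart as in the following Python sketch. It assumes that bit 1 is the more significant bit of the alignment field; the function and key names are hypothetical.

```python
ALIGNMENT = ("left", "centered", "right", "block")


def decode_template_byte(value):
    """Return (template_used, variant_code) for the byte at offset 01H."""
    return bool(value & 0x01), value >> 1    # bit 0 is the flag, bits 1-7 the code


def decode_paragraph_attributes(value):
    """Decode the paragraph attribute byte at offset 02H."""
    return {
        "align": ALIGNMENT[value & 0b11],
        "paragraph_on_same_page": bool(value & 0x04),
        "next_paragraph_on_same_page": bool(value & 0x08),
        "two_columns": bool(value & 0x10),
    }
```

A template byte of 3DH, for instance, has bit 0 set and carries the code 30 (standard format paragraph) in bits 1 to 7.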
The standard format is initially used for every paragraph. In case of retrospective direct formatting of a particular paragraph, Word stores the information on the paragraph print format in the byte at offset 03H (see Table 16.9).
The byte at offset 04H specifies the classification level of the paragraph and whether the paragraph is to be hidden. The coding of this byte is shown in Figure 16.9:
Figure 16.9: Coding heading levels
The next six 2-byte fields (offsets 05H to 10H) indicate the settings for indent, line spacing, and so on in 1/20 point units (see Table 16.8). At offset 11H, header/footer and frame information is stored. The coding of this byte is shown in Table 16.11.
If bits 4 and 5 contain the value 10, the sides of the frame will be displayed as single lines. The byte at offset 12H specifies the position of these lines (Figure 16.10).
Table 16.11: Coding of frame attributes
Bit   Field description
---------------------------------------
0     0 = Header
      1 = Footer
1     1 = Header/Footer on odd pages
2     1 = Header/Footer on even pages
3     1 = Header/Footer on 1st page
4-5   Frame type
        00 = No frame
        01 = Frame
        10 = Define frame with lines
        11 = --
6-7   Frame lines
        00 = Single frame
        01 = Double frame
        10 = Single frame bold
        11 = --
Figure 16.10: Coding of a frame composed of lines
The last part of a paragraph format definition (at offset 17H) contains any references to tabulators in the text. Four bytes are provided for each entry, and the format of these entries is shown in Table 16.12.
The last entry in the tabulator table is not necessarily 4 bytes long; it may contain between 2 and 4 bytes, because the number of directly formatted tabs can be calculated from the length byte at offset 00H.
Table 16.12: Coding of tab format
Offset  Field description
-------------------------------------------------
00H     Indent in 1/20 points from left margin
02H     Tab attributes
          Bits 0-2: Alignment
            000 = Left
            001 = Centered
            010 = Right
            011 = ?
            100 = ?
            101 = ?
            111 = ?
          Bits 3-5: Fill characters
            000 = Space
            001 = .
            010 = -
            011 = _
          Bits 6-7: Reserved
03H     Reserved (00H 00H)
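The tab table at the end of a paragraph format entry might be read as in this Python sketch. It takes the whole entry, including its length byte, and derives the extent of the tab table from that byte. Treating a missing attribute byte in a shortened last entry as 0 is an assumption of this example, as are all names.

```python
import struct

TAB_ALIGN = {0: "left", 1: "centered", 2: "right"}
TAB_FILL = {0: " ", 1: ".", 2: "-", 3: "_"}
TAB_TABLE_OFFSET = 0x17


def parse_tabs(entry):
    """Return the tab descriptions stored from offset 17H of a paragraph entry."""
    length = entry[0]                              # number of following bytes
    table = entry[TAB_TABLE_OFFSET:1 + length]     # empty if no tabs are stored
    tabs = []
    for i in range(0, len(table), 4):
        chunk = table[i:i + 4]
        if len(chunk) < 2:                         # not even an indent value left
            break
        (indent,) = struct.unpack_from("<H", chunk, 0)
        attr = chunk[2] if len(chunk) > 2 else 0   # assumed default for short entry
        tabs.append({
            "indent_20ths_point": indent,
            "align": TAB_ALIGN.get(attr & 0b111, "unknown"),
            "fill": TAB_FILL.get((attr >> 3) & 0b111, "unknown"),
        })
    return tabs
```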
The current number of footnotes + 1 present in the text is stored in the first word. The following word contains the maximum number of footnotes ever used in the text (that is, it includes any that have been deleted). Word uses this information to determine how much of the footnote description table (starting at offset 04H) has already been used. This is important, for example, if more than one block is used. For each footnote, a 4-byte text pointer to the position of the footnote reference and a pointer to the actual text of the footnote are stored. The first pair of pointers contains the start and end addresses of the last footnote text -- which explains why the table indicates the number of footnotes + 1. Word uses the first two entries to determine the length of the last footnote text.
Table 16.13: Structure of a footnote block
Offset  Bytes  Field description
----------------------------------------------------
00H     2      Number of footnotes in text + 1
02H     2      Number of footnotes in text + 1 (includes deleted footnotes)
               Beginning of table containing footnote descriptions
04H     4      Offset of footnote reference (from beginning of text)
08H     4      Offset of footnote text (from beginning of text)
...     ...    .....
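A footnote block laid out as in Table 16.13 could be read with the following Python sketch. It assumes little-endian values and that all pointer pairs are available in the data passed in (a table spanning several blocks would have to be concatenated first); parse_footnote_table is a hypothetical name.

```python
import struct


def parse_footnote_table(data):
    """Return the (reference_offset, text_offset) pairs of a footnote block."""
    entries, _maximum = struct.unpack_from("<2H", data, 0)   # footnotes + 1
    return [struct.unpack_from("<2I", data, 4 + 8 * i) for i in range(entries)]
```

The offsets returned are relative to the beginning of the text, so 80H would have to be added to obtain file offsets.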
Table 16.14: Structure of a block with a section table
Offset  Bytes  Field description
-------------------------------------------------------------
00H     2      Number of sections
02H     2      Maximum number of sections
               Beginning of table containing the section and format pointers
04H     4      Offset of 1st character after this section
08H     2      Reserved
0AH     2      Offset to format description in the section format block
...     ...    .....
The first word contains the total number of sections present; the following word indicates the maximum number of sections created so far. In this way, Word can determine the extent to which this table has already been structured. The actual section table begins at offset 04H. This table contains three entries for each section. The first pointer marks the end of a section, and the last entry is interpreted as a pointer to the associated format description, stored as the offset from the start of the section format block to the format description. The middle (second) entry is presumably not used in Word 4.0.
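The section table and its link to the section format block can be sketched in Python as follows. The sketch assumes little-endian values and 8-byte table entries (4-byte end pointer, 2 reserved bytes, 2-byte format offset); both function names are hypothetical.

```python
import struct


def parse_section_table(block):
    """Return (end_of_section, format_offset) pairs from a section table block."""
    count, _maximum = struct.unpack_from("<2H", block, 0)
    sections = []
    for i in range(count):
        end, _reserved, fmt_offset = struct.unpack_from("<IHH", block, 4 + 8 * i)
        sections.append((end, fmt_offset))
    return sections


def section_format_entry(format_block, fmt_offset):
    """Return one length-prefixed entry from the section format block."""
    length = format_block[fmt_offset]          # number of following bytes
    return format_block[fmt_offset:fmt_offset + 1 + length]
```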
Table 16.15: Structure of section format
Offset  Bytes  Field description
--------------------------------------------------------------
00H     1      Number of following bytes in this entry
01H     1      Coding format template
                 Bit 0 = 1: a format template is used to format this section;
                 Bits 1-7 define the template (see Table 16.16)
02H     1      Attribute section (see Table 16.17)
03H     2      Page length in 1/20 point
05H     2      Page width in 1/20 point
07H     2      1st page number, or FFFFH for continuous page numbering
09H     2      Upper border in 1/20 point
0BH     2      Length of text field in 1/20 point
0DH     2      Left border in 1/20 point
0FH     2      Text field width in 1/20 point
11H     1      Format section (line number and footnotes)
12H     1      Columns in section
13H     2      Distance of header from top in 1/20 point
15H     2      Distance of footer from top in 1/20 point
17H     2      Distance between columns in 1/20 point
19H     2      Gutter width in 1/20 point
1BH     2      Distance of page numbers from top border in 1/20 point
1DH     2      Distance of page numbers from left border in 1/20 point
1FH     2      Distance of line numbers from left border in 1/20 point
21H     2      Line numbers interval
The coding of the print format template for a section is as follows: if bit 0 = 1, a print format template will be used. In this case, bits 1 to 7 contain the variant of the print format required as shown in Table 16.16.
Table 16.16: Variants of print format templates for sections
Code      Field description
--------------------------------------
105       Standard format for a section
106-126   Section format templates 1-21
Table 16.17: The coding of section attributes
Bit   Field description
--------------------------------------
0-2   Section change
        000 = Continuous
        001 = Column
        010 = Page
        011 = Even
        100 = Odd
3-5   Page number
        000 = Arabic numbers
        001 = Large Roman capitals
        010 = Small Roman capitals
        011 = Large capitals
        100 = Small capitals
6-7   Line numbers
        00 = From beginning of page
        01 = From beginning of section
        10 = Continuous
Information such as the format of line numbers and so on is stored in an attribute byte, at offset 02H, coded as shown in Table 16.17.
At offset 11H, there is another byte dealing with footnotes and line numbering. The relevant coding is shown in Figure 16.11.
Figure 16.11: The coding for line numbering
The first word contains the number of page breaks. The table containing the locations of the page breaks begins at offset 04H.
Table 16.18: Block containing details of page breaks
Offset  Bytes  Field description
-----------------------------------------------------
00H     2      Number of section with breaks
02H     2      Maximum number of page breaks
               Beginning of table containing page-break descriptions
04H     4      Offset of 1st page break
08H     4      Offset of 2nd page break
...     ...    .....
Dates are stored in the form month/day/year (for example, 01/23/90) in ASCII format and terminated with a null byte.
This information need not be present, and the fields can remain unused. In Word 5.0, unused entries in the block are overwritten with DCH.
Information on the internal memory structure has not been published by Microsoft. It is therefore possible that some of the details described in the above sections are not supported in all versions of Word.
Table 16.19: Structure of file manager information block
Offset  Bytes  Field description
------------------------------------------------------
00H     2      Contains 12H 00H
...     ...    Beginning of file manager information
12H     40     Document name (ASCIIZ string, maximum 40 chars)
3AH     12     Author's name (ASCIIZ string, maximum 12 chars)
46H     11     Reviser's name (ASCIIZ string, maximum 12 chars)
51H     14     Keyword (ASCIIZ string, maximum 14 chars)
5FH     10     Comment (ASCIIZ string, maximum 10 chars)
69H     9      Version number (ASCIIZ string, max. 9 chars)
72H     8      Date of last change (MM/DD/YY) (ASCIIZ string)
79H     1      00H
7AH     8      Creation date (MM/DD/YY) (ASCIIZ string)
81H     1      00H
82H     4      Text size
The complete structure of the Winword format is confidential and may not be published here. The header information in Table 16.20 is public and easy to identify. For further information about the Winword file format, contact Microsoft; a copy of the specification is available after signing a licence agreement.
Table 16.20: Structure of a Winword header
Offset  Bytes  Remarks
---------------------------------------------------------
00H     2      Signature
                 9BH A5H (Winword 1.0)
                 DBH A5H (Winword 2.0)
                 D0H CFH (Winword 6.0)
02H     2      Version (Major)
04H     2      Version (Minor)
06H     2      Language stamp
08H     2      Next page number
0AH     1      Flag
0BH     1      Encryption (1 = Yes)
0CH     1      Internal use (6 bytes in total, 0CH to 11H)
12H     1      Platform 0: Windows 1: Mac
13H     1      Reserved
14H     2      Character set 0: ANSI 100H: Mac
16H     2      Internal character set
18H     4      Offset to 1st character in text area
1CH     4      Offset to text area end +1
20H     4      Offset to file end
...            Other file pointers
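The signature bytes listed in Table 16.20 are enough for a rough version check, as in this Python sketch. It reads the Winword 6.0 signature as D0H CFH and assumes little-endian file pointers; winword_version and text_length are hypothetical names, and nothing beyond the fields listed in the table is interpreted.

```python
import struct

SIGNATURES = {
    b"\x9b\xa5": "Winword 1.0",
    b"\xdb\xa5": "Winword 2.0",
    b"\xd0\xcf": "Winword 6.0",
}


def winword_version(data):
    """Return the version name for the first two bytes, or None if unknown."""
    return SIGNATURES.get(bytes(data[:2]))


def text_length(header):
    """Length of the text area from the pointers at offsets 18H and 1CH."""
    first_char, end_plus_one = struct.unpack_from("<2I", header, 0x18)
    return end_plus_one - first_char
```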