-----------------------------
INSIDE TURBO PASCAL 6.0 UNITS
-----------------------------

by

William L. Peavy
-----------------
Revised: April 16, 1991

ABSTRACT

If you want to know what is in a .TPU (unit) file produced by Version 6.0 of Turbo Pascal from Borland International, then this paper is for you. It doesn't explain quite everything, since I don't have access to secret documents or anything like that, and since some of the data in .TPU files just doesn't have enough auxiliary information to make its role clear. However, it is possible to learn a great deal about how Turbo Pascal organizes the information it needs to refer to, and it is also possible to learn just what kind of code the compiler produces.

This is the third in a series of reports on the subject of Turbo Pascal Units, the first dealing with Turbo Pascal Version 5.0 and the second with Turbo Pascal 5.5. The evolution of these files in the face of changing requirements has been fascinating to behold and deciphering their contents has been challenging to say the least.

The programs supplied with this report have been reorganized from their 5.5 style in some ways and many identifiers have been changed. These changes were more for style than for substance. Other changes were dictated by the changes in the organization of the TPU file itself and certain errors in the 5.5 programs have been corrected. In addition, other errors of interpretation have been fixed which has led to some enhanced descriptive capability.

Since I have a "real" job which requires my full attention, and since it doesn't involve use of these products in any direct way, I am usually hard-pressed to find the personal time to conduct this research. Consequently, I always refuse to commit to follow-up or even error correction. It would be irresponsible of me to pretend it could be otherwise.
Even so, this is a revised report which contains a few error fixes and discusses the newly enhanced program which incorporates these fixes and sports some enhanced capabilities.


Contents

Introduction
1. Gross File Structure
   1.1 User Units
2. Locators
   2.1 Local Links
   2.2 Global Links
   2.3 Table Offsets
   2.4 Basic Relationships
3. Unit Header
   3.1 Description
   3.2 UNIT Size
4. Symbol Dictionaries
   4.1 Organization
   4.2 Interface Dictionary
   4.3 Debug Dictionary
   4.4 Dictionary Elements
       4.4.1 Hash Tables
             4.4.1.1 Size
             4.4.1.2 Scope
             4.4.1.3 Special Cases
       4.4.2 Dictionary Headers
       4.4.3 Dictionary Stubs
             4.4.3.1 Label Declaratives ("O")
             4.4.3.2 Un-Typed Constants ("P")
             4.4.3.3 Named Types ("Q")
             4.4.3.4 Variables, Fields, Typed Cons ("R")
             4.4.3.5 Subprograms & Methods ("S")
             4.4.3.6 Turbo Std Procedures ("T")
             4.4.3.7 Turbo Std Functions ("U")
             4.4.3.8 Turbo Std "NEW" Routine ("V")
             4.4.3.9 Turbo Std Port Arrays ("W")
             4.4.3.10 Turbo Std External Variables ("X")
             4.4.3.11 Units ("Y")
       4.4.4 Type Descriptors
             4.4.4.1 Scope
             4.4.4.2 Prefix Part
             4.4.4.3 Suffix Parts
                     4.4.4.3.1 Un-Typed
                     4.4.4.3.2 Structured Types
                               4.4.4.3.2.1 ARRAY Types
                               4.4.4.3.2.2 RECORD Types
                               4.4.4.3.2.3 OBJECT Types
                               4.4.4.3.2.4 FILE (non-TEXT) Types
                               4.4.4.3.2.5 TEXT File Types
                               4.4.4.3.2.6 SET Types
                               4.4.4.3.2.7 POINTER Types
                               4.4.4.3.2.8 STRING Types
                     4.4.4.3.3 Floating-Point Types
                     4.4.4.3.4 Ordinal Types
                               4.4.4.3.4.1 "Integers"
                               4.4.4.3.4.2 BOOLEANs
                               4.4.4.3.4.3 CHARs
                               4.4.4.3.4.4 ENUMERATions
                     4.4.4.3.5 SUBPROGRAM Types
5. Maps and Lists
   5.1 PROC Map
   5.2 CSeg Map
   5.3 Typed CONST DSeg Map
   5.4 Global VAR DSeg Map
   5.5 Donor Unit List
   5.6 Source File List
   5.7 DEBUG Trace Table
6. Code, Data, Fix-Up Info
   6.1 Object CSegs
   6.2 CONST DSegs
   6.3 Fix-Up Data Table
7. Supplied Program
   7.1 TPU6
       7.1.1 UNIT TPU6AMS
       7.1.2 UNIT TPU6EQU
       7.1.3 UNIT TPU6UTL
       7.1.4 UNIT TPU6RPT
       7.1.5 UNIT TPU6UNA
   7.2 Modifications
   7.3 Notes on Program Logic
       7.3.1 Formatting the Dictionary
       7.3.2 The Disassembler
8. Unit Libraries
   8.1 Library Structure
9. Application Notes
10. Acknowledgements
11. References
INDEX


Inside TURBO Pascal 6.0 Units
----------------------------------------------------------------------

INTRODUCTION

This document is the outcome of an inquiry conducted into the structure and content of Borland Turbo Pascal (Version 6.0) Unit files. The original purpose of the inquiry was to provide a body of theory enabling Cross-Reference programs to resolve references to symbols defined in .TPU files where qualification was not explicitly provided. As is so often the case, one thing led to another and the scope of the inquiry was expanded dramatically.

While this document should not be regarded as definitive, the author feels that the entire Turbo Pascal User community might gain from the information extracted from these files at the cost of so much time and effort.

The material contained herein represents the findings and interpretations of the author. A great deal of guess-work was required and no assurances are given as to the accuracy of either the findings of fact or the inferences contained herein which are the sole work-product of the author.
In particular, the author had access only to materials or information that any normal Borland customer has access to. Further, no Borland source code was available as the Library Routine source is not licensed to the author. In short, there was nothing irregular about how these findings were achieved.

The material contained herein is placed in the public domain free of copyright for use of the general public at its own risk. The author assumes no liability for any damages arising from the use of this material by others. If you make use of this information and you get burned, TOUGH! The author accepts no obligation to correct any such errors as may exist in the supplied programs or in the findings of fact or opinion contained herein.

On the other hand, this is not a "complete" work in that a great many questions remain open, especially as regards fine details. (The author is not highly-qualified in Intel 80xxx Assembly Language and several open questions might best be addressed by persons competent in this area.) The author welcomes the input of interested readers who might be able to "flesh-out" some of these open questions with "hard" answers.


1. GROSS FILE STRUCTURE

A Turbo Pascal Unit file consists of an array of bytes whose length is an exact multiple of sixteen (16). "Signature" information allows the compiler to verify that the .TPU file was compiled with the correct compiler version and to verify that the file is of the correct size. The fine structure of the file will be addressed in later sections at ever increasing levels of detail.
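
The two checks just mentioned can be sketched in a few lines of Python. This is only an illustration, not the compiler's actual logic: the "TPU9" signature and the five header words at +1C through +24 are taken from the Unit Header description in section 3, the size rule from section 3.2, and little-endian (Intel) byte order is assumed. The function names are hypothetical.

```python
def word_at(unit: bytes, offset: int) -> int:
    """Read one 16-bit WORD; little-endian (Intel) byte order is assumed."""
    return int.from_bytes(unit[offset:offset + 2], "little")


def round_up_16(n: int) -> int:
    """Lowest multiple of 16 that is greater than or equal to n."""
    return (n + 15) & ~15


def expected_unit_size(unit: bytes) -> int:
    """Sum of UHENC, UHZCS, UHZDT, UHZFA and UHZFT, each rounded up to 16."""
    offsets = (0x1C, 0x1E, 0x20, 0x22, 0x24)
    return sum(round_up_16(word_at(unit, off)) for off in offsets)


def looks_like_tp6_unit(unit: bytes) -> bool:
    """Plausibility test: header present, 'TPU9' signature, consistent size."""
    if len(unit) < 64 or unit[0:4] != b"TPU9":
        return False
    return len(unit) % 16 == 0 and len(unit) == expected_unit_size(unit)
```

A caller would read the whole .TPU file into a bytes object and pass it to looks_like_tp6_unit; a False result means only that the image does not match the layout described in this paper.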
----------------------------------------------------------------------
Rev: April 16, 1991                                             Page 5

Graphically, the file may be regarded as having the following general layout:

     +-------------------+
     |    Unit Header    |  Main Index to Unit File
     |-------------------|
     |   Dictionaries:   |
     |   a) Interface    |
     |   b) Debug *      |  For Local Symbol Access
     |-------------------|
     |   PROC Map        |
     |-------------------|
     |   CSeg Map *      |  May be Empty
     |-------------------|
     | CONST DSeg Map *  |  May be Empty
     |-------------------|
     |  VAR DSeg Map *   |  May be Empty
     |-------------------|
     |   Donor Units *   |  May be Empty
     |-------------------|
     |   Source Files    |
     |-------------------|
     |   Trace Table *   |  May be Empty
     |-------------------|
     | CODE Segment(s) * |  May be Empty
     |-------------------|
     | DATA Segment(s) * |  May be Empty
     |-------------------|
     |   FIX-UP Data *   |  May be Empty
     +-------------------+


1.1 USER UNITS

Units prepared by the compiler available to ordinary users have a very straight-forward appearance and content. There may even be a little "wasted" space that might be removed if the compiler were just a little cleverer. The SYSTEM.TPU file is quite another thing however.

The SYSTEM.TPU file (found in TURBO.TPL) is extraordinary in that great pains seem to have been taken to compact it. Further, it contains a great many types of entries that just don't seem to be achievable by ordinary users and I suspect that much (if not all) of it was "hand-coded" in Assembler Language.

In the following sections, the details of these optimizations will be explained in the context of the structural element then under discussion.


2. LOCATORS

The data in these files has need of structure and organization to support efficient access by the various programs such as the compiler, the linker and the debugger. This organization is built on a solid foundation of locators employed in the unit's data structures.


2.1 LOCAL LINKS

Local Links (LL's) are items of type WORD (2 bytes) which contain an offset which is relative to the origin of the unit file itself. This implies that a unit must be somewhat less than 64K bytes in size. If the .TPU file is loaded into the heap, then an LL can be used to locate any byte in the segment beginning with the load point of the file.


2.2 GLOBAL LINKS

Global Links (LG's) are used to locate type descriptors and to locate allocation data for variables with the ABSOLUTE attribute which may reside in other Units (i.e., units external to the present unit). LG's are structured items consisting of two (2) words. The first of these is an LL that is relative to the origin of the (possibly) external unit. It locates either a Type Descriptor or the stub of the Dictionary entry which establishes storage allocation. The second word is an LL which locates the stub of the unit entry in the current unit dictionary for the (possibly) external unit. This dictionary entry provides the name of the unit that contains the item the LG points to. This provides a handy mechanism for locating type descriptors and allocation information which may be defined in other separately compiled units.


2.3 TABLE OFFSETS

Finally, various data-structures within a .TPU file are organized as arrays of fixed-length records or as lists of variable-length records. Efficient access to such records is achieved by means of offsets rather than subscripts (an addressing technique denied Pascal). These offsets are relative to the origin of the array or list being referenced rather than the origin of the unit.
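
The three kinds of locator can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the unit is held in memory as a bytes object, words are little-endian as on Intel processors, and the function names are hypothetical rather than anything taken from the supplied TPU6 program.

```python
def read_ll(unit: bytes, offset: int) -> int:
    """Local Link: a 2-byte offset relative to the origin of the unit."""
    return int.from_bytes(unit[offset:offset + 2], "little")


def read_lg(unit: bytes, offset: int) -> tuple:
    """Global Link: two consecutive words.

    The first is an LL into the (possibly external) unit that owns the
    item; the second is an LL, valid in the current unit, to the stub of
    the dictionary entry naming that owning unit.
    """
    target_ll = read_ll(unit, offset)
    owner_stub_ll = read_ll(unit, offset + 2)
    return target_ll, owner_stub_ll


def table_entry_position(table_ll: int, table_offset: int) -> int:
    """Table offsets are relative to the table origin, not the unit origin."""
    return table_ll + table_offset
```

Note that the first word of an LG cannot be followed inside the current unit unless the second word shows that the owning unit is the current one; otherwise the named unit has to be loaded first.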


2.4 BASIC RELATIONSHIPS

+-------------+                       +----------------------+
|    Unit     |                       | INTERFACE Dictionary |
|   Header    |                       |                      |
+-------------+                       | Public and Private   |
  |                                   | Names, Nested Hash   |
  | LL      +----------------+ LL's   | Tables, INLINE code, |
  |-------->| INTERFACE Hash |------->| Type Descriptors.    |
  |         +----------------+        +----------------------+
  |                          (LL's ^ & LG's)     |
  |                                   +----------------------+
  | LL      +----------------+ LL's   | DEBUG Dictionary     |
  |-------->|   DEBUG Hash   |------->| IMPLEMENTATION and   |
  |         +----------------+        | nested scope names,  |
  |                                  ?| stored for DEBUG.    |
  | LL      +----------------+        | Same structure as in |
  |-------->| PROC Map Table |        | INTERFACE. Linked    |
  |         +----------------+        | to INTERFACE part by |
  | LL      +----------------+        | LL's. BUILT ONLY IF  |
  |-------->| CSeg Map Table |?       | LOCAL SYMBOLS ARE    |
  |         +----------------+        | ENABLED AT COMPILE.  |
  | LL      +----------------+        +----------------------+
  |-------->| DSeg Map CONST |?
  |         +----------------+
  | LL      +----------------+
  |-------->| DSeg Map VAR's |?
  |         +----------------+        IMPORTANT NOTES
  | LL      +----------------+        ----------------------
  |-------->| Donor Unit List|?       Some of the structures
  |         +----------------+        shown in this figure
  | LL      +------------------+      are built only if they
  |-------->| Source File List |      are needed. These are
  |         +------------------+      marked by a "?" next
  | LL      +------------------+      to the box.
  |-------->| Debug Step Ctls  |?
  |         +------------------+      If the DEBUG Dictionary
  | **      +---------------+         is missing, its LL
  |-------->| CODE Segments |?        leads directly to the
  |         +---------------+         INTERFACE Dictionary.
  | **      +-----------------+       ----------------------
  |-------->| CONST DATA Segs |?
  |         +-----------------+
  | **      +----------------+
  +-------->|  Fix-Up Lists  |?
            +----------------+

This figure illustrates the role of the Unit Header in tying together the various data structures in the Unit. The type of link is shown next to a flow-line by "LL", "LG" or "**". "LL" and "LG" are explicit pointers, while "**" marks a locator whose value is computed from other data in the Unit Header; no explicit pointer exists for it.

+----(from hash tables,other Dictionary Entries)
|
|    +------------------------------------------------+
|    | Header Part | Stub Part -- many formats        |
+--->| - - - - - - | - - - +------------------------- |
     |             | data, | Some stubs have embedded | Dictionary
     | Name, Class | links | Type Descriptors         | Entry
     | and link to | (see  | +-------------------     |
     | entries who | below)| | INLINE Declarative     |
     | have same   |   *   | | code bytes for a       |
     | hash        |   |   | | "macro" type PROC      |
     +-----------------|------------------------------+
            +----------+
            |
            | FAR pntr   +----------------------------+
            |----------->| Absolute Memory Locations  |
            |            +----------------------------+
            |            +-----------------------------+
            | LG's       | Type Descriptors and stubs  |
            |----------->| of Dictionary Entries used  |
            |            | for absolute equivalences   |
            |            +-----------------------------+
            |            +---------------------------------+
            | LL's       | Nested Scope Hash Tables        |
            |----------->| Parent Scope Dictionary Entries |
            |            | Record Fields                   |
            |            | Object Fields/Methods           |
            |            +---------------------------------+
            |            +----------------------+
            | Offsets    | CONST DSeg Map Table |
            +----------->| PROC Map Table       |
                         | VAR DSeg Map Table   |
                         +----------------------+

This figure illustrates the many types of entities that associate with Dictionary Entries and particularly with their Stub Parts.
Not all of the links shown occur in a single Stub format, but all of the links in the figure can and do exist in selected cases. The purpose here is to show the flexibility of the system of links in associating required data with the Dictionary Entry and its identifying symbol.

While it may not be apparent from the figure, the dictionary structure as a whole may be viewed as a cyclic directed graph which is rooted in the DEBUG Hash Table. The recursive properties exhibited by the node relationships permit direct support of the scope rules of Turbo Pascal with simplicity and elegance.

As one might expect, the representation of the required information lends itself to efficient use of storage since the representations are compact and there is very little in the way of redundancy. The small amount of redundancy that does exist is apparently aimed at speeding access to certain structures by the Turbo components (compiler, linker and debugger).

+----(implied links, explicit LG's from other structures)
|
|    +---------------------------------------------+
|    | Flags and codes, allocation widths for data | Type
+--->| and VMT's, subrange constraints, formal     | Descriptor
     | parameter descriptors, implicit associated  | Contents &
     | type descriptors, LL's, LG's and Offsets.   | Linkages
     +---------------------------------------------+
       |
       |
       | LG's          +------------------+
       |-------------->| Type Descriptors |
       |               +------------------+
       |
       |               +-------------------------------+
       | LL's          | Method Dictionary Entries     |
       |-------------->| Nested Scope Hash Tables      |
       |               | Nested Scope Field Chains     |
       |               | Parent Scope Dictionary Entry |
       |               +-------------------------------+
       |
       | Offsets       +----------------------------------+
       +-------------->| VMT pointers in Object Instances |
                       | CONST DSeg Map Table Entries     |
                       +----------------------------------+

This figure illustrates the relationships between Type Descriptors and other structures in the dictionary. Not all the links shown can exist with a single Type Descriptor since there are several variant forms of these descriptors (depending on base type) but in combination, these linkages are feasible. In addition to links, a great amount of data is stored which is peculiar to a given type declaration.

Descriptors can be -- and are -- shared. Indeed, they were designed with that in mind. Once a named type is declared, all entities that reference it are linked to it in some way (usually by an LG). Almost every form of type descriptor is found in the SYSTEM unit and this fact is used to advantage. When un-typed constants are declared, a built-in type descriptor is referenced (via an LG) which provides necessary information for maintenance of orderly dictionary structure. When a named-type is declared, it is almost always decomposed into an expression based on the built-in types of Turbo Pascal which are found in the SYSTEM unit with the aid of an LG.

The semantics underlying the idea of the Unit mandate this very approach since program modules of any class which make references to units for definitions use the definitions as implemented by the unit which contains them. Re-defining the unit or any of its defined types leads to a natural requirement to re-compile those program modules which rely on the unit for definitions.
The impact is fundamental since the storage representation of a unit-defined named type can change in quite radical ways.


3. UNIT HEADER

The Unit Header comprises the first 64 bytes of the .TPU file. It contains LL's that effectively locate all other sections of the .TPU file plus statistics that enable a little cross-checking to be performed. Some parts of the Unit Header appear to be reserved for future use since no unit examined by this author has ever contained non-zero data in these apparently reserved fields.


3.1 DESCRIPTION

The Unit Header provides a high-level locator table whereby each major structure in the unit file can be addressed. The following provides a Pascal-like explanation of the layout of the header followed by further narrative discussion of the contents of the individual fields in the Unit Header.

Type
  HdrAry = Array[0..3] of Char;
  LL     = Word;

  UnitHeader = Record
    UHEYE : HdrAry;               { +00 : = 'TPU9'                        }
    UHxxx : HdrAry;               { +04 : = $00000000                     }
    UHUDH : LL;                   { +08 : to Dictionary Head-This Unit    }
    UHIHT : LL;                   { +0A : to Hash Table (INTERFACE)       }
    UHPMT : LL;                   { +0C : to PROC Map                     }
    UHCMT : LL;                   { +0E : to CSeg Map                     }
    UHTMT : LL;                   { +10 : to DSeg Map-Typed CONST's       }
    UHDMT : LL;                   { +12 : to DSeg Map-GLOBAL Variables    }
    UHxxy : LL;                   { +14 : Purpose Unknown                 }
    UHLDU : LL;                   { +16 : to Donor Unit List              }
    UHLSF : LL;                   { +18 : to Source file List             }
    UHDBT : LL;                   { +1A : to Debug Trace Step Controls    }
    UHENC : LL;                   { +1C : to end non-code part of Unit    }
    UHZCS : Word;                 { +1E : Size of CSEGs (aggregate)       }
    UHZDT : Word;                 { +20 : Size of Typed Constant Data     }
    UHZFA : Word;                 { +22 : Fix-Up Bytes (CSegs)            }
    UHZFT : Word;                 { +24 : Fix-Up Bytes (Typed CONST's)    }
    UHZFV : Word;                 { +26 : Size of GLOBAL VAR Data         }
    UHDHT : LL;                   { +28 : to Hash Table (DEBUG)           }
    UHSOV : Word;                 { +2A : Overlay Involved if non-zero    }
    UHPad : Array[0..9] of Word;  { +2C : Reserved for Future Expansion   }
  End; { UnitHeader }

UHEYE   contains the characters "TPU9" in that order. This is clear evidence that this unit was compiled by Turbo Pascal Version 6.0.

UHxxx   is apparently reserved and contains binary zeros.

UHUDH   contains an LL (WORD) which points to the Dictionary Header in which the name of this unit is found.

UHIHT   contains an LL (WORD) which points to a Hash table that is the root of the Interface Dictionary graph.

UHPMT   contains an LL (WORD) which points to the PROC Map for this unit. The PROC Map contains an entry for each Procedure or Function declared in the unit (except for INLINE types), plus an entry for the Unit Initialization section. The length of the PROC Map (in bytes) is determined by subtracting UHPMT from UHCMT.

UHCMT   contains an LL (WORD) which points to the CSeg (CODE Segment) Map for this unit. The CSeg Map contains an entry for each CODE Segment produced by the compiler plus an entry for each of the CODE Segments included via the {$L filename.OBJ} compiler directive. The length of this Map (in bytes) is obtained by subtracting UHCMT from UHTMT. The result may be zero in which case the CSeg Map is empty.

UHTMT   contains an LL (WORD) which points to the DSeg (DATA Segment) Map that maps the initializing data for Typed CONST items plus templates for VMT's (Virtual Method Tables) that are associated with OBJECTS which employ Virtual Methods. The length of this Map (in bytes) is obtained by subtracting UHTMT from UHDMT. The result may be zero in which case this DSeg Map is empty.

UHDMT   contains an LL (WORD) which points to the DSeg (DATA Segment) Map that contains the specifications for DSeg storage required by VARiables whose scope is GLOBAL. The length of this Map (in bytes) is obtained by subtracting UHDMT from UHxxy. The result may be zero in which case this DSeg Map is empty.

UHxxy   Purpose of this word is unknown. No non-zero values have ever been observed here. (May be for TP-Windows?)

UHLDU   contains an LL (WORD) which points to a table of units which contribute either CODE or DATA Segments to the .EXE file for a program using this Unit. This is called the "Donor Unit Table". The length of this table (in bytes) is obtained by subtracting UHLDU from the word UHLSF. The result may be zero in which case this table is empty.

UHLSF   contains an LL (WORD) which points to a list of "source" files. These are the files whose CODE or DATA Segments are included in this Unit by the compiler. Examples are the Pascal Source for the Unit itself, plus the .OBJ files included via the {$L filename.OBJ} compiler directive. The length of this table (in bytes) is obtained by subtracting UHLSF from the word UHDBT. The result may be zero in which case this table is empty.

UHDBT   contains an LL (WORD) which points to a Trace Table used by the DEBUGGER for "stepping" through a Function or Procedure contained in this Unit. The length of this table (in bytes) is obtained by subtracting UHDBT from the word UHENC. The result may be zero in which case this table is empty.

UHENC   contains an LL (WORD) which points to the first free byte which follows the Trace Table (if any). It serves as a delimiter for determining the size of the Trace Table. This LL (when rounded up to the next integral multiple of 16) serves to locate the start of the code/data segments.

UHZCS   is a WORD that contains the total byte count of all CODE Segments compiled into this Unit.

UHZDT   is a WORD that contains the total byte count of all Typed CONST and VMT DATA Segments compiled into this unit.

UHZFA   is a WORD that contains the total byte count of the Fix-Up Data Table for this unit for CODE (CSegs).

UHZFT   is a WORD that contains the total byte count of the Fix-Up Data Table for Typed CONST's. This usually implies that a VMT is getting its pointers relocated.

UHZFV   is a WORD that contains the total byte count of all GLOBAL VAR DATA Segments compiled into this unit.

UHDHT   contains an LL (WORD) which points to a Hash Table which is the root of the DEBUGGER Dictionary. If Local Symbols were generated by the compiler (directive {$L+}) then ALL symbols declared in the unit can be accessed from this Hash Table. If Local Symbols were suppressed there is no such Dictionary and the LL stored here points to the INTERFACE Dictionary.

UHSOV   Purpose of this word is unknown. It has been observed to be non-zero when overlay directives are used. So far however, this hasn't enabled me to come up with a good guess as to just what the observed values actually mean.

UHPad   begins a series of ten (10) words that are apparently reserved for future use. Nothing but zeros have ever been seen here by this author.


3.2 UNIT SIZE

An independent check on the size of the .TPU file is available using information contained in the Unit Header. This is also important for .TPL (Unit Library) organization.

To compute the file size, refer to the five (5) words -- UHENC, UHZCS, UHZDT, UHZFA, and UHZFT. Round the contents of each of these words to the lowest multiple of 16 that is greater than or equal to the content of that word. Then form the sum of the rounded words. This is the .TPU file size in bytes.


4. SYMBOL DICTIONARIES

This area contains all available documentation of declared symbols and procedure blocks defined within the unit. Depending on compiler options in effect when the unit was compiled, this section will contain at a minimum, the INTERFACE declarations, and at a maximum, ALL declarations. The information stored in the dictionary is highly dependent on the context of the symbol declared. We defer further explanation to the appropriate section which follows.


4.1 ORGANIZATION

A dictionary is organized with a Hash Table as its root. The hash table is used to provide rapid access to identifiers. A dictionary may be thought of as a directed graph. Each subgraph is rooted in a hash table. There may be a great many hash tables in a given unit and their number depends on unit complexity as well as the options chosen when the unit was compiled. Use of the {$L+} directive produces the largest dictionaries.

The hash tables are explained in detail a few sections further on. Hash tables point to Dictionary Headers.
When two or more symbols produce the same hash function result, a collision is said to occur. Collisions are resolved by the time-honored method of chaining together the Dictionary Headers of those symbols having the same hash function result. Dictionary supersetting is accomplished using these chains.


4.2 INTERFACE DICTIONARY

The INTERFACE dictionary contains all symbols and the necessary explanatory data for the INTERFACE section of a Unit. Symbols get added to the Unit using increasing storage addresses until the IMPLEMENTATION section is encountered.


4.3 DEBUG DICTIONARY

The Debug dictionary (if present) is a superset of the INTERFACE dictionary. It is used by the Turbo Debugger to support its many features when tracing through a unit. If present, this dictionary is rooted in its own hash table. The hash table is effectively initialized when the IMPLEMENTATION keyword is processed by the compiler. This takes the form (initially) of an unmodified copy of the INTERFACE hash table, to which symbols are added in the usual fashion. Thus, the hash chains constructed or extended at this time lead naturally to the INTERFACE chains and this is how the superset is effectively implemented.


4.4 DICTIONARY ELEMENTS

The dictionary contains four major elements. These are: hash tables, Dictionary Headers, Dictionary Stubs and Type Descriptors.

The distinction between Dictionary Headers and Stubs might appear to be rather arbitrary. They might just as easily be regarded as a single element (such as symbol entry). However, the case for the separate entity approach is strong since Stubs are DIRECTLY addressed via LG's and -- more to the point -- ONLY by LG's. Thus, it seems reasonable that this is a separate and very important structure -- at least in the minds of the architects at Borland.


4.4.1 HASH TABLES

As has been intimated, Hash Tables are the glue that binds the dictionary entries together and gives the dictionary its "shape". They effectively implement the scope rules of the language and speed access to essential information.

Each Hash table begins with a 2-byte size descriptor. This descriptor contains the number of bytes in the table proper (less 2). Thus, the descriptor directly points to the last bucket in the hash table. For a hash table of 128 bytes, the size descriptor contains 126. The first bucket in the table immediately follows the size descriptor.


4.4.1.1 SIZE

So far, three different hash table sizes have been observed. The INTERFACE and DEBUG hash tables are usually 128 bytes (64 entries) in size plus 2 bytes of size description, but the SYSTEM.TPU unit is a special case, containing only 16 entries. Hash tables which anchor subgraphs whose scope is relatively local usually contain four (4) entries (8 bytes).

Graphically, a Hash Table with four slots has the following layout:

     +--------------------+
     |       0006h        |  Size Descriptor
     |--------------------|
     |       slot 0       |  an LL or zero
     |--------------------|
     |       slot 1       |  an LL or zero
     |--------------------|
     |       slot 2       |  an LL or zero
     |--------------------|
     |       slot 3       |  an LL or zero
     +--------------------+

It should be noted that the Size Descriptor furnishes an upper bound for the hash function itself. Thus, it seems possible that a single hash function is used for all hash tables and that its result is ANDed with the Size Descriptor to get the final result. Because the sizes are chosen as they are (powers of 2) this is feasible.
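
The masking idea conjectured above can be sketched in Python. This is speculative by construction: the hash function itself is not known, so raw_hash below is simply a hypothetical input, and the AND step is the author's conjecture rather than confirmed compiler behaviour. Little-endian words are assumed and the function names are illustrative only.

```python
def read_word(unit: bytes, offset: int) -> int:
    """Read one 16-bit WORD; little-endian (Intel) byte order is assumed."""
    return int.from_bytes(unit[offset:offset + 2], "little")


def slot_count(unit: bytes, table_ll: int) -> int:
    """Size Descriptor = 2 * (n - 1), so n = descriptor / 2 + 1."""
    return read_word(unit, table_ll) // 2 + 1


def bucket_position(unit: bytes, table_ll: int, raw_hash: int) -> int:
    """Position of the bucket selected by a (hypothetical) raw hash value.

    ANDing with the Size Descriptor clears bit 0 and all high bits,
    leaving an even byte offset that can never run past the last bucket.
    """
    size = read_word(unit, table_ll)
    return table_ll + 2 + (raw_hash & size)


def bucket_ll(unit: bytes, table_ll: int, raw_hash: int) -> int:
    """The LL (or zero) stored in the selected bucket."""
    return read_word(unit, bucket_position(unit, table_ll, raw_hash))
```

For a four-slot table the descriptor is 6 (binary 110), so any raw value is reduced to one of the byte offsets 0, 2, 4 or 6; for the usual 64-slot table the descriptor is 126 and the same code applies unchanged.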
Note that in the above example, 6 = 2 * (n - 1) where n = 4 {slot count}. All of the hash tables observed so far have this property.

One final note on this subject. Given these properties, "Folding" of sparse hash tables is a rather trivial exercise so long as the new hash table also contains a number of slots that is a power of 2. This point is intriguing when one recalls that the SYSTEM.TPU hash table has only 16 slots rather than the usual 64.

4.4.1.2 SCOPE

The INTERFACE and Debug dictionary hash tables are Global in Scope even though the symbols accessed directly via either hash table may be private. On the other hand, other hash tables are purely local in scope. For example, the fields declared within a record are reached via a small local hash table, as are the arguments and local variables declared within procedures and functions. Even OBJECTS use this technique to provide access to Methods and Object Fields. Access to such local scope fields/methods requires use of qualified names which ensures conformity to Pascal scope rules. The method is truly simple and elegant.

4.4.1.3 SPECIAL CASES

The SYSTEM.TPU Unit is a special case. Its INTERFACE hash table has apparently been "hand-tuned" for small size and it contains only sixteen (16) entries. In addition, the Debug hash table is absent since there is no local symbol generation in this unit. Therefore, the Debug hash table does not exist as a separate entity, its function being served by the INTERFACE hash table. The pointer to the Debug hash table (in the Unit Header) has the same value as the pointer to the INTERFACE hash table.

4.4.2 DICTIONARY HEADERS

This is the structure that anchors all information known by the compiler about any symbol.
The format is as follows:

+00: An LL which points to the next (previous) symbol in the same unit which had the same hash function value.

+02: A character that defines the category the symbol belongs to and defines the format of the Dictionary Stub which follows the Dictionary Header. If the symbol is declared in the component list of the "private" part of an Object declaration, then this character is modified by adding $80 to its ordinal value. Thus, an ordinary Function, Procedure or Method is of category "S" while a private Method is of category Chr(Ord('S')+$80).

+03: A String (in the Pascal sense) of variable size that contains the text of the symbol (in UPPER-CASE letters only). The SizeOf function is not defined for these strings since they are truncated to match the symbol size. The "value" of the SizeOf function can be determined by adding 1 to the first byte in the string. Thus, Ord(Symbol[0])+1 is the expression that defines the Size of the symbol string.

Turbo Pascal defines a symbol as a string of relatively arbitrary size, the most significant 63 characters of which will be stored in the dictionary. Thus, we conclude that the maximum size of such a string is 64 bytes.

4.4.3 DICTIONARY STUBS

Dictionary Stubs immediately follow their respective headers and their format is determined by the category character in the Dictionary Header. The function of the stub is to organize the information appropriate to the symbol and provide a means of accessing additional information such as type descriptors, constant values, parameter lists and nested scopes.

The format of each Stub is presented in the following sub-sections.

4.4.3.1 LABEL DECLARATIVES ("O")

This Stub consists of a WORD whose function is (as yet) unknown.
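Before moving on to the remaining stubs, the hash table layout of section 4.4.1 and the Dictionary Header layout of section 4.4.2 can be illustrated with a short Python sketch. It is a reading aid only, written under stated assumptions: the whole unit is held in a bytes object, an LL is taken to be a 16-bit little-endian offset from the start of that image, and the helper names (read_hash_table, read_header, walk_chain) are hypothetical. They are not part of any Borland tool or of the programs supplied with this report.

```python
import struct

def read_word(image, offset):
    """Read a 16-bit little-endian word (Intel byte order assumed)."""
    return struct.unpack_from("<H", image, offset)[0]

def read_hash_table(image, offset):
    """Return the bucket LLs of the hash table found at `offset`."""
    size = read_word(image, offset)      # bytes in the table proper, less 2
    slots = size // 2 + 1                # e.g. 0006h -> 4 slots
    return [read_word(image, offset + 2 + 2 * i) for i in range(slots)]

def read_header(image, offset):
    """Decode one Dictionary Header and note where its Stub begins."""
    next_ll = read_word(image, offset)   # +00: next symbol on the chain
    cat = image[offset + 2]              # +02: category character
    length = image[offset + 3]           # +03: Pascal string length byte
    name = image[offset + 4:offset + 4 + length].decode("ascii")
    return {
        "next": next_ll,
        "category": chr(cat & 0x7F),     # strip the "private" bit
        "private": bool(cat & 0x80),     # $80 added for private symbols
        "name": name,
        "stub": offset + 3 + length + 1, # Stub follows the name string
    }

def walk_chain(image, first_ll):
    """Yield each header on one collision chain; an LL of zero ends it."""
    ll = first_ll
    while ll != 0:
        header = read_header(image, ll)
        yield header
        ll = header["next"]
```

Walking every non-zero bucket returned by read_hash_table with walk_chain would visit each symbol reachable from that table. Which bucket a given name hashes to is not addressed here, since the hash function itself is not known.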
4.4.3.2 UN-TYPED CONSTANTS ("P")

This Stub consists of two (2) fields:

+00: An LG which points to a Type Descriptor (usually in SYSTEM.TPU). This establishes the minimum storage requirement for the constant. The rules vary with the type, but the size of the constant data field (which follows) is defined using the Type Descriptor(s).

+04: The value of the constant. For ordinal types, this value is stored as a LONGINT (size=4 bytes). For Floating-Point types, the size is implicit in the type itself. For String types, the size is determined from the length of the string which is stored in the initial byte of the constant.

4.4.3.3 NAMED TYPES ("Q")

This Stub consists of an LG (4-bytes) that points to the Type Descriptor for this symbol.

4.4.3.4 VARIABLES, FIELDS, TYPED CONS ("R")

This Stub contains information required to allocate and describe these types of entities. The format and content is as follows:

+00: A one-byte flag that precisely identifies the class of the item being described. The known values and their apparent meanings follow:

      $00 -> Global Variables (Allocated in DS);
      $01 -> Typed Constants (Allocated in DS);
      $02 -> Procedure LOCAL Variables on STACK;
      $03 -> Variables at Absolute Addresses;
      $06 -> ADDRESS Arguments allocated on STACK;
             (This is now used only for SELF in Method calls;)
      $08 -> Fields sub-allocated in RECORDS and OBJECTS,
             plus METHODS declared for OBJECTS.
      $10 -> Variable Equivalenced to another via the Absolute Clause;
      $22 -> Arguments whose VALUEs are passed on the stack;
      $26 -> Arguments whose ADDRESSes are passed on the stack.

+01: Two words whose content varies with the codes above. Their content is explained following the last item in the stub.

+05: An LG that locates the proper Type Descriptor for this symbol.
When the code byte at +00 is $02, $06, $22 or $26 (stack-resident locals and arguments), the two words at +01 are used as follows:

      +01 Word -- Offset relative to either DS or BP.
      +03 Word -- LL to Dict Header of Parent Scope, or zero.

If the code byte is $00 or $01 (VAR's or typed CONSTs), then we have:

      +01 Word -- Offset relative to allocation area origin;
      +03 Word -- Offset to entry in VAR/CONST Map for item allocation;

When the code byte is $03 (Absolute Address Variable), then we have:

      +01 DWord -- FAR Pointer to Absolute Memory Address.

When the code byte is $08 (Record/Object Fields/Methods), then we have:

      +01 Word -- Allocation Offset within Record/Object;
      +03 Word -- LL to next Field/Method.

When the code byte is $10 (Absolute Equivalences), then we have:

      +01 DWord -- LG to STUB of variable/parameter declaration that actually establishes the allocation;

4.4.3.5 SUBPROGRAMS & METHODS ("S")

Subprograms (PROC's), especially since Object Methods are supported, have a rather involved stub. Its format is as follows:

+00: A byte that contains bit-switches that seem to describe the Call Model and imply the size of this stub. These switches determine what kind of code (if any) is generated when the PROC is referenced. The observed values are as follows:

      xxxxx001 -> PROC uses FAR Call Model;
      xxxx0010 -> PROC uses INLINE Model (no Call);
      xxxx0100 -> PROC uses INTERRUPT Model (no Call);
      xxxx100x -> PROC has EXTERNAL attribute;
      xxx1xxxx -> PROC uses METHOD Call Model;
      x011xxxx -> PROC is a CONSTRUCTOR Method;
      x101xxxx -> PROC is a DESTRUCTOR Method;
      1xxxxxxx -> PROC has ASSEMBLER directive.

+01: A byte whose function is not yet known. (TP Windows?)

+02: A Word whose interpretation depends on whether or not we have an INLINE Declarative Subprogram.
If this is an INLINE Declarative Subprogram, then this word contains the byte-count of the INLINE code text at the end of this stub. Otherwise, this word is the offset within the PROC Map that locates the object code for this Subprogram.

+04: A Word that contains an LL which locates the containing scope in the dictionary, or zero if none.

+06: A Word that contains an LL which locates the local Hash Table for this scope. A local hash table provides access to all formal parameters of the Subprogram as well as all Symbols whose declarations are local to the scope of this Subprogram.

+08: A Word that is zero unless the symbol is a Virtual Method. In that case, the content is the offset within the VMT of the owning object at which the FAR POINTER to this Virtual Method is stored.

+0A: A complete Type-Descriptor for this Subprogram. The length is variable and depends upon the number of Formal Parameters declared in the header. (See 4.4.4.3.5).

+??: If this Symbol represents an INLINE Declarative Subprogram, then the object-code text begins here. The byte-count of the text occurs at offset 0002h in this stub.

4.4.3.6 TURBO STD PROCEDURES ("T")

This Stub consists of two bytes, the first of which is unique for each procedure and increments by 4. I have found nothing in the SYSTEM unit (which is where this entry appears) that this seems directly related to. The second byte is always zero.

4.4.3.7 TURBO STD FUNCTIONS ("U")

This Stub consists of two bytes, the first of which is unique for each function and increments by 4. I have found nothing in the SYSTEM unit (which is where this entry appears) that this seems directly related to.
I wouldn't be surprised if this byte were an index into a TURBO compiler table that points to specialized parse tables/action routines for handling these functions and their non-standard parameter lists. The second byte seems to be a flag having the values $00, $40 and $C0. I strongly suspect that the flag $C0 marks exactly those functions which may be evaluated at compile-time. The meaning behind the other values is not known to me.

4.4.3.8 TURBO STD "NEW" ROUTINE ("V")

This Stub consists of a WORD whose function is (as yet) unknown. This is the only Standard Turbo routine that can behave as a procedure as well as a function (returning a pointer value).

4.4.3.9 TURBO STD PORT ARRAYS ("W")

This Stub consists of a byte whose value is 0 for byte arrays, and 1 for word arrays.

4.4.3.10 TURBO STD EXTERNAL VARIABLES ("X")

This Stub consists of an LG (4-bytes) that points to the Type Descriptor for this symbol.

4.4.3.11 UNITS ("Y")

Unit Stubs have the following content:

+00: A Word apparently reserved for use by the Compiler or Linker.

+02: A Word that seems to contain some kind of "signature" used to detect inconsistent Unit Versions. Borland calls this a "unit version number, which is basically a checksum of the interface part." I have seen a thread in CIS which says that it is a CRC value. Food for thought?

+04: A Word that contains an LL which locates the Successor Unit in the "Uses" list. In fact, the "Uses" lists of both the INTERFACE and IMPLEMENTATION sections of the Unit are merged by this Word into a single list. A value of zero is used to indicate no successor.

+06: A Word that contains an LL which locates the Predecessor Unit in the "Uses" list. For the SYSTEM unit entry, this value is always zero to indicate no predecessor.
For the Unit being compiled, this LL locates the final Unit in the combined "Uses" list.

In effect, the two LL's at offsets 0004 and 0006 organize the units into both forward and backward linked chains. The entry for the unit being compiled is effectively the head of both the forward and the backward chains. The final unit in the merged "Uses" list is the tail of the forward chain, and the SYSTEM unit is the tail of the backward chain.

4.4.4 TYPE DESCRIPTORS

Type Descriptors store much of the semantic information that applies to the symbols declared in the unit. Implementation details can be managed using high-level abstractions and these abstractions can be shared.

4.4.4.1 SCOPE

Type Descriptor sharing can occur across the boundaries which are implicit in unit modules. Thus, a type defined in one unit may be "imported" by some other module. Also, the pre-defined Pascal Types (plus the Turbo Pascal extensions) are defined in the SYSTEM.TPU unit and there needs to be a means of "importing" such Type Descriptors during compilation. This is precisely the objective of the LG locator which was described in section 2.2 (above).

Type Descriptors are NEVER copied between units. The binding always occurs by reference at compile time and this helps support the technique of modifying a unit and compiling it to a .TPU file, then re-compiling all units/programs that "USE" it.

Type Descriptors have many roles so their format varies. We have divided these structures into two parts: the PREFIX Part, which is always present and whose format is fairly constant, and the SUFFIX Part, whose content and format depend on the attributes that are part of the type definition.

4.4.4.2 PREFIX PART

The Prefix Part of every Type Descriptor consists of six (6) bytes.
The usage is consistent for all types observed by this author and the format is as follows:

+00: A Byte that identifies the format of the Suffix part. This is essentially based on several high-level categories which the Suffix Parts support directly. The observed set of values is as follows:

      00h -> an un-typed entity;
      01h -> an ARRAY type;
      02h -> a RECORD type;
      03h -> an OBJECT type;
      04h -> a FILE type (other than TEXT);
      05h -> a TEXT File type;
      06h -> a SUBPROGRAM type;
      07h -> a SET type;
      08h -> a POINTER type;
      09h -> a STRING type;
      0Ah -> an 8087 Floating-Point type;
      0Bh -> a REAL type;
      0Ch -> a Fixed-Point ordinal type;
      0Dh -> a BOOLEAN type;
      0Eh -> a CHAR type;
      0Fh -> an Enumerated ordinal type.

+01: A Byte used as a modifier. Since the above scheme is too general for machine-dependent details such as storage width and sign control, this modifier byte supplies additional data. The author has identified several cases in which this information is vital but has not spent very much time on the subject. The chief areas of importance seem to be in the 8087 Floating-Point types, and the Fixed-Point ordinal types. The semantics seem to be as follows:

      0A 00 -> The type "SINGLE"
      0A 02 -> The type "EXTENDED"
      0A 04 -> The type "DOUBLE"
      0A 06 -> The type "COMP"
      0C 00 -> an un-named BYTE integer
      0C 01 -> The type "SHORTINT"
      0C 02 -> The type "BYTE"
      0C 04 -> an un-named WORD integer
      0C 05 -> The type "INTEGER"
      0C 06 -> The type "WORD"
      0C 0C -> an un-named double-word integer
      0C 0D -> The type "LONGINT"

One important feature of the above semantics is the fact that an un-typed CONST declaration refers to the above two bytes to determine the storage space needed in the dictionary for the data value of the constant.
This can be a little involved, however, as the constant may contain its own length descriptor (as in a string), in which case it may be sufficient to identify the high-level type category without any modifier byte.

+02: A Word that contains the number of bytes of storage that are required to contain an object/entity of this type. For types that represent variable-length objects/entities such as strings, this word may define the value returned by the SIZEOF function as applied to the type.

+04: A Word that is zero unless the descriptor is for an Object Method. In this case, the content is an LL to the Dictionary Header of the SUCCEEDING Method for the Object, in order of declaration, or zero if none.

4.4.4.3 SUFFIX PARTS

Suffix Parts further refine the implementation details of the type and also provide subrange constraints where appropriate. In some cases the Suffix part is empty since all semantic data for the type is contained in the Prefix part.

4.4.4.3.1 UN-TYPED

This Suffix Part is empty. Nothing is known about an un-typed entity.

4.4.4.3.2 STRUCTURED TYPES

The structured types represent aggregates of lower-level types. We include ARRAY, RECORD, OBJECT, FILE, TEXT, SET, POINTER and STRING types in this category.

4.4.4.3.2.1 ARRAY TYPES

The Suffix Part of the ARRAY type is so constructed as to be able to support recursive or nested definition of arrays. The suffix format is as follows:

+00: An LG that locates the Type Descriptor for the "base-type" of the array. This is the type of the entity being arrayed (which may itself be an array).

+04: An LG that locates the Type Descriptor for the array bounds which is a constrained ordinal type or subrange.

4.4.4.3.2.2 RECORD TYPES

RECORD types have nested scopes.
The Suffix part provides a base structure by which to locate the fields local to the scope of the Record type itself. The format is as follows:

+00: A Word containing an LL which locates the local Hash Table that provides access to the fields in the nested scope.

+02: A Word containing an LL which locates the Dictionary Header of the initial field in the nested scope. This supports a "left-to-right" traversal of the fields in a record.

4.4.4.3.2.3 OBJECT TYPES

OBJECT types also have nested scopes. The Suffix part provides a base structure by which to locate the fields and METHODS local to the scope of the OBJECT type itself. In addition, inheritance and VMT particulars are stored. The format is as follows:

+00: A Word containing an LL which locates the local Hash Table that provides access to the fields and METHODS local to the nested scope.

+02: A Word containing an LL which locates the Dictionary Header of the initial field or METHOD in the nested scope. This supports a "left-to-right" traversal of the fields and METHODS in an OBJECT.

+04: An LG which locates the Type Descriptor of the Parent Object. This field is zero if there is no such Parent.

+08: A Word which contains the size in bytes of the VMT for this Object. This field is zero if the object employs no Virtual Methods, Constructors or Destructors.

+0A: A Word which contains the offset within the CONST DSeg Map that locates the VMT skeleton or template segment. This field equals FFFFh if the object employs no Virtual Methods, Constructors or Destructors.

+0C: A Word which contains the offset within an Object instance where the NEAR POINTER to the VMT for the object is stored (within the DATA SEGMENT). This field equals FFFFh if the object employs no Virtual Methods, Constructors or Destructors.
+0E: A Word which contains an LL which locates the Dictionary Header for the name of the OBJECT itself.

+10: A Word (not yet understood) containing $FFFF.

+12: Three Words (not yet understood) containing zeroes.

4.4.4.3.2.4 FILE (NON-TEXT) TYPES

This Suffix consists of an LG that locates the Type Descriptor of the base type of the file. Note that the Type Descriptor may be that of an un-typed entity (for un-typed files).

4.4.4.3.2.5 TEXT FILE TYPES

This Suffix consists of an LG that locates the Type Descriptor of the base type of the file -- in this case SYSTEM.CHAR.

4.4.4.3.2.6 SET TYPES

This Suffix consists of an LG that locates the base-type of the set itself. Pascal limits such entities to simple ordinals whose cardinality is limited to 256.

4.4.4.3.2.7 POINTER TYPES

This Suffix consists of an LG that locates the base-type of the entity pointed at.

4.4.4.3.2.8 STRING TYPES

This is a special case of an ARRAY type. The format is as follows:

+00: An LG to the Type Descriptor SYSTEM.CHAR which is the base type of all Turbo Pascal Strings.

+04: An LG to the Type Descriptor for the array bounds constraints for the string. When the unconstrained STRING type is used, this points to SYSTEM.BYTE which is defined as a subrange 0..255.

4.4.4.3.3 FLOATING-POINT TYPES

The Suffix part for all Floating-Point types is EMPTY. All data needed to specify these approximate number types is contained in the Prefix part. The Types included in this class are SINGLE, DOUBLE, EXTENDED, COMP and REAL.
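Because a Floating-Point type is described entirely by its Prefix, decoding the six Prefix bytes of section 4.4.4.2 is enough to identify it. The Python sketch below shows one way to do so. It assumes little-endian words and a unit image held in a bytes object; the names CATEGORY, FLOAT_MODIFIER and read_prefix are hypothetical and serve only as illustration.

```python
import struct

# Category byte at +00, as tabulated in section 4.4.4.2.
CATEGORY = {
    0x00: "un-typed", 0x01: "ARRAY", 0x02: "RECORD", 0x03: "OBJECT",
    0x04: "FILE", 0x05: "TEXT", 0x06: "SUBPROGRAM", 0x07: "SET",
    0x08: "POINTER", 0x09: "STRING", 0x0A: "8087 float", 0x0B: "REAL",
    0x0C: "fixed-point ordinal", 0x0D: "BOOLEAN", 0x0E: "CHAR",
    0x0F: "enumerated",
}

# Modifier byte at +01 for the 8087 types (category 0Ah).
FLOAT_MODIFIER = {0x00: "SINGLE", 0x02: "EXTENDED",
                  0x04: "DOUBLE", 0x06: "COMP"}

def read_prefix(image, offset):
    """Decode the 6-byte Prefix Part of a Type Descriptor."""
    cat, mod, size, next_method = struct.unpack_from("<BBHH", image, offset)
    name = CATEGORY.get(cat, "unknown")
    if cat == 0x0A:                      # refine 8087 types by modifier
        name = FLOAT_MODIFIER.get(mod, name)
    return {
        "category": cat,
        "modifier": mod,
        "name": name,
        "size": size,                    # +02: storage size in bytes
        "next_method": next_method,      # +04: LL, zero unless a Method
        "suffix": offset + 6,            # Suffix Part (may be empty)
    }
```

For example, the six bytes 0A 04 08 00 00 00 would decode as the type DOUBLE with a storage size of 8 bytes.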
4.4.4.3.4 ORDINAL TYPES

The Ordinal Types consist of the various "integer" types plus the BOOLEAN, CHAR and Enumerated types.

4.4.4.3.4.1 "INTEGERS"

These types include BYTE, SHORTINT, WORD, INTEGER and LONGINT. Their Suffix parts are identical in format:

+00: A double-word containing the LOWER bound of the subrange constraint on the type;

+04: A double-word containing the UPPER bound of the subrange constraint on the type;

+08: An LG that locates the Type Descriptor of the largest upward compatible type. This is the Type Descriptor that is used to control the width of an un-typed constant in the dictionary stub. For the "integer" types, this is an LG to SYSTEM.LONGINT.

4.4.4.3.4.2 BOOLEANS

This type Suffix has the following format:

+00: A double-word containing the LOWER bound of the subrange constraint on the type;

+04: A double-word containing the UPPER bound of the subrange constraint on the type;

+08: An LG that locates the Type Descriptor SYSTEM.BOOLEAN. There is no "upward compatible" type.

4.4.4.3.4.3 CHARS

This type Suffix has the following format:

+00: A double-word containing the LOWER bound of the subrange constraint on the type;

+04: A double-word containing the UPPER bound of the subrange constraint on the type;

+08: An LG that locates the Type Descriptor SYSTEM.CHAR. There is no "upward compatible" type.
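The integer, BOOLEAN and CHAR Suffix parts above share one 12-byte shape, which the following Python sketch decodes. Two points are assumptions rather than findings of this report: the bounds are read as signed 32-bit values (which suits a type such as SHORTINT with a negative lower bound), and the LG is returned as two raw 16-bit words without interpretation, since its internal layout belongs to section 2.2. The names read_ordinal_suffix and in_range are hypothetical.

```python
import struct

def read_ordinal_suffix(image, offset):
    """Decode an ordinal Suffix: lower bound, upper bound, raw LG."""
    # +00 and +04: subrange bounds, assumed to be signed double-words.
    lower, upper = struct.unpack_from("<ll", image, offset)
    # +08: LG to the "upward compatible" type, kept as two raw words.
    lg = struct.unpack_from("<HH", image, offset + 8)
    return lower, upper, lg

def in_range(value, suffix):
    """True if `value` satisfies the subrange constraint of `suffix`."""
    lower, upper, _ = suffix
    return lower <= value <= upper
```

A dump program could use in_range to flag constants that fall outside the subrange recorded for their type.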
4.4.4.3.4.4 ENUMERATIONS

This type Suffix is unusual and has the following format:

+00: A double-word containing the LOWER bound of the subrange constraint on the type;

+04: A double-word containing the UPPER bound of the subrange constraint on the type;

+08: An LG that locates the Prefix of the current Type Descriptor. There is no upward compatible type.

What follows is a full-fledged SET Type Descriptor whose base type is the Type Descriptor of the Enumerated Type itself. The author has not yet discovered the reason for this. At least one case has been observed where a set type descriptor is followed by a word containing zero but I know of no explanation. Could this be a (shudder) BUG in Turbo?

4.4.4.3.5 SUBPROGRAM TYPES

The length of this Suffix is variable. The format is as follows:

+00: An LG that locates the Type Descriptor of the FUNCTION result returned by the Subprogram. This field is zero if the Subprogram is a PROCEDURE.

+04: A Word that contains the number of Formal Parameters in the Function/Procedure header. If non-zero, then this word is followed by the parameter list itself as a simple array of parameter descriptors.

The format of a parameter descriptor is as follows:

+00: An LG that locates the Type Descriptor of the corresponding parameter;

+04: A Byte that identifies the parameter passing mechanism used for this entry as follows:

      02h -> VALUE of parameter is passed on STACK,
      06h -> ADDRESS of parameter is passed on STACK.

5. MAPS AND LISTS

The "MAPS and LISTS" are not part of the symbol dictionary.
Rather, these structures provide access to the Code and Data Segments produced by the compiler or included via the {$L name.OBJ} directive. The format and purpose (as understood by this author) of each of these tables is explained in the following sections.

5.1 PROC MAP

The PROC Map provides a means of associating the various Function and Procedure declarations with the Code Segments. There is some evidence that the Compiler produces CODE (and DATA) Segments for EACH of the Subprograms defined in the Unit as well as for the un-named Unit Initialization code block. There is also evidence that EXTERNAL PROCs must be assembled separately in order to exploit fully the Turbo "Smart Linker" since Turbo Pascal places some significant restrictions on EXTERNAL routines in the area of Segment Names and Types. Specifically, only code segments named "CODE" and data segments named "DATA" or "CONST" will be used by the "Smart Linker" as sources of code and data for inclusion in a Turbo Pascal .EXE file. (Turbo 6.0 relaxed the Name constraints, but the limit of one code segment per .OBJ remains.)

The first entry in the PROC Map is reserved for the Unit Initialization block. If there is no Unit Initialization block, this entry will be filled with $FF. In addition, each and every PROC in the Unit has an entry in this table. If an EXTERNAL routine is included, then ALL PUBLIC PROC definitions in that routine must be declared in the Unit Source Code with the EXTERNAL attribute.

The size of the PROC Map Table (in Bytes) is implied in the Unit Header by the LL's that occur at offsets +0C and +0E.

The Format of a single PROC Map Entry is as follows:

+00: A Word presumably reserved as a work area; always zero.

+02: A Word presumably reserved as a work area; always zero.

+04: A Word that contains an offset within the CSeg Map. This is used to locate the code segment containing the PROC.
+06: A Word that contains an offset within the CODE Segment that defines the PROC entry point relative to the load point of the referenced CODE Segment.

5.2 CSEG MAP

The CSeg Map provides a convenient descriptor table for each CODE Segment present in the Unit and serves to relate these segments with the Segment Relocation Data and the Segment Trace Table. It seems reasonable to infer that the "Smart Linker" is able to include/exclude code/data at the SEGMENT level only. The CSeg Map is an array of fixed-length records whose format is as follows:

+00: A Word apparently reserved for use by TURBO.

+02: A Word that contains the Segment Length (in bytes).

+04: A Word that contains the Length of the Fix-Up Data Table for this Code Segment (in bytes).

+06: A Word that contains the offset of the Trace Table Entry for this Segment (if it was compiled with DEBUG Support). If there is no Trace Table for this segment, then this Word contains FFFFh.

5.3 TYPED CONST DSEG MAP

The CONST DSeg Map provides a convenient descriptor table for each DATA Segment which was spawned by the presence of Typed Constants or VMT's in the Pascal Code. It serves to relate these segments with the Segment Fix-Up (relocation) Data and with the Code Segments that refer to these DATA elements. One entry is present for each CONST declaration part containing typed constants and for each CONST segment linked from an ".OBJ" file. The CONST DSeg Map is an array of fixed-length records whose format is as follows:

+00: A Word apparently reserved for use by TURBO.

+02: A Word that contains the Segment Length (in bytes).

+04: A Word that contains the Length of the Fix-Up Data Table for this DATA Segment (in bytes).
+06: A Word that contains an LL which locates the OBJECT that owns this VMT template or zero if the segment is not a VMT template.

One can determine the defining block for a Typed Constant declaration and our program attempts to do just that. A by-product of the dictionary mapping algorithm allows the declaring block to be found and its qualified name printed. This information is also used to explain fix-up data as to its source. Results will be incomplete unless a really comprehensive dictionary is present in the unit.

5.4 GLOBAL VAR DSEG MAP

The VAR DSeg Map provides a convenient descriptor table for each DATA Segment present in the Unit. One entry exists for each CODE segment which refers to GLOBAL VAR's allocated in the DATA Segment. These references may be seen in the Fix-Up Data Table. Each EXTERNAL CSeg having a segment named DATA also spawns an entry in this table. Only the Code Segments that meet these criteria cause entries to be generated in the VAR DSeg Map. The VAR DSeg Map is an array of fixed-length records whose format is as follows:

+00: A Word apparently reserved for use by TURBO.

+02: A Word that contains the Segment Length (in bytes). This may be zero, especially if the EXTERNAL routine contains a DATA segment whose sole purpose is to declare one or more EXTRN symbols that are defined in some DATA segment external to the Assembly.

+04: A Word apparently reserved for use by TURBO.

+06: A Word apparently reserved for use by TURBO.

One can determine the defining block for a Global VARiable declaration and our program attempts to do just that. A by-product of the dictionary mapping algorithm allows the declaring block to be found and its qualified name printed. This information is also used to explain fix-up data as to its source.
Results will be incomplete unless a really comprehensive dictionary is present in the unit. Such DSegs can be referenced by many CSegs and we only locate the first one. This is okay for Pascal code but it's ambiguous for assembler since the names may be PUBLIC and referenced by more than one module.

5.5 DONOR UNIT LIST

This list contains an entry for each Unit (taken from the "USES" list) which MAY contribute either CODE or DATA to the executable file. Not all units do make such a contribution as some exist merely to define a collection of Types, etc. A Unit gets into this list if there exists a single Fix-Up Data Entry that references CODE or DATA in that Unit. The list consists of elements whose SIZE is variable and whose format is as follows:

+00: A WORD apparently reserved for use by TURBO.

+02: A variable-length String containing the unit name.

5.6 SOURCE FILE LIST

This list contains an entry for each "source" file used to compile the Unit. This includes the Primary Pascal file, files containing Pascal code included by means of the {$I filename.xxx} compiler directive, and .OBJ files included by the {$L filename.OBJ} compiler directive.

The order of entries in this list is critical since it maps the CODE segments stored in the unit. The order of the entries is as follows:

      1) The Primary Pascal file;
      2) All Included Pascal files;
      3) All Included .OBJ files.

Mapping of CSegs to files is done as follows:

      a) Each .OBJ file contributes a SINGLE Code Segment (if any). Note that this author has not observed an .OBJ module that contains only a DATA Segment (but that seems a distinct possibility).

      b) The Primary Pascal file (augmented by all included Pascal Files) contributes zero or more CODE Segments.

Therefore, there are at least as many CSeg entries as .OBJ files.
If there are more, then the excess entries (those at the front of the list) belong to the Pascal files that make up the Pascal source for the unit.

The format of an entry in this list is as follows:

   +00: A flag byte that indicates the type of file represented:
           04h -> the Primary Pascal Source File,
           03h -> an Included Pascal Source File,
           05h -> an .OBJ file that contains a CODE segment.

   +01: A Word apparently reserved for use by the Compiler/Linker.

   +03: A Word that is zero for .OBJ files and which contains the file directory time-stamp for Pascal Files.

   +05: A Word that is zero for .OBJ files and which contains the file directory date-stamp for Pascal Files.

   +07: A variable-sized string containing the filename and extension of the file used during compilation.

5.7 DEBUG TRACE TABLE

If Debug support was selected at compile time, then all Pascal code which supports Debugging produces an entry in this table. The table entries themselves are variable in size and have the following format:

   +00: A Word which contains an LL that locates the Dictionary Header of the Symbol (a PROC name) this entry represents.

   +02: A Word which contains the offset (within the Source File List) of the entry that names the file that generated the CSeg being traced. This allows the file included by means of the {$I filename} directive to be identified for DEBUG purposes, as well as code produced from the Primary File.

   +04: A Word containing the number of bytes of data that precede the BEGIN statement code in the segment. For Pascal PROCS these bytes consist of literal constants, un-typed constants, and other data such as range-checking limits, etc.

   +06: A Word containing the Line Number of the BEGIN statement for the PROC.

   +08: A Word containing the number of lines of Source Code to Trace in this Segment.
   +0A: An array of bytes whose size is at least the number of source code lines in the PROC. Each byte contains the number of bytes of object code in the corresponding source line. This appears to be an array of SHORTINT since, if a "line" contains more than 127 bytes, a single byte of $80 precedes the actual byte count as a sort of "escape" and the next byte records the count (up to 255 bytes) for the line. This situation has not yet been fully explored. We do not yet know what happens in the event a line is credited with spawning more than 255 bytes of code.

6. CODE, DATA, FIX-UP INFO

This area begins at the start of the next free PARAGRAPH. This means that its offset from the beginning of the Unit, written in hexadecimal, ALWAYS ends in the digit zero. This area contains the CODE segments, CONST DATA segments, and the Relocation (Fix-Up) Data required for linking.

6.1 OBJECT CSEGS

Each CODE segment included in the unit appears here as specified by the CSeg Map Table. Depending on usage, these segments may appear in the executable file. There are no filler bytes between segments.

6.2 CONST DSEGS

This section begins at the start of the first free PARAGRAPH following the end of the Object CSegs. This means that its offset from the beginning of the Unit, written in hexadecimal, ALWAYS ends in the digit zero.

A DATA segment fragment appears here for each CSeg that declares a typed constant, and for each OBJECT which employs Virtual Methods, Constructors or Destructors. There are no filler bytes between segments.

If local symbols were generated, there is always enough information to allow documenting the scope of the declaration as well as interpreting the data in the display since the needed type declarations would also be available. Our program merely identifies the defining block.
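The paragraph rule used for the areas above can be illustrated with a short Python sketch. It is not taken from the supplied program; the function name and the sample offsets are hypothetical, and it assumes that an area whose predecessor already ends on a paragraph boundary starts at that same offset.

```python
PARAGRAPH = 16  # an 8086 paragraph is 16 bytes

def next_paragraph(offset):
    """Round offset up to a multiple of 16 (unchanged if already aligned)."""
    return (offset + PARAGRAPH - 1) & ~(PARAGRAPH - 1)

# Hypothetical end-of-area offsets, for illustration only.
for end in (0x1234, 0x1240, 0x1241):
    start = next_paragraph(end)
    # The low hex digit of every aligned offset is zero.
    print(f"area ending at {end:04X}h, next area starts at {start:04X}h")
```

Because each result is a multiple of 16, its last hexadecimal digit is always zero, which is the property noted in the text.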
6.3 FIX-UP DATA TABLE

This table begins at the start of the first free PARAGRAPH following the end of the CONST DSegs. This means that its offset from the beginning of the Unit ALWAYS ends in the digit zero.

There are two sections in this table: one for code, and one for data. Both sections are aligned on paragraph boundaries. This may result in a "slack" entry between the code and data sub-sections, but this entry is included in the byte tally for the section stored in the Unit Header Table at UHZFA (offset +22).

The table begins with entries for the CSeg Map and ends with entries for the CONST DSeg Map. The appropriate Map entry specifies the number of bytes of Relocation Data for the corresponding segment. This number may be zero, in which case there is no Relocation Data for the given segment.

The Table consists of an array of eight (8) byte entries whose format is as follows:

   +00: A Byte containing the offset within the Donor Unit List of the Unit name that this entry refers to. This can be the compiled Unit or some previously compiled external unit.

   +01: A Byte of BIT switches that identify the type of reference and the size of the needed fix-up (WORD or DWORD). A lot of guess-work led to the following interpretation:

           7654  (bits 3-0 don't seem to be used)
           00--  Locate item via a PROC Map,
           01--  Locate item via a CSeg Map,
           10--  Locate item via a Global VAR DSeg Map,
           11--  Locate item via a Const DSeg Map,
           --00  WORD offset has NO effective address adjustment,
           --01  WORD offset HAS an effective address adjustment,
           --10  WORD SEGMENT-Only fix-up (address of some PUBLIC segment),
           --11  DWORD (FAR) pointer; possible effective address adjustment.

   +02: A Word containing the offset within the Map table referenced according to the above code scheme.
   +04: A Word containing an offset within the target segment which will be added to the effective address. For example, a reference to the VAR DSeg Map will require a final offset to locate the item (variable) within the DATA SEGMENT being referenced here. This may also be needed for references to LITERAL DATA embedded in a CODE SEGMENT.

   +06: A Word containing the offset within the CODE or DATA segment owning this entry that contains the area to be patched with the value of the final effective address.

7. SUPPLIED PROGRAM

In order that the above information be made constructively useful, the author has designed a program that automates the process of discovery. It is not a "handsome" program and it is not a work of art. It does give useful results provided your PC has enough available memory.

It should be obvious that the program was not designed "top-down". Rather, it just evolved as each new discovery was made. Later on, it seemed reasonable to try to document some of the relations between the various lists and tables and the program tries to make some of these relations clear, albeit with varying degrees of success.

7.1 TPU6

This is the main program. It will ask for the name of the unit to be documented. Reply with the unit name only. The program will append the ".TPU" extension and will search for the proper file. It will also search TURBO.TPL if necessary. The program will then ask if Dis-Assembly is desired and will require a "y" or "n" answer. If "y", it also asks about the CPU.

The current directory will be searched first, followed by all directories in the current PATH. If the .TPU file is not found, the program will search for it in the "TURBO.TPL" (Turbo Pascal Library) file. Units in the "USES" list(s) will also be loaded to enable resolution of LG items.
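The search order just described might be sketched as follows in Python. This is only an illustration and not the code of TPU6: the function names are hypothetical, the existence check is passed in as a parameter, and how the program finds TURBO.TPL itself is not stated above, so the sketch simply returns its name.

```python
import ntpath  # DOS-style path joining, available on every platform

def candidate_files(unit_name, path_dirs, current_dir="."):
    """List the .TPU files to try: current directory first, then PATH."""
    file_name = unit_name + ".TPU"
    dirs = [current_dir] + [d for d in path_dirs if d]  # skip empty PATH items
    return [ntpath.join(d, file_name) for d in dirs]

def locate_unit(unit_name, path_dirs, exists, library="TURBO.TPL"):
    """Return ("file", path) for the first hit, else fall back to the library."""
    for candidate in candidate_files(unit_name, path_dirs):
        if exists(candidate):
            return ("file", candidate)
    return ("library", library)

# Example with a made-up directory layout in place of a real disk.
on_disk = {"C:\\TP\\CRT.TPU"}
print(locate_unit("CRT", ["C:\\TP", "C:\\UTIL"], on_disk.__contains__))
print(locate_unit("SYSTEM", ["C:\\TP", "C:\\UTIL"], on_disk.__contains__))
```

In real use the `exists` argument could be `os.path.exists`; passing it in keeps the ordering logic separate from the file system.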
If the desired unit is found, the program will write a report to the current directory named "unitname.lst" which contains its analysis. The format of the report is such that it may be copied to a printer if that printer supports TTY control codes with form-feeds. Be judicious in doing this however since there can be a lot of information. The Turbo SYSTEM.TPU unit file produces almost ninety (90) pages without the disassembly option. When disassembly is requested for the SYSTEM unit, the size of the output file exceeds 700K bytes.

7.1.1 UNIT TPU6AMS

This Unit contains all Type Definitions, Structures, and primitive Functions and Procedures required by the program. All structures documented in this report are also documented in TPU6AMS by means of the TYPE mechanism. Some of the structures are difficult if not impossible to handle using ISO Pascal but Turbo Pascal provides the means for getting the job done.

7.1.2 UNIT TPU6EQU

This Unit is new and contains constants and types of general utility that are not strictly unit related. It also contains the pointer manipulation routines that are sensitive to the particular version of Turbo Pascal (Version 6.0 in this case). It also contains a Heap Error Function that keeps track of the high-water mark of Heap Utilization of any program that uses it. This function gets installed automatically.

7.1.3 UNIT TPU6UTL

This Unit is new. It contains the higher-level analysis algorithms formerly located in the main program and in TPU6AMS. The algorithms have been re-cast with object-orientation in mind and have potential for re-use in other contexts.

The unit computes a cover for the dictionary and deduces relationships between dictionary, code, data and the CSeg, PROC, CONST and VAR Maps discussed in section 5.
This information is retrieved by the main program to drive the printing process.

This Unit also loads all units specified in the USES list of the prime unit to allow the names of externally defined types to be recovered on the report. Array bounds are also retrieved in this way. The code will search for needed units in TURBO.TPL without intervention. Close attention is paid to Heap Management and minimal utilization of Heap storage.

The dictionary areas of the Units located in the USES list get loaded into the Heap at no extra charge. Nothing but the dictionary area is of any use at this point. The name and fully-qualified file name of each unit successfully loaded are printed at the top of the listing. Unit version numbers must agree or the unit will not be loaded. Dictionary covers are computed for each loaded unit to aid in rapid LG-resolution.

7.1.4 UNIT TPU6RPT

This is a Unit that contains the text-file output primitives required by the main program. It's not very pretty but it does work.

7.1.5 UNIT TPU6UNA

This unit is a rudimentary disassembler. The output will not assemble and may look strange to a "real" assembler programmer since I am not well-qualified in this area. However, the basis for support of 80286, 80386 etc. processors is present as well as coprocessor support. Of perhaps the greatest interest is that it does appear to decode the emulated coprocessor instructions that are implemented via INT $34 through $3D.

Be warned however. The output is not guaranteed since this was coded by myself and I am perhaps the rankest amateur that ever approached this quite awful assembler language. For convenience, the operand coding mimics TASM "Ideal" mode. As is usual with programs of this type, error-recovery is minimal and no context checking is performed.
If the operation code is found to be valid, then a valid instruction is assumed -- even if invalid operands are present. The only positives that apply to this program are that it doesn't slow the cpu down (although a lot more output is produced), and it does let one "tune" code for compactness by letting one view the results of the coding directly. Also, incomplete instructions are handled as data rather than overrunning into the next proc.

7.2 MODIFICATIONS

It was intended from the beginning that this program should be able to be enhanced to permit external units to be referenced during the analysis of any given unit, even if they were library components. Since the original release of this document, the program has been so enhanced.

This program was NOT intended as a pilot for some future product. It WAS intended as a rather "ersatz" tool for myself.

7.3 NOTES ON PROGRAM LOGIC

The following sections discuss a few of the methods employed by the supplied program.

7.3.1 FORMATTING THE DICTIONARY

Printing the unit dictionary area in a way that exposes its underlying semantics is no small task. The unit dictionary area itself is a rather amorphous-looking mass of data composed of hash tables, dictionary headers and stubs, type descriptors, etc. In order to present all this information in a meaningful way, we have to reveal its structure and this cannot be done by means of a sequential "browse" technique. Rather, we have to visit all nodes in the dictionary area so that each may be formatted in a way that exposes its function and meaning. This is made necessary by the fact that items are added to the dictionary as encountered and no convenient ordering of entry types exists.

What we have here is the problem of finding a minimal "cover" for the dictionary area that properly exposes the content and structure of the dictionary area.
To do this, we construct (in the heap) a stack and a queue, both of which are initially empty. The entries we put in the stack identify the class of entry (Hash Table, Dictionary Header, Type Descriptor or In-Line Code group), the location of the structure, and the location of its immediate "owner" or "parent" dictionary entry (which allows some limited information about scope to be printed).

To the empty stack, we add an entry for the unit name dictionary entry, the INTERFACE hash table, and the Debug hash table. All these are located via direct pointers (LL's) in the Unit Header Table. We then pop one entry off the stack and begin our analysis.

   a) If the entry we popped off the stack is not present in the queue, we add it and call a routine that can interpret the entry (aka, "cover") for a Dictionary Header, Hash Table, or Type Descriptor. (This may lead to additional entries being added to the stack such as nested-scope hash tables, Dictionary Headers, Type Descriptors or In-Line Code group entries.)

   b) While the stack is not empty, we pop another entry and repeat step "a" (above) until no more entries are available.

The result is a queue containing one entry for each structure in the unit dictionary area that is identifiable via traversal. (In practice, the method we use is similar to a "breadth-first" traversal of an n-way tree that is implemented in non-recursive fashion.)

Each entry in the queue contains the information described above and the queue itself thus forms a set of descriptors that drive the process of formatting the dictionary area for display. The process may be likened to "painting by the numbers" or to finding a way to lay tile on a flat surface using tiles of four different irregular shapes until the floor is exactly covered.
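The stack-and-queue procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of TPU6UTL: the `Node` tuple, the `compute_cover` function, the `expand` callback (which stands in for the routines that interpret a structure and report what it refers to) and the toy offsets are all hypothetical.

```python
from collections import namedtuple

# class_ is "hash", "header", "type" or "inline"; loc is an offset in the
# dictionary area; parent is the offset of the owning entry (or None).
Node = namedtuple("Node", "class_ loc parent")

def compute_cover(roots, expand):
    """Return the queue of every structure reachable from the roots."""
    stack = list(reversed(roots))  # so that roots[0] is popped first
    queue = []                     # structures in order of discovery
    seen = set()                   # locations already present in the queue
    while stack:
        node = stack.pop()
        if node.loc in seen:       # step "a": skip entries already queued
            continue
        seen.add(node.loc)
        queue.append(node)
        stack.extend(expand(node)) # interpreting a node may add more work
    return queue

# Toy dictionary: one hash table, two headers sharing one type descriptor.
toy = {0x10: [("header", 0x40), ("header", 0x60)],
       0x40: [("type", 0x80)],
       0x60: [("type", 0x80)],
       0x80: []}

def expand(node):
    return [Node(c, loc, node.loc) for c, loc in toy[node.loc]]

cover = compute_cover([Node("hash", 0x10, None)], expand)
for node in cover:
    print(node)
```

Although the descriptor at 80h is reachable from two headers, it appears in the queue only once, and the parent recorded for it is simply whichever header reached it first.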
There is one significant limitation that needs to be pointed out. It is not always possible to determine the "parent" or "owner" of a node with certainty. The following discussion illustrates the problem of finding the "real" parent of a Type Descriptor.

Almost every "type" in Turbo Pascal is actually derived from the basic types that are defined in the SYSTEM.TPU unit -- e.g. "INTEGER", "BYTE", etc. In addition, several of the Type Descriptors in the SYSTEM unit are referenced by more than one Dictionary Entry. Thus, we find that a "many-to-one" relationship may exist between Dictionary Entries and Type Descriptors. How does one find out which is the entry that actually gave rise to the Type Descriptor?

The Dictionary Area of a unit has some special properties, one of which is the fact that the Dictionary Entries for named Types are often located quite near their primary type descriptors. The Dictionary Area seems to be treated as an upward growing heap with the various structures being added by Turbo as encountered. This makes it likely that the Type "Q" header which gives rise to a type descriptor occurs earlier in the Dictionary Area than any other header which refers to the same descriptor. We take advantage of this property to allocate "ownership" but it may not be "fool-proof".

Some type descriptors are spawned by other type descriptors, especially for structured types. We don't attempt to allocate "ownership" to these "lower-level" descriptors but we do try to keep track of scope information.

A useful by-product of the above process is the ability to discover many of the associations between Global Variables, Typed CONST's, VMT's and the blocks in which they are declared or defined.
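One possible reading of the "earliest header wins" rule for Type Descriptors is sketched below in Python. It is an assumption-laden illustration rather than the supplied program's logic: the function name and the offsets are made up, and the input is taken to be a list of (header offset, descriptor offset) pairs gathered from Type "Q" headers.

```python
def assign_owners(references):
    """Map each descriptor offset to the lowest header offset that refers to it."""
    owners = {}
    for header, descriptor in references:
        # Keep the header that occurs earliest in the Dictionary Area.
        owners[descriptor] = min(header, owners.get(descriptor, header))
    return owners

# Hypothetical offsets: two headers refer to the descriptor at 200h.
refs = [(0x120, 0x200), (0x0A0, 0x200), (0x300, 0x340)]
owners = assign_owners(refs)
for descriptor, header in sorted(owners.items()):
    print(f"descriptor {descriptor:04X}h is owned by header {header:04X}h")
```

As the text warns, this is a heuristic: it is right only as long as the defining header really does precede every other header that refers to the same descriptor.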
7.3.2 THE DISASSEMBLER

To start with, I apologize up front for mistakes which are bound to be present in this routine. I am not really a MASM or TASM programmer and I will not pretend otherwise. This being the case, the formatting I have chosen for the operands may be erroneous or misleading and might (if submitted to one of the "real" assemblers) produce object code quite different from what is expected. I hope not, but I have to admit it's possible.

My intention in adding this unit was to support hand-tuning of object code. With practice and some effort, one can observe the effect on the object module caused by specific Pascal coding. Thus, where compactness or speed is an issue of paramount importance, TPU6UNA can be of help. In some cases, a simple re-arrangement of the local variable declarations in a procedure can have a significant effect on the size of the code if it means the difference between 1 and 2-byte displacements for each instruction that references a specific local variable. Potential applications along these lines seem almost unlimited.

I adopted an operand format not unlike that of TASM "Ideal" mode since it was more convenient to do so and looked more readable to me. I relied on several reference books for guidance in decoding the entire mess and I found that there were several flaws (read ERRORS) in some of them which made the job that much more difficult. I then compounded my problems by attempting to handle 80386 specific code even though Turbo Pascal does not yet generate code specific to these processors. I simply felt that the effort involved in writing any sort of Dis-Assembly program for Turbo Pascal units was an effort best experienced not more than once.

With all this self-flagellation out of my system once and for all, I will try to show the basic strategy of the program and to explain the limitations and some of the discoveries I made.
The routine is intended to be idiotically simple - i.e., no smarter than the DEBUG command in principle. The basic idea is: pass some text to the routine and get back ONE line derived from some prefix of that text. Repeat as necessary until all text is gone.

Thus, there is no attempt to check the context of the text being processed. Also, some configurations of the "modR/M" byte may be invalid for selected instructions. I don't try to screen these out since the intent was to look at the presumably correct code produced by TURBO Pascal -- not devious assembly language. Also, this program regards WAIT operations as "stand-alone" -- i.e., it doesn't check to see if a coprocessor operation follows for which the WAIT might be regarded as a prefix.

One area of real difficulty was figuring out the Floating-Point emulations used by Turbo Pascal that are implemented by means of interrupts $34 through $3D. I don't know if I got it right, but the results seem reasonable and consistent. In the listing, the Interrupt is produced on one line, followed by its parameters on the next line. The parameter line is given the op-code "EMU_xxxx" where "xxxx" is the coprocessor op-code I felt was being emulated.

Interrupt $3C was a real puzzler but after seeing a lot of code in context, I think that the segment override is communicated to the emulator by means of the first byte after the $3C. Normally, in a non-emulator environment, all coprocessor operations (ignoring any WAIT prefixes) begin with $D8-$DF. What Borland (and maybe Microsoft) seem to have done here is to change the $D8-$DF so that bits 7 and 6 of this byte are replaced with the one's complement of the 2-bit segment register number found in various 8086 instructions.
This seems to be how an override for the DS register is passed to the emulator. I don't KNOW this to be the correct interpretation, but the code I have examined in context seems to work under this scheme, so TPU6UNA uses it to interpret the operand accordingly.

For 80x86 machines, the problem was somewhat simpler. TPU6UNA takes a quick look at the first byte of the text. Almost any byte is valid as the initial byte of an instruction, but some instructions require more than one byte to hold the complete operation code. Thus, step 1 classifies bytes in several ways that lead to efficient recognition of valid operation codes. Once the instruction has been identified in this way, it is more or less easy to link to supplemental information that provides operand editing guidance, etc. The tables that embody the recognition scheme were constructed using PARADOX 3.0 (another fine Borland product) and suitably coded queries were used to generate the actual Turbo Pascal code for compilation.

For those that are interested, TPU6UNA supports the address-size and operand-size prefixes of the 80386 as well as 32-bit operands and addresses, but remember that Turbo Pascal doesn't generate these. Provision is made for a trivial change that allows segments which default to 32-bit mode to be handled as well.

There is a simple mode variable that gets passed to TPU6UNA by its caller which specifies the most-capable processor whose code is to be handled. Codes are provided for the 8086 (8088 is the same), 80186 (same as 80286 without protected mode instructions), 80286 (80186 plus protected mode), and 80386. The program now asks which one to use.

No such specifier is provided for coprocessor support. What is there is what I think an 80387 supports.
I don't think that this is really a problem if you don't try to use TPU6UNA for anything but Turbo Pascal code.

Error recovery is predictably simple. The initial text byte is output as the operand of a DB pseudo-op and provision is made to resume work at the next byte of text.

I hope this program is found to be useful in spite of the errors it must surely contain. I have yet to make much sense of the rules for MASM or TASM operand coding and I found very little of value in many of the so-called "texts" on the subject. I found myself in the position of that legendary American in England watching a Cricket match for the first time ("You mean it has RULES?").

8. UNIT LIBRARIES

I have examined .TPL files in passing and feel that their structure is trivial. It's so easy to handle them that the program now routinely examines TURBO.TPL to resolve named types.

8.1 LIBRARY STRUCTURE

A Turbo Pascal Library (.TPL) file is a simple catenation of Turbo Pascal Unit (.TPU) files. Since the length of a Unit may be determined from the Unit Header (see section 3.1), it is simple to see that one may "browse" through a .TPL file looking for an external unit such as SYSTEM.TPU. The supplied program does just that in its unit retrieval process so the TPUMOVER utility is no longer required for processing of units in TURBO.TPL.

9. APPLICATION NOTES

One of the more obvious applications of this information would seem to be in the area of a Cross-Reference Generator. There is a very fine example of such a program in the public domain that was written by Mr. R. N. Wisan called "PXL". This program has been around since the days of Turbo Pascal Version 1. The program has been continually enhanced by the author in the way of features and for support of the newer Turbo Pascal versions.
It does not however solve the problem of telling one which unit contains the definition of a given symbol. In fairness to "PXL" however, this is no small problem since the format of .TPU files keeps changing (Turbo 6.0 Units are not object-code compatible with Turbo 5.x Units, and so on...) and Mr. Wisan probably has more than enough other projects to keep himself occupied.

However, for the user who is willing to work a little (maybe a lot?), this document would seem to provide the information needed to add such a function to his own pet cross-reference generator.

Further, with SIGNIFICANTLY more effort, it should be possible to do much of the job of de-compilation -- provided the DEBUG dictionary is present. At the very least, most declarations should be recoverable. It's another thing entirely to try to reconstruct plausible TURBO Pascal code from the CSegs. This would be a formidable task and lots of knowledge about TURBO's code generators would have to be acquired. At present, the only way I know to get this information is to have the run-time library source codes and then work-work-work at testing code produced by the compiler for a huge number of test case units. You have to want to do this really badly in order to invest the time. I am not that tired of living.

10. ACKNOWLEDGEMENTS

This project would have been totally infeasible without the aid of some very fine tools. As it was, several hundred man hours have been expended on it and as you can see, there are a few unresolved issues that have been (graciously) left for others to address.

The tools used by this author consisted of:

   1) Turbo Pascal 6.0 Professional by Borland International
   2) Microsoft WORD (version 5.0)
   3) LIST (version 7.5) by Vernon D. Buerg
   4) the DEBUG utility in MS-DOS Version 3.3.
   5) PARADOX 3.0 by Borland International
   6) QUATTRO PRO by Borland International
   7) TURBO ASSEMBLER 1.1 by Borland International

(PARADOX and QUATTRO PRO were used for data collection and analysis in the course of coding the recognizer tables for the disassembler unit.)

The references listed were of great value in this project. [Intel85] was a valuable source of information about coprocessor instructions as well as offering hints about the differences between the 8086/8088 and the 80286. The [Borland] TASM manuals offered further info on the 80186. [Nelson] provided presentations of well-organized data directed at the problem of disassembly but the tables were flawed by a number of errors which crept into my databases and which caused much of the extra debugging effort. [Intel89] offered valuable insights on the 80386 addressing schemes as well as the 32-bit data extensions. Finally, [Brown] provided valuable clues on the Floating-Point emulators used by Borland (and Microsoft?).

As you can see, the amount of hard information available to me on this project was quite limited since I am unaware of any other existing body of literature on this subject.

That's it folks. Does anyone wonder why it took several hundred man hours to get to this point? It took a lot of hard (and at times tedious) work coupled with a great many lucky guesses to achieve what you see here.

11. REFERENCES

[Borland], TURBO ASSEMBLER REFERENCE GUIDE, Borland International, 1988.

[Borland], TURBO ASSEMBLER USER'S GUIDE, Borland International, 1988.

[Borland], TURBO PASCAL 6.0 PROGRAMMING GUIDE, Borland International, 1990.

[Borland], TURBO PASCAL LIBRARY REFERENCE Version 6.0, Borland International, 1990.

[Borland], TURBO PASCAL USER'S GUIDE Version 6.0, Borland International, 1990.
[Brown], INTER191.ARC, Ralf Brown, 1991.

[Intel85], iAPX 286 PROGRAMMER'S REFERENCE MANUAL INCLUDING THE iAPX 286 NUMERIC SUPPLEMENT, Intel Corporation, 1985, (order number 210498-003).

[Intel89], 386 SX MICROPROCESSOR PROGRAMMER'S REFERENCE MANUAL, Intel Corporation, 1989, (order number 240331-001).

[Nelson], THE 80386 BOOK: ASSEMBLY LANGUAGE PROGRAMMER'S GUIDE FOR THE 80386, Ross P. Nelson, Microsoft Press, 1988.

[Scanlon], 80286 ASSEMBLY LANGUAGE ON MS-DOS COMPUTERS, Leo J. Scanlon, Brady, 1986.

INDEX

.OBJ file 12, 13, 30, 31, 33
.TPL file 6, 14, 37, 38, 43
.TPU file 5, 7, 11, 14, 23, 37, 43, 44
   size 14
   SYSTEM 6, 16, 17, 18, 23, 37, 40, 43
Assembler 6
Attribute
   ABSOLUTE 7
   EXTERNAL 20, 30
Call Model
   ASSEMBLER 20
   FAR 20
   INLINE 20
   INTERRUPT 20
CONST 6, 11, 12, 13, 19, 24, 26, 31, 35, 36, 38
Constraint 28, 29
CSeg 6, 11, 12, 30, 31, 32, 33, 34, 35, 36, 38
Defining block 31, 32
Directive 12, 13, 14, 20, 30, 33, 34
External 7, 30, 32, 36, 39, 43
Hash 11, 12, 13, 14, 15, 16, 17, 20, 25, 26, 39, 40
Include 33, 34
Interface 6, 11, 12, 13, 14, 15, 16, 17, 22, 40
Locator
   LG 7, 10, 18, 19, 21, 23, 25, 26, 27, 28, 29
   LL 7, 11, 16, 22, 30, 40
   offset 7, 9, 10, 19, 20, 26, 30, 31, 34, 35, 36
Method 20
   CONSTRUCTOR 20
   DESTRUCTOR 20
   Self 19
Operand
   offset 36
Parameter 18, 19, 20, 21, 29
PROC 6, 11, 12, 20, 30, 34, 36, 38, 39
SEGMENT 36
Signature 5, 22
Stub 7, 17, 18, 19
Type Descriptor 18, 19, 21, 23, 25, 26, 27, 28, 29, 40, 41
VAR 32, 38
VMT 12, 13, 20, 26, 31