Bulletin #118 Clarion Index and Key Files Overview This bulletin explains the layout of Clarion index and key files. Clarion Index and Key Files While Technical Bulletin #117, "Clarion Data File," covered the layout of Clarion Professional Developer's data files, it only touched on the subject of the index and key files that Professional Developer uses to access your data in a specific order. The purpose of this technical bulletin is to explain how Clarion's key and index files are laid out. This technical bulletin assumes a knowledge of basic data structures and the C programming language. While you don't have to be a C guru to understand the material presented here, a passing knowledge helps, as C makes it easy to create many types of data structures. As far as this discussion is concerned, there are NO internal differences between index and key files. Index files are simply key files that are not updated automatically by adding, deleting, of changing records in your data base. Also, this bulletin will not explain how you create key files, only how to read them. Our method of managing B+ Trees is proprietary and quite complicated. In all fairness, we can only support products that create key files using our own code. When you see a non-Clarion product that creates and/or maintains Clarion key files, it means they have licensed our code for use in their product. Let's Begin Computers are very good at storing many types of information. However, in today's business world, just being able to store a lot of data isn't good enough; accessing this data quickly is of primary importance. It doesn't do you any good for your computer to store 1,000,000 records if it takes 30 minutes to find a specific record. Clarion products use a method of indexing called "B+ Trees" (pronounced "Bee Plus Trees"). B+ Trees are a modification of another indexing method called "B-Trees," which were created by a man named R. Bayer in 1970. Let's start by seeing what a B-Tree is and how it works. The Ubiquitous B-Tree (with apologies to Doug Comer) A B-Tree is a data structure that, when drawn on paper, resembles an upside- down tree. B-Trees consist of a ROOT node, zero or more interior NODES, and zero or more LEAVES. A typical tree would look like this: ROOT ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ³ ³ NODE NODE ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ³ ³ ³ ³ ³ ³ LEAF LEAF LEAF LEAF LEAF LEAF In actuality, all pieces of the tree are NODES, but they are given special names because of properties unique to each one. In any B-Tree, there is only one ROOT node. It is where the tree starts. LEAF nodes are nodes that have no child nodes. All other nodes are called "interior" nodes, or just NODES. Nodes contain pairs of key-pointer structures. The key will contain whatever data you've built tree from and the pointer will either be a reference to another node or to data in your data base. You read the tree by starting at the root and looking at the key values. If the key you have is less than then first key in the root, you take the first branch. If the key you have is greater than or equal to the first key in the root, but less than second, take the second branch, and so on. When you take a branch, continue with the same operation. When you finally reach the leaf level, you'll either find a match (meaning a record with a key you are looking for exists in the data base) or you won't. If you do find a match, the pointer in the leaf will tell you where in the data base the marching record is. The nice thing about B-Trees is that if you run into the "not found" condition, you are currently at the place in the tree where the key would be added is you wanted to add it to the tree. However, you can't just create a data structure that looks like this and call it a B-Tree. B-Trees have specific properties, several of which make the B-Tree as efficient as it is. For example, B-Trees are balanced trees, which means that no matter what data you add to the tree, it always takes the same number of accesses to retrieve a record. Let's look at the properties of a B-Tree. A B-Tree is a B-Tree of "order" X if the following if true: 1) The ROOT has at least two children, unless the tree is empty, in which case the ROOT node has no children at all. 2) Each node has a maximum of X children, but each node IS AT LEAST half full (this does not apply to the ROOT or the LEAVES). This means that all interior nodes have at least X/2 children, but no more than X children. For example, if a B-Tree is of order 6, then all interior nodes will have at least 3 children, but no more than 6. 3) Each path in the tree from the ROOT to a particular LEAF is the same length. Notice that this discussion keeps mentioning "child nodes." This is quite common in B-Trees. Nodes can have children. They also have parent nodes. You can carry this even farther with "grandparent" and "grandchild" nodes. All nodes (with the exception of the ROOT) have a parent. All nodes (with the exception of the LEAVES) have children. All nodes are linked together so you can traverse the tree. In a typical B-Tree, only the leaves actually point to data; the rest of the nodes point to other nodes. Perhaps you are wondering what the difference between a B-Tree and a B+ Tree is. The main difference is that a B+ Tree has links to the left and right between the leaves. Thus, the previous diagram would change a little: ROOT ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ³ ³ NODE NODE ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ³ ³ ³ ³ ³ ³ LEAF ÄÄÄÄÄ LEAF ÄÄÄÄÄÄ LEAF LEAF ÄÄÄÄÄÄ LEAF ÄÄÄÄÄÄ LEAF If you want to add a key to a tree, first find our where in the tree your new key should go; then insert the key into the proper place in the tree. However this procedure is not as simple as it sounds. Remember, B+ Trees have specific rules regarding their layout. You can't have more than "X" keys per node, where "X" is the order of the tree. Thus, if adding a key to a node causes it to overflow, you must then "split" the node into two parts. This means, for example, that adding a key to a leaf mode might cause the node to overflow, thus making you split the leaf into two new leaves. This split of the leaves will cause a new entry to be make into the parent of the leaf. If this causes the parent to overflow, you must then split the parent. You can see that this could propagate up the entire tree. Indeed, some additions will cause splits all the way to the root node, sometimes causing a new root to be created. This "cascading" effect can also happen during deletions. For example, if you delete a key in a node and this causes the node to have less than X/2 entries in it, you then try to combine the node with less than X/2 entries with some neighbor node that has X/2 entries. Let's say you do this and you have to combine two neighboring leaves. When you combine the leaves, you will remove an entry from the parent node of the leaves, since the parent il have pointer to both leaves. Again, let's say that this causes the parent node to have less than X/2 keys, thus causing the parent node to be combined with some other parent node. Like additions, deletions also can result in reorganization of the entire tree. But it is this reorganization that keeps the tree" balanced." Clarion Index/Key Files That's enough theory. Let's take a look at Clarion key files and see what they look like. Clarion key files consist of a header and a sequence of zero or more nodes. Empty key files contain only a header. Once you add a key, nodes will be added to the key file. The key file itself tells you nothing about the fields that make up the keys. For that, you need to look at the data file. The key file is read and written in blocks of 512 bytes. These 512-byte units are referred to as "nodes," except when talking about the header. If, for some reason, a node is not full, extra bytes are set to 0. First, the header. Here is how the header is laid out: struct { unsigned long root; /* number of root node */ unsigned long numkent; /* number of key entries */ unsigned long numnode; /* number of node for this index */ unsigned long lastnod; /* node number of last node */ unsigned long keyeof; /* record number of end of file */ unsigned long keybof; /* record number of beg. of file */ unsigned long unused; /* first unused node of file */ unsigned char keytyp; /* type of key */ unsigned char keynode; /* number of keys per node */ unsigned keylen; /* total length of key entry */ unsigned numlvls; /* number of levels */ char cvoid[477]; /* reserved space */ }; The "cvoid" variable is used to fill the header out to a length of 512 bytes. The only variable in the header that is not a true numeric value is the "keytyp" variable. It is a bitmap that tells you information about the key itself. It is defined as follows: struct { unsigned char keycnt; /* number of keys in this nodes */ unsigned long flink; /* forward node pointer */ unsigned long blink; /* backwards node printer */ unsigned long ulink; /* upwards node pointer */ }; After the node header comes the pointer/key combinations themselves. They look like this: struct { unsigned long relrec; /* record number/node number */ char key[]; /* key value (key size - size of long)*/ }; The number of pointer/key combinations in a node is determined when the key file is initially created. You can't have less than 2 pointer key entries per node; thus, if you have 512 bytes available and yo u allocate 13 bytes for a node header AND you have to give up 4 bytes for the pointer (relrec) in each pointer/key entry, that leaves you with a maximum key size of 245 bytes [(512 - 13 (header) - 8 (size of 2 longs)) / 2]. This is also the method that determines how may pointer/key combinations to put in a node. Reading a key file is fairly simple. Once you have determined from the header how many pointer/key entries are in each node and how long each key is, you preform the following steps: 1) Start at the root node, which is located at offset "root" times 512. 2) Initialize a counter to keep track of what "level" you are at in the key file; set this counter to 1. 3) Look at the pointer/key structures until you find a key that is greater than or equal to the key you are looking for. 4) Once you have found this key, compare the "numlvls" variable to your counter. If they match, you are at the leaf level; the pointer is an actual record number in your data file; otherwise it is a node number. 5) Read the record if found in step 4. 6) If the pointer is an actual record number, you have found the record you need to look at in the data file. If not, go read the node the pointer refers to, increment your level counter, and go back to step 3. When your level counter is equal to "numlvls," you are at the leaf of the tree. If the key you are seeking is not at the leaf level, then your key is not in the key file. An example should help clear up any confusion. Let's look at a sample key file for a database of state abbreviations. The key is 2 characters long (STRING(2)) and is the standard 2 character state abbreviation. The data file contains the actual state name that corresponds to the 2 character abbreviation. To make this example use more than one node in a key file, some fictitious state abbreviations were added. These are the abbreviations AA, BB, CC, DD, etc. and 00, 11, 22, 33, 44, 55, 66, 77, 88, and 99. First, here is the header: OOOOO: 03 00 00 00 57 00 00 00 03 00 00 00 01 00 00 00 |....W...........| 00010: 4D 00 00 00 57 00 00 00 00 00 00 00 60 53 01 06 |M...W.......'S..| 00020: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| .......................................... .......................................... .......................................... OO1FO: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| Some of the header was omitted because it contains nothing but 00. Here is a breakdown of the fields in the header: (0000) root the "root" node of the tree (03 00 00 00); we have to reverse the order of the words in long values and also the order of bytes within words, due to the way Intel CPU's store data. Thus, this is the value 00 00 00 03. The root node is node number 3. (0004) numkent the number of key entries in this key file (57 00 00 00); this key file contains 57 keys, which just happens to be the number of records in the data file. This is only true if all keys must be unique. If duplicate keys are allowed, the number of keys may be different from the number of records. (0008) numnode the number of nodes in this key file (03 00 00 00); this key file contains a total of 3 nodes. (000C) lastnod the number of the last node in this key file (01 00 00 00); this is the "last" node in the file; however, the last node of the file may only contain part of the leaf level. (0010) keyeof the record number of the "last" record in the data file (4D 00 00 00); this is the record number that contains the key of "highest" value; in this case, it is record 4D hex (77 decimal), which contains the state abbreviation for state "ZZ." (0014) keybof the record number of the "first" record in the data file (57 00 00 00); this is the record number that contains the key of "lowers" value; in this case, it is record 57 hex (87 decimal), which contains the state abbreviation for state "00." (0018) unused the number of the first "unused" node in the key file (00 00 00 00); when a node is deleted, the deleted nodes are chained together so that all nodes in the key file will be in used before the key file grows in size. This key file has no deleted nodes, so this field has value of 0. (001C) keytyp the type of key (60); remember, this is a bit map, so we have to break it down according to the structure mentioned above: 60 hex = 0110 0000; the bit map is flipped, so the "0000" is the "ftyp" value. Since it is 0, it means that this is a key file; if this were an index file, this field would have a value of 1. The other 4 bits (0110) correspond directly to the 1 bit fields above: dupsw = 0, uprsw = 1, optsw = 1, loksw = 0. This means no duplicates, key is all upper case, key is filled with spaces, and this file is not locked. (001D) keynode the number of keys per node (53); this is the maximum number of pointer/key combinations that a node will hold. 53 hex is 83 decimal; 83 times 2 (the length of each key) + 83 times 4 (the length of the "relptr" pointer with each key) + 13 (the size of the node header) = 511. (001E) numcmps the number of components in this key (01); this key has only one component. (001F) keylen the total length of each key entry (06 00); each pointer/key combination takes up 6 bytes (2 for the key itself, 4 for the pointer). (0021) numlvls the number of levels in the key file (02 00); this key file has a total of 2 levels. The rest of the header is filled with 0. Next comes the nodes. First, node 1: 00200: 53 00 00 00 00 02 00 00 00 03 00 00 00 51 00 00 |S............Q..| 00210: 00 34 34 52 00 00 00 35 35 53 00 00 00 36 36 54 |.44R...55S...66T| 00220: 00 00 00 37 37 55 00 00 00 38 38 56 00 00 00 39 |...77U...88V...9| 00230: 39 34 00 00 00 41 41 19 00 00 00 41 4b 18 00 00 |94...AA....AK...| 00240: 00 41 4C 31 00 00 00 41 52 07 00 00 00 41 5A 35 |.AL1...AR....A25| 00250: 00 00 00 42 42 03 00 00 00 43 41 36 00 00 00 43 |...BB....CA6...C| 00260: 43 2F 00 00 00 43 4F 26 00 00 00 43 54 32 00 00 |C/...CO&...CT2..| 00270: 00 44 43 37 00 00 00 44 44 27 00 00 00 44 45 38 |.DC7...DD'...DE8| 00280: 00 00 00 45 45 39 00 00 00 46 46 01 00 00 00 46 |...EE9...FF....F| 00290: 4C 04 00 00 00 47 41 3A 00 00 00 47 47 3B 00 00 |L....GA:...GG;..| 002AO: 00 48 48 1A 00 00 00 48 49 2C 00 00 00 49 41 30 |.HH....HI,...IA0| 002B0: 00 00 00 49 44 3C 00 00 00 49 49 20 00 00 00 49 |...ID<...II ...I| 002C0: 4C 2B 00 00 00 49 4E 3D 00 00 00 4A 4A 3E 00 00 |L+...IN=...JJ>..| 002D0: 00 4B 4B 17 00 00 00 4B 53 29 00 00 00 4B 59 0F |.KK....KS)...KY.| 002E0: 00 00 00 4C 41 3F 00 00 00 4C 4C 1F 00 00 00 4D |...LA?...LL....M| 002F0: 41 1E 00 00 00 4D 44 12 00 00 00 4D 45 1C 00 00 |A....MD....ME...| 00300: 00 4D 49 40 00 00 00 4D 4D 21 00 00 00 4D 4E 23 |.MI@...MM!...MN#| 00310: 00 00 00 4D 4F 0E 00 00 00 4D 53 10 00 00 00 4D |...MO....MS....M| 00320: 54 05 00 00 00 4E 43 09 00 00 00 4E 44 33 00 00 |T....NC....ND3..| 00330: 00 4E 45 42 00 00 00 4E 48 22 00 00 00 4E 4A 15 |.NE$...NH"...NJ.| 00340: 00 00 00 4E 4D 41 00 00 00 4E 4E 1D 00 00 00 4E |...NMA...NN....N| 00350: 56 0C 00 00 00 4E 59 13 00 00 00 4F 48 2D 00 00 |V....NY....OH-..| 00360: 00 4F 4B 42 00 00 00 4F 4F 1B 00 00 00 4F 52 0B |.OKB...00....OR.| 00370: 00 00 00 50 41 43 00 00 00 50 50 44 00 00 00 51 |...PAC...PPD...Q| 00380: 51 25 00 00 00 52 49 45 00 00 00 52 52 06 00 00 |Q%...RIE...RR...| 00390: 00 53 43 0A 00 00 00 53 44 46 00 00 00 53 53 02 |.SC....SDF...SS.| 003A0: 00 00 00 54 4E 47 00 00 00 54 54 16 00 00 00 54 |...TNG...TT....T| 003B0: 58 2E 00 00 00 55 54 48 00 00 00 55 55 11 00 00 |X....UTH...UU...| 003C0: 00 56 41 0D 00 00 00 56 54 49 00 00 00 56 56 14 |.VA....VTI...VV.| 003D0: 00 00 00 57 41 2A 00 00 00 57 49 28 00 00 00 57 |...WA*...WI(...W| 003E0: 56 4A 00 00 00 57 57 08 00 00 00 57 59 4B 00 00 |VJ...WW....WYK..| 003FO: 00 58 58 4C 00 00 00 59 59 4D 00 00 00 5A 5A 00 |.XXL...YYM...ZZ.| Here is a breakdown of the header for node 1: (0200) keycnt the number of keys in this node (53); this is the total number of keys currently stored in this node. Hex 53 = 83 decimal. (0201) flink the forward link (00 00 00 00); this is the link to the next node on the same level in the tree. (0202) blink the backwards link (02 00 00 00); this is the link to the previous node on the same level of the tree; in this case, node 2 is on the same level of the tree. (0209) ulink the upwards link (03 00 00 00); this is the link to the parent node of this node; in this case, node 3 is the parent of node 1. Following this header are the pointer/key combinations. Here are the first few in the node: Offset Pointer Key Actual Value in Dump ====== ======= === ==================== 020D 51 44 (51 00 00 00) (34 34) 0213 52 55 (52 00 00 00) (35 35) 0219 53 66 (53 00 00 00) (36 36) 021F 54 77 (54 00 00 00) (37 37) Next is node 2: 00400: 04 01 00 00 00 00 00 00 00 03 00 00 00 57 00 00 |.............W..| 00410: 00 30 30 4E 00 00 00 31 31 4F 00 00 00 32 32 50 |.00N...11O...22P| 00420: 00 00 00 33 33 00 00 00 00 00 00 00 00 00 00 00 |...33...........| 00430: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| ........................................... ........................................... ........................................... 005F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| Here is a breakdown of node 2: (0400) keycnt (04) Node 2 only has 4 pointer/key entries. (0401) flink (01 00 00 00) Node 2 has a forward link to node 1. (0402) blink (00 00 00 00) Node 2 has no backward link. (0403) ulink (00 00 00 00) Node 2 has no upward link. Since node 2 only has 4 pointer/key entries, they are listed in their entirety here: Offset Pointer Key Actual Value in Dump ====== ======= === ==================== 040D 57 00 (57 00 00 00) (30 30) 0413 4E 11 (4E 00 00 00) (31 31) 0419 4F 22 (4F 00 00 00) (32 32) 041F 50 33 (50 00 00 00) (33 33) And finally, node 3: 00600: 02 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00 |................| 00610: 00 33 33 01 00 00 00 5A 5A 00 00 00 00 00 00 00 |.33....ZZ.......| 00620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00630: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00640: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00650: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00660: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| ........................................... ........................................... ........................................... 007F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| Here is a breakdown of node 3: (0600) keycnt (02) Node 3 only has 2 pointer/key entries. (0601) flink (00 00 00 00) Node 3 has no forward link. (0605) blink (00 00 00 00) Node 3 has no backward link. (0609) ulink (00 00 00 00) Node 3 has no upward link. Since node 3 only has 2 pointer/key entries, they are listed in their entirety here: Offset Pointer Key Actual Value in Dump ====== ======= === ==================== 060D 02 33 (02 00 00 00) (33 33) 0613 01 ZZ (01 00 00 00) (5A 5A) If you visualize this key file as a tree, then node 3 is the root node and nodes 1 and 2 are leaves. Since this key file is not very big, it has no interior nodes. So, how would we go about finding a key in this key file? Let's go through the steps to find the key "FL." 1) Read the header of the key file; this tells us that the key file has 2 levels, a maximum of 53 pointer/key structures per node, and each pointer/key structure takes up 6 bytes in a node. 2) Set our level counter to 1 and read the root node, which happens to be node 3. Node 3 has 2 pointer/key entries. 3) After skipping over the links in the node header, we come to the first pointer/key structure. It has a key value of "33," which is less than the key value we are looking for. Skip to the next pointer/key entry. 4) The next pointer/key structure has a key value of "ZZ," which is greater than the key we are looking for, so we follow the pointer associated with "ZZ." It has a pointer value of 01. 5) Since our level counter is 1 and we know that our key file has 2 levels, the pointer value is pointer to another node and not a data record. Go read node 1. 6) Increment our level counter to 2. Since the key file only has two levels, we know that we are now on the leaf level with node 1. 7) Start looking at pointer/key structures until we find a matching key; if we find no matching key, then the key does not exist in the key file. In this case, we do find it at offset 028B. It has a pointer value of 1, which is the record number of our data file that has the name of the state with the abbreviation "FL." This key file was fairly simple in nature, since the key is a string of only two characters. Let's take a look at the other data types and see how they get converted to key values. Long Long components are split into two "words," each 16 bits long. The order of these bytes are reversed; then the order of the words within the long are reversed. Example: If you have a long of the form AABBCCDD, the value is split into AABB CCDD; then the individual bytes are reversed, giving you BBAA DDCC. After that, you reverse the order of the words, giving you DDCCBBAA. Finally, toggle the high order bit. Short Short components are also split into two pieces and flipped; the high order bit is also toggled. AABB would turn into BBAA. String Strings are stored as-is, except they are converted to upper case if the key is not case sensitive. Picture Same as strings. Group Same as strings. Byte Stored unchanged. Decimal Decimal values are handled the following way: if the high order bit of the decimal value is turned on then flip all the bits in each part of the decimal else toggle the high order bit. In C, it would look something like this: /* comp[] is the decimal number */ if (comp[0] & 0x80) { for (x = 0; x < KEYDESC.elmlen; ++x) { comp[x] = ~comp[x]; } else { comp[0] |= 0x80; } Real Real values take up 8 bytes; whether or not the value is positive or negative, the order of the bytes is reversed. If the value of the last byte in the original real value is negative, then the high order bit of that byte is flipped. In C, it looks like this: char *cp1; /* the original real value */ char *sp2; /* the final real value */ cp1 = [address of real value to be converted]; cp1 += 7; /* start at the end of the number */ if (*cp1 < O) { /* is the end byte negative? */ for (x = 1; x <= 8; x++) { *cp2++ = ~*cp1--; } } else { *cp2++ = *cp1-- ^ 0x80 for (x = 2; x <= 8; x++) { *cp2++ = ~*cp1--; } }