WORD CODING TEXT COMPRESSION ALGORTHM by Jeff Connelly The word coding compression scheme is only for text files. Actually, it can work for binary files if they contain spaces at small intervals, but this is rarely the case. A dictionary of string variables is used. The maximum size can be up to FFFFh, but you might want to make it smaller for speed and memory reasons. The dictionary does not need to be initalized. Compression A space-delimited word is read from the file. Then is the dictionary is searched for the the word just read. If it is in the dictionary, the location (0 to FFFF) is emited to the output file as 16-bit unsigned integer. If the word is not in the dictionary, it is added to the first free space, and the place where it was just added is written to the output stream. This whole paragraph is repeated until the end of file is reached. The dictionary is stored at the end of the file, beginning with two nulls. Each word is seperated by a single null. Decompression Seek to the end of the file. Read each null-terminated word from the file and store it in the free space in the dictionary, stop when a double-null is encountered. Go back to the beginning of the file, and read a 16-bit unsigned integer from the input stream. Look it up in the dictionary, and write the value to the output stream. Repeat until the EOD code (0000) is read. Example Input : 'Hello Hello Hello Hello Hello' Output: 00 01 00 01 00 01 00 01 00 01 00 00 'Hello' Codes (two bytes per word) EOD Code: 00 00 _ Ratio : 17 / 30 = 56.6% To see if you understand, figure out how this output data was made. The answer is below. * Compression. * Source data: 'Hello Hello Hello Hello Hello' * Read a word from the input file: 'Hello'. It is not in the dictionary so add it as code 0001, and output the 0001 code. * (Repeated 4 times because four words left and I don't want to repeat typing) * Read a word, it is 'Hello'. * It is in the dictionary as code 0001, so output the code 0001. * The output file is now: 0001 0001 0001 0001 0001. * Write the end-of-dictionary (EOD) indicator: 0000. * Write the dictionary, it just contains 'Hello'. * Output file is: 0001 0001 0001 0001 0001 (Codes) 0000 (EOD) 'Hello' * Decompression. * Source: 0001 0001 0001 0001 0001 0000 'Hello' * Seek to the end of the file. * Read a null-terminated word: 'Hello', and add it to the dictionary as 0001. * Read another byte. It is 00, so there is a double-null, the EOD. * The end of the dictionary was found, so seek back to the beginning. * (Repeated 5 times because I don't want to type it five times) * Read 16 bits, and look up the value. * The value in the dictionary is 'Hello', so write that to the output. * Output file: 'Hello Hello Hello Hello Hello' If the input sample is 1000 10-byte words, the length of the input file would be (1000 * 10) + 10 == 10010. Using word coding, the dictionary would be 10 + 1 + 1 == 12 (words, word terminator, end of dictionary terminator). The coded values would be (1000 * 2) == 2000. The total file size, dictionary plus coded values, would be 12 + 2000 == 2012. Therefore the ratio would be 2012 / 10010 == 20.0% I choose FFFF for the max. number of elements in the dictionary because in decimal it is 65,635. That seems like enough. Word Coding (WC) is a good compression method for text-only files. Expansion Idea by Thomas Christensen the wordcoding is pretty usefull for textfiles, however a single byte can change the ratio a bit my simple idea is to add a singele byte as the first byte of all, this byte tell if we should use 1,2 3 or more byte to descripe a word, there is no point to use 2 byte dictionaries in the exampel above, hoever 2 byte also can be small in manuals e.g mulitlanguage manuals, there we can set to use 3 byte, when a single byte allow use to set we should use betweeen 1 or 255 byte dictionaries code i think 255 is more than realy. Hoever the EOD is always 2 bytes Example (standart 2 byte) Input : 'Hello Hello Hello Hello Hello' Output: 02 00 01 00 01 00 01 00 01 00 01 00 00 'Hello' Codes (two bytes per word) EOD Code: 00 00 _ Ratio : 18 / 30 = 60.0% Example (expansion 1 byte) Input : 'Hello Hello Hello Hello Hello' Output: 01 01 01 01 01 01 00 00 'Hello' Codes (one bytes per word) EOD Code: 00 00 _ Ratio : 13 / 30 = 43.4%