WORD CODING TEXT COMPRESSION ALGORTHM
by Jeff Connelly

The word coding compression scheme is only for text files.  Actually, it can
work for binary files if they contain spaces at small intervals, but this
is rarely the case.

A dictionary of string variables is used.  The maximum size can be up to
FFFFh, but you might want to make it smaller for speed and memory reasons.
The dictionary does not need to be initalized.

Compression
A space-delimited word is read from the file.  Then is the dictionary is
searched for the the word just read.  If it is in the dictionary, the
location (0 to FFFF) is emited to the output file as 16-bit unsigned integer.
If the word is not in the dictionary, it is added to the first free space,
and the place where it was just added is written to the output stream.  This
whole paragraph is repeated until the end of file is reached.

The dictionary is stored at the end of the file, beginning with two nulls.
Each word is seperated by a single null.

Decompression
Seek to the end of the file.  Read each null-terminated word from the file
and store it in the free space in the dictionary, stop when a double-null is
encountered.  Go back to the beginning of the file, and read a 16-bit
unsigned integer from the input stream.  Look it up in the dictionary, and
write the value to the output stream.  Repeat until the EOD code (0000) is
read.

Example
Input : 'Hello Hello Hello Hello Hello'
Output:  00 01 00 01 00 01 00 01 00 01  00 00 'Hello'
          Codes (two bytes per word)     EOD   Code: 00 00
                     _
Ratio : 17 / 30 = 56.6%
To see if you understand, figure out how this output data was made.  The
answer is below.
* Compression.
* Source data: 'Hello Hello Hello Hello Hello'
* Read a word from the input file: 'Hello'.  It is not in the dictionary so
  add it as code 0001, and output the 0001 code.
* (Repeated 4 times because four words left and I don't want to repeat typing)
  * Read a word, it is 'Hello'.
  * It is in the dictionary as code 0001, so output the code 0001.
* The output file is now: 0001 0001 0001 0001 0001.
* Write the end-of-dictionary (EOD) indicator: 0000.
* Write the dictionary, it just contains 'Hello'.
* Output file is: 0001 0001 0001 0001 0001 (Codes) 0000 (EOD) 'Hello'
* Decompression.
* Source: 0001 0001 0001 0001 0001 0000 'Hello'
* Seek to the end of the file.
* Read a null-terminated word: 'Hello', and add it to the dictionary as 0001.
* Read another byte.  It is 00, so there is a double-null, the EOD.
* The end of the dictionary was found, so seek back to the beginning.
* (Repeated 5 times because I don't want to type it five times)
  * Read 16 bits, and look up the value.
  * The value in the dictionary is 'Hello', so write that to the output.
* Output file: 'Hello Hello Hello Hello Hello'

If the input sample is 1000 10-byte words, the length of the input file
would be (1000 * 10) + 10 == 10010.  Using word coding, the dictionary
would be 10 + 1 + 1 == 12 (words, word terminator, end of dictionary
terminator).  The coded values would be (1000 * 2) == 2000.  The total file
size, dictionary plus coded values, would be 12 + 2000 == 2012.  Therefore
the ratio would be 2012 / 10010 == 20.0%

I choose FFFF for the max. number of elements in the dictionary because
in decimal it is 65,635.  That seems like enough.

Word Coding (WC) is a good compression method for text-only files.


Expansion Idea by Thomas Christensen

the wordcoding is pretty usefull for textfiles, however a single byte can change the ratio a bit
my simple idea is to add a singele byte as the first byte of all, this byte tell if we should use 1,2 3 or more byte to 
descripe a word, there is no point to use 2 byte dictionaries in the exampel above, hoever 2 byte also can be small in manuals
e.g mulitlanguage manuals, there we can set to use 3 byte, when a single byte allow use to set we should use betweeen 1 or 255 byte 
dictionaries code i think 255 is more than realy.

Hoever the EOD is always 2 bytes


Example (standart 2 byte)
Input : 'Hello Hello Hello Hello Hello'
Output:  02 00 01 00 01 00 01 00 01 00 01  00 00 'Hello'
          Codes (two bytes per word)     EOD   Code: 00 00
                     _
Ratio : 18 / 30 = 60.0%




Example (expansion  1 byte)
Input : 'Hello Hello Hello Hello Hello'
Output:  01 01 01 01 01 01 00 00 'Hello'
          Codes (one bytes per word)     EOD   Code: 00 00
                     _
Ratio : 13 / 30 = 43.4%