This document was last updated on Sunday, December 13, 1998

Introduction

Every effort has been made to document LZH header formats as well as changes made for features not yet implemented. Corrections, additions and suggestions are always welcomed. Header fields in Italics are currently under development for Huffman Compression Engine only, and should be ignored (skipped) if not supported, as should any extended header. If you are a developer of compression utilities in .lzh file formats, please feel free to jump in and help.

Table of Contents

Introduction. 1

LZH format. 3

level-0..... 3

level-1, level-2..... 3

Level 0 header structure.. 4

Level 1 header structure.. 5

Level 2 header structure.. 6

Extended headers.... 7

Handling of extended headers.... 9

Method ID............. 10

Variances 11

Huffman Compression Engine II....... 11

Generic time stamp.... 11

OS ID... 12

Links to other LHA utilities and compression references:. 13

0x39 Multi-disk field.... 14

SPAN_COMPLETE. 14

SPAN_MORE......... 14

SPAN_LAST........... 15

size_of_run............. 15

span_mode:............. 15

Termination............. 18

LZH format

Byte Order: Little-endian

There are 3 types of LZH headers, level-0, level-1, and level-2.

All .lzh and .lha files are null terminated. The last byte of the file should be a 0, but if it's not, it will be interpreted as null terminated. The reason for this termination is that it implies that the next header-size is zero. Huffman Compression Engine adds the null but doesn't need it.

level-0

LZH header

Compressed file

LZH header

Compressed file

LZH header of size 0 (1 byte null)

level-1, level-2

LZH header

Extension header(s)

Compressed file

In all cases, read the first 21 bytes of the header. After determining the header type, you will then have to handle the header as needed or suggested.

Level 0 header structure

Offset	Size in bytes	Description
0	1	Size of archived file header (h)
1	1	1 byte Header checksum
2	5	Method ID
7	4	Compressed file size Refered to as C in subsequent fields. As there are no extended headers in level 0 format archive headers, this value represents the size of the file data only.
11	4	Uncompressed file size
15	4	Original file date/time (Generic time stamp)
19	1	File or directory attribute
20	1	Level identifier (0x00) Programmers should only read the first 21 bytes of the header before taking further action
21	1	Length of filename in bytes (Refered to as N in subsequent fields
22	N	Path and Filename
22+N	2	16 bit CRC of the uncompressed file
24+N	C	Compressed file data

Level 1 header structure

Offset	Size in bytes	Description
0	1	Size of archived file header (h)
1	1	1 byte Header checksum
2	5	Method ID
7	4	Compressed file size Refered to as C in subsequent fields Note: Compressed size includes the size of all Extended headers for the file.
11	4	Uncompressed file size
15	4	Original file date/time (Generic time stamp)
19	1	File or directory attribute
20	1	Level identifier (0x00) Programmers should only read the first 21 bytes of the header before taking further action
21	1	Length of filename in bytes (Refered to as N in subsequent fields
22	N	Path and Filename
22+N	2	16 bit CRC of the uncompressed file
24+N	1	Operating System identifier. See OS ID chart
25+N	2	Next Header size
27+N	3 or more bytes	Extended headers ( Note: Extended headers are optional, and have no preset maximum. The first byte of an extended header identifies the type of header, and the last 2 bytes of the header identify whether or not more headers are defined. Huffman Compression Engine will use the extended header filename if both exist in level 1 archives.
	C	Compressed file data

Level 2 header structure

Offset	Size in bytes	Description
1	2	Total size of headers, including Extended headers for this entry. This field is unimportant as long as the extended headers are looped appropriately. For compatibility with other archivers however, a variable should be assigned to add up the size of each extended header.
2	5	Method ID
7	4	Compressed file size Referred to as C in subsequent fields This value excludes the size of all Extended headers and only refers to the actual compressed data. This is an improvement, and can be problematic if not handled properly.
11	4	Uncompressed file size
15	4	Original file date/time (Generic time stamp)
19	1	File or directory attribute. Not supported in all compression utilities.
20	1	Level identifier (0x00) Programmers should only read the first 21 bytes of the header before taking further action
21	2	16 bit CRC of the uncompressed file
23	1	Operating System identifier. See OS ID chart
24	2	Next Header size
26	3 or more bytes	Extended headers ( Note: Extended headers are not optional, and have no preset maximum. Minimum compatibility should include a type1 extended header. The first byte of an extended header identifies the type of header, and the last 2 bytes of the header identify whether or not more headers are defined. Huffman Compression Engine will use the extended header filename if both exist in level 1 archives. Your loop for reading of these headers should include offset 24 for the first assignment and loop until extended header size = 0.
	C	Compressed file data

Extended headers

Unspecified size fields are intentionally left blank.

ID	Size	Description
0x00	2	CRC-16 of header and an optional information byte.
0x01		Filename
0x02		Directory name
0x39		0x39 Multi-disk field
0x3f		Uncompressed file comment.
0x48 0x4f		Reserved for Authenticity verification




0x5?	2 1	UNIX related information. Optional information byte
0x50	2	UNIX file permission
0x51	2 2	Group ID UserID
0x52		Group Name
0x53		User Name
0x54		Last modified time in UNIX time

0xcn		Under development: Compressed file comment. Method -lhn- is assumed Compressed comment cannot exceed 64K in size. Applicable range for Huffman Compression Engine (4..8)
0xdx 0xff		Under development: Operating system specific header info. These fields may have different meanings for different platforms. If the file was not created on the same platform as your own these signatures should be ignored.
0xd1		Under development Autodelete after autorun

Handling of extended headers.

Proper procedure for handling of extended headers can be summed up in virtual code:

1 Read the first 21 bytes of the real header to determine the size of the first extended header. If the

2 Use the first extended header in the loop that reads subsequent headers

3 Assign this value to a variable which is used inside the extended header loop

4 Repeat

5 Allocate enough memory for the header

6 Read the header into the array

7 Assign headersize to a word variable at the last 2 bytes of the array

8 Goto 3

9 Handle compressed data if it exists.

Code has not been supplied intentionally. It is expected that the programmer reading this document has enough knowledge of programming to perform the task of writing real code.

Method ID

Signature	Description
-lh0-	No compression
-lzs-	2k sliding dictionary(max 17 bytes)
-lz4-	No compression
-lh1-	4k sliding dictionary(max 60 bytes) + dynamic Huffman + fixed encoding of position
-lh2-	8k sliding dictionary(max 256 bytes) + dynamic Huffman (Obsoleted)
-lh3-	8k sliding dictionary(max 256 bytes) + static Huffman (Obsoleted) This method is not supported by Huffman Compression Engine
-lh4-	4k sliding dictionary(max 256 bytes) + static Huffman + improved encoding of position and trees
-lh5-	8k sliding dictionary + static Huffman
-lh6-	32k sliding dictionary + static Huffman
-lh7- -lh8-	64k sliding dictionary + static Huffman Lh8 has yet to be discovered except in my own utility. It existed from 0.21d to 0.21M and was actually -lh7- per Harihiko's review of Yoshi's notes.
"-lhd-"	Directory (no compressed data). This signature may not contain any information at all. In some cases it only is used to signify that there are extended headers with important information. In level 0 archives it most likely contains the directory for the next header's file, but in level 1 and 2 headers, it most likely contains nothing except for possibly the size of the first extended header.

Huffman Compression Engine II dynamically sets the method for compression based on the file size. Reasoning: It makes little sense to use a 64KB buffer if the file is 1K. The next section covers variances, which basically define what to do in known cases where the file signature does not match the dictionary size.

Variances

Huffman Compression Engine II

Versions after 0.21M now dynamically size the dictionary buffer according to the size of the file to be compressed. It is not uncommon to have signatures of -lh4- thru -lh7-. A future implementation will include -lh9- through -lhc- and -lhe-, which represent dictionary sizes of 128KB through 2MB. -lhd- is skipped due to the fact that it is a reserved signature. During 0.21, a misunderstanding about the method identifier lead me to think that a 64KB buffer was -lh8-. Although logic dictated that it should be, 16KB dictionaries were bypassed entirely, which lead to this confusion. In order to simplify the problem, decode should use 64KB for -lh6- through -lh8-.

Generic time stamp

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16

|<-------- year ------->|<- month ->|<-- day -->|

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

|<--- hour --->|<---- minute --->|<- second/2 ->|

Offset Length Contents

0 8 bits year years since 1980

8 4 bits month [1..12]

12 4 bits day [1..31]

16 5 bits hour [0..23]

21 6 bits minite [0..59]

27 5 bits second/2 [0..29]

OS ID

'ID	Platform
M	MS-Dos
'2'	OS/2
'9'	OS9
'K'	OS/68K
'3'	OS/386
'H'	HUMAN
'U'	UNIX
'C'	CP/M
'F'	FLEX
'm'	Mac
'w'	Windows 95, 98
'W'	Windows NT
'R'	Runser

Links to other LHA utilities and compression references:

The following hyperlinks are visible in word97. If you cannot see these links, you may also jump on the Internet to http://fidonews.webworldinc.com/lzhformat.htm to see them in your favorite browser.

LHA World by Dr. Haruyasu Yoshizaki

LHA Page

Dolphin's Home Page The author, Tsugio Okamoto, maintains "Lha for Unix."

If you are interested in how LHA works, its source code is a very good place to start.

This site includes useful info like LHA header specs, in Japanese.

Lha for UNIX patch by Yoshioka Tsuneo.

MacLHA for Macintosh systems.

Network Mahjong International (LHA in Java)

Micco's HomePage (UNLHA32.DLL UNARJ32.DLL LHMelt)

Haruhiko Okumura's Compression Pointers. Haruhiko Okumura is the original designer of the lha compression algorithm.

Compatibility with the above links can only be guessed at this point, as few lha style compressors support anything above -lh5-. However, if your interest is in maintaining compatibility with these other platforms, -K5 added to the command line during compression will force compatibility with these compression utilities.

0x39 Multi-disk field

The following is a planned implementation. The original can be found at:

http://www.creative.net/~aeco/jp/lzhspc01.html

struct MDF {

byte span_mode;

long beginning_offset;

long size_of_run;

}

span_mode: This identifies the mode of this segment of file. The values are:

#define SPAN_COMPLETE 0

#define SPAN_MORE 1

#define SPAN_LAST 2

SPAN_COMPLETE.

This specifies that the information following this header contains a complete (optionally compressed) file. This is often unused because MDF is not needed in these cases. In an unsplit file, the header information and format should follow the standard LZH format.

SPAN_MORE.

This specifies that the information following this header is incomplete. The uncompressor needs to concatenate this segment with

information from the following volume. It should continue to do that until it sees a volume with a header information that contains span_mode

SPAN_LAST

This specifies that the information following this header is the last segment of the (optionally compressed) file.

beginning_offset:

This value specifies the location in bytes of where this segment (run) of information will fit into.

size_of_run

This is the size of this segment of information.

span_mode:

This identifies the mode of this segment of file. The values are:

#define SPAN_COMPLETE 0

#define SPAN_MORE 1

#define SPAN_LAST 2

SPAN_COMPLETE. This specifies that the information following this header contains a complete (optionally compressed) file. This is often unused

because MDF is not needed in these cases. In an unsplit file, the header information and format should follow the standard LZH format.

SPAN_MORE. This specifies that the information following this header is incomplete. The uncompressor needs to concatenate this segment with

information from the following volume. It should continue to do that until it sees a volume with a header information that contains span_mode

SPAN_LAST.

SPAN_LAST. This specifies that the information following this header is the last segment of the (optionally compressed) file.

beginning_offset: This value specifies the location in bytes of where this segment (run) of information will fit into.

size_of_run: This is the size of this segment of information.

The illustration below contain two volumes with two compressed files, one of them split between the two volumes. "File 1" is compressed and fits

within the first volume. "File 2" is a file 100 bytes long compressed to 90 bytes. The first 50 bytes of which resides on the first volume and the

last 40 bytes on the next.

Volume 1

+--------------+

| +----------+ |

| |LZH header| | <- MDF not needed

| +----------+ | <- header unchanged from non-spanned versions of LZH

| | File 1 | |

| | | |

| +----------+ |

| |

| +----------+ | <- span_mode = SPAN_MORE

| |LZH header| | <- beginning_offset = 0

| +----------+ | <- size_of_run = 50

| | File 2 | |

| | split | |

| | | |

+--------------+

Volume 2

+--------------+

| +----------+ | <- span_mode = SPAN_LAST

| |LZH header| | <- beginning_offset = 50

| +----------+ | <- size_of_run = 40

| | File 2 | |

| | | |

| +----------+ |

| | [0] | | <- end of volumes, a byte with value zero (0)

| +----------+ |

+--------------+

Termination

-----------

In addition to the above changes, the compressor must before closing the file after writing the last volume, write a null byte at the end of the file. This byte serves to inform the decompressor that this is the last volume and no other comes after it.

This end of volume byte is needed to tell the decompressor when to stop prompting for the next volume. This termination byte is optional, as the decompressor may also stop when it has completed a file, i.e., see SPAN_LAST or just a regular file with

MDF. However if the end of this completed file coincides with an end of volume, there would be not way for the compressor detect that the following

volumes and prompt for them.

This termination byte is a way around this potential bug.

Note that this null byte coincides with header_size in the LZH header.

Pseudo code for a sample implementation

---------------------------------------

boolean spanned = false;

while(file_available()) {

compress file();

if(size_of_compressed_file > available_space()) {

while(size_of_compressed_file) {

size_to_write = available_space();

construct_header_with_MDF(size_to_write);

write_header();

write_data(size_to_write);

size_of_compressed_file -= size_to_write;

prompt_for_next_volume();

spanned = true;

}

else {

construct_header_without_MDF(size_of_compressed_file);

write_header();

write_data(size_of_compressed_file);

}

if(spanned)

write_null_byte();

Author’s address:

Aeco Systems

826 28th Avenue

San Francisco, CA 94121

Phone: (415) 221-7806

EMail: aeco@creative.net