Reverse engineering the Quark Xpress file format
by Frans Faase
In the period from February 2001 until May 2002, I
spent many hours reverse
engineering the Quark Xpress binary file format as
used by Quark Xpress,
a widely used DTP program.
I have decided to make my results publicly
available in the form of a source distribution under
the GNU General Public License.
I would be very happy if any additional
discoveries about the Quark Xpress file formats that are
made with the use of this program, are also made public under
the GNU General Public License.
Although the program can read all the files I needed to read,
it is by no means complete, and could well crash on any other
file. The biggest limitation is that the program can only read
files produced by some earlier Mac versions of Quark Xpress.
Files saved by the Windows version use a different byte order
for the integers. (On February 22,
2001, I already released a very first version of
the program, which was able to read
some Windows files.)
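The byte-order difference mentioned above can be illustrated with a small sketch. The helper names below are hypothetical and do not occur in the source distribution. The sketch assumes that Mac files store integers with the most significant byte first and Windows files with the least significant byte first; that assumption is mine and should be checked against real files.

```cpp
#include <cstdint>

// Hypothetical helpers, not taken from the source distribution.

// 16-bit value stored most significant byte first (assumed Mac layout).
inline std::uint16_t read_u16_be(const unsigned char* p)
{
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// 16-bit value stored least significant byte first (assumed Windows layout).
inline std::uint16_t read_u16_le(const unsigned char* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// 32-bit value, most significant byte first.
inline std::uint32_t read_u32_be(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
            static_cast<std::uint32_t>(p[3]);
}

// 32-bit value, least significant byte first.
inline std::uint32_t read_u32_le(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}
```

A scanner that has to handle both kinds of files could select one of the two sets of helpers once, after inspecting the file header, instead of testing the byte order at every read.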
At the moment I have only very limited time available
for supporting anyone continuing the reverse engineering
of the Quark Xpress formats. Please do not ask me any
questions about the code, because if you are not able to
read the code as it has been provided, you very likely will
not be able to reverse engineer the binary file format any
further. (Read: Requirements.)
If you want to continue working on the Windows file formats,
please read the last section on the page.
For professional conversions of Quark Xpress to
XML, I point to the following resources:
The story in eighteen parts
Below are the eighteen entries in my online diary in which I describe my progress.
- February 17, 2001
- February 19, 2001
- February 20, 2001
- February 21, 2001
- February 22, 2001
- April 7, 2001
- April 16, 2001
- April 21, 2001
- April 24, 2001
- April 26, 2001
- April 29, 2001
- May 1, 2001
- May 4, 2001
- May 13, 2001
- May 17, 2001
- May 22, 2001
- April 16, 2002
- April 26, 2002
Description of the format
To describe a binary format is difficult, because the description needs to
be exact, concise, and easy to read. Grammars (such as
BBF) could be of some help. A good
way for describing a binary format is to provide a program
that can parse the format. So far, I haven't had time to
write down a well documented description of the format. The
only documentation is thus the program that is provided here.
Please read the above story to get some ideas about the general
structure that Quark Xpress uses. For a detailed description
study the file scanQXDoc.cpp starting with the
function scan_file. I have tried to write the
scan_* functions in such a way that they represent
the "grammar". All these functions operate on CReadBuf objects,
which represent a buffer with a given length containing a
part of the data from the file. I have made use of a set
of macro defines (in capitals and starting with an underscore
character) for the various elements in the grammar. Some
arguments of these macros are only useful for producing
readable output. Below is a short description of those
macros that read or process data:
- _SUB_BUFFER and _SAFE_SUB_BUFFER:
create sub-buffer (second argument) from a given buffer
(first argument) and a given length (third argument)
known by a certain name (fourth argument).
- _SKIPBYTE: skips a byte.
- _SKIPWORD: skips a word (two bytes).
- _SKIPLWORD: skips a long word (four bytes).
- _SKIPBYTES: skips a number of bytes.
- _SKIPBYTES_S: same, but with printing.
- _BYTE: reads a byte into an already defined variable.
- _WORD: likewise for a word.
- _LWORD: likewise for a long word.
- _VARBYTE: reads a byte into a newly defined variable.
- _VARWORD: likewise for a word.
- _VARLWORD: likewise for a long word.
- _VARPASCALSTRING: reads a Pascal-like string,
where the length of the string is specified by the first
byte.
- _VARPASCAL2STRING: likewise, but next data
starts at an even number of bytes from the first character.
- _VARPASCALFIXSTRING: likewise, but extended to
a fixed length.
- _EXPECTBYTE: expects a byte with the given value (first argument).
- _EXPECTWORD: likewise for a word.
- _EXPECTLWORD: likewise for a long word.
- _CALL and _CALL_IC: for calling another
scanning function.
- _DONE: checks if the given buffer has been
read till the end.
The purpose of the rest of the macros is just for formatting
the output in case an error was detected.
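To give an impression of the operations behind the macros listed above, here is a minimal sketch of a read buffer. It is not the CReadBuf class from the distribution: the class name SketchReadBuf and all of its method names are hypothetical, integers are assumed to be stored most significant byte first, and the padding rule in read_pascal2_string is my reading of the description of _VARPASCAL2STRING (one pad byte after a string of odd length).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a read buffer; NOT the CReadBuf class itself.
class SketchReadBuf {
public:
    SketchReadBuf(const unsigned char* data, std::size_t len)
        : data_(data), len_(len), pos_(0) {}

    // Like _SUB_BUFFER: carve out the next `len` bytes as a buffer of
    // their own and advance past them in this buffer.
    SketchReadBuf sub_buffer(std::size_t len)
    {
        need(len);
        SketchReadBuf sub(data_ + pos_, len);
        pos_ += len;
        return sub;
    }

    // Like _SKIPBYTES.
    void skip(std::size_t n) { need(n); pos_ += n; }

    // Like _VARBYTE, _VARWORD and _VARLWORD (big-endian assumed).
    std::uint8_t read_byte() { need(1); return data_[pos_++]; }

    std::uint16_t read_word()
    {
        need(2);
        std::uint16_t v = static_cast<std::uint16_t>(
            (data_[pos_] << 8) | data_[pos_ + 1]);
        pos_ += 2;
        return v;
    }

    std::uint32_t read_lword()
    {
        std::uint32_t hi = read_word();
        std::uint32_t lo = read_word();
        return (hi << 16) | lo;
    }

    // Like _VARPASCALSTRING: one length byte, then that many characters.
    std::string read_pascal_string()
    {
        std::size_t n = read_byte();
        need(n);
        std::string s(reinterpret_cast<const char*>(data_ + pos_), n);
        pos_ += n;
        return s;
    }

    // Like _VARPASCAL2STRING (assumed rule): skip one pad byte after a
    // string of odd length, so the next data starts at an even offset
    // from the first character.
    std::string read_pascal2_string()
    {
        std::string s = read_pascal_string();
        if (s.size() % 2 != 0) skip(1);
        return s;
    }

    // Like _EXPECTBYTE: fail if the next byte has another value.
    void expect_byte(std::uint8_t expected)
    {
        if (read_byte() != expected)
            throw std::runtime_error("unexpected byte value");
    }

    // Like _DONE: true when the buffer has been read to the end.
    bool done() const { return pos_ == len_; }

private:
    // Throws when fewer than `n` bytes remain.
    void need(std::size_t n) const
    {
        if (n > len_ - pos_)
            throw std::runtime_error("read past end of buffer");
    }

    const unsigned char* data_;
    std::size_t len_;
    std::size_t pos_;
};
```

A scanning function would create a sub-buffer for each record, read its fields in order, and finish by checking done() on the sub-buffer, so that a wrong guess about the record layout shows up as leftover or missing bytes.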
You can download the sources in a single zip file from
here. The sources compile with
the Cygnus gcc compiler (version 2.95.2) in the Cygnus
unix under Windows environment. (Compilation problems
can occur with newer versions of gcc.)
To build the program, simply compile the file
scan.cpp as it includes all the other
sources.
Please note that the files CQXDoc.cpp
and CDatabase.cpp are generated by the
cls2cpp program
from the files CQXDoc.cls and
CDatabase.cls. Please do not
edit these .cpp files, but generate
them from the .cls files. You could use
the following shell script for building the
program:
#!/bin/sh
make cls2cpp
cls2cpp CDatabase
cls2cpp CQXDoc
gcc -g -Wall scan.cpp -o scan.exe
Of course, you could also write a small makefile
to do the job. I did not bother, as it would only save
the half second it takes to run the program each time.
Below, a short description of the files found in the source distribution
is given.
The file scan.cpp
The main file in the source distribution is the file
scan.cpp. This file includes all the other
files. No header files have been used. With current-day
computers, it is often much faster to simply include all
the sources in a single file than to compile all the
C++ files into separate object files and link
them together. Also for larger projects, where most of
the time is spent on reading a large number of include files,
this could be a much faster approach than the traditional
way of compiling and linking.
The file stddef.c
Just a collection of handy functions and macros that I often
use in my C/C++ programs.
The files CBuf.cpp and CReadBuf.cpp
These files implement a number of classes to read data
from a buffer. The class CBuf implements the buffer, and
the classes CReadBuf and CReadButWithBlocks implement
procedures to read various kinds of values from a CBuf
buffer.
The files MMFile.cpp and MMFileDummy.cpp
The file MMFile.cpp implements a
persistent store
(database) making use of a Memory Mapped File.
The file MMFileDummy.cpp implements a replacement
for MMFile.cpp which is not persistent. The scan.cpp
provided in the distribution uses the non-persistent implementation.
If you want to use the persistent implementation, you might
want to change the filename used in the open method,
and increase the size of the store. The program may crash in case
of an overflow.
The files CQXDoc.cls (and CQXDoc.cpp)
This defines the classes for storing the logical structure
of a Quark Xpress document, including many of its style
definitions.
It also contains the class CTextAccessor, which is
an accessor to formatted text from a text fragment with all
its formatting instructions. For an example of how to use it,
see the file DumpQXDoc.cpp.
It also contains the class CTextOnFramesAccessor,
which could be used to walk over the whole text of a book.
There are no examples of its use given in the code distribution,
but you should be able to figure out how to use it by yourself.
It also contains some elementary parsing methods.
The files CDatabase.cls (and CDatabase.cpp)
This defines a few classes for organizing some Quark Xpress
files into books and maintaining a collection of books.
The file scanQXDoc.cpp
This contains the actual scanner. It makes some heavy
use of some tricky defines. The idea is that the code
describes the grammar, but in case of an error, the
parsing jumps back to a certain point and repeats the
parsing, but now with dumping information. This makes
it easier to figure out what went wrong. The system
does not always work perfectly.
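The idea of parsing silently first and repeating the parse with dumping after an error can be sketched without macros as follows. This is a simplified illustration and not the code from scanQXDoc.cpp: all names are hypothetical, and the sketch always restarts from the beginning of the data, whereas the real scanner jumps back to a certain point.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical scanner state; the real scanner uses macros instead.
struct SketchScanState {
    const std::vector<unsigned char>& data;
    std::size_t pos = 0;
    bool dumping = false;   // true only during the second, diagnostic pass
    std::string log;        // filled only while dumping

    explicit SketchScanState(const std::vector<unsigned char>& d) : data(d) {}
};

// Read one byte and require it to have the given value. While dumping,
// record what was actually found.
inline bool sketch_expect_byte(SketchScanState& st, unsigned char expected,
                               const char* name)
{
    if (st.pos >= st.data.size()) {
        if (st.dumping) st.log += std::string(name) + ": end of data\n";
        return false;
    }
    unsigned char got = st.data[st.pos++];
    if (st.dumping)
        st.log += std::string(name) + ": " +
                  std::to_string(static_cast<int>(got)) +
                  (got == expected ? "" : " (unexpected)") + "\n";
    return got == expected;
}

// Example "grammar rule": a record that must start with bytes 0x01 0x02.
inline bool sketch_scan_header(SketchScanState& st)
{
    return sketch_expect_byte(st, 0x01, "tag")
        && sketch_expect_byte(st, 0x02, "version");
}

// Run `rule` silently. If it fails, start again from position 0 with
// dumping enabled. Returns the dump, which is empty on success.
inline std::string sketch_scan_with_retry(
    const std::vector<unsigned char>& data,
    const std::function<bool(SketchScanState&)>& rule)
{
    SketchScanState quiet(data);
    if (rule(quiet)) return std::string();

    SketchScanState verbose(data);
    verbose.dumping = true;
    rule(verbose);          // outcome is already known; only the log matters
    return verbose.log;
}
```

The advantage of this arrangement is that correct files are scanned without producing any output, while a failing file yields a trace of exactly the fields that were read up to the point of failure.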
The file FrameGeom.cpp
This file contains some code for determining the natural
reading order of the frames. It also deals with nested
frames. The algorithm used is probably not perfect, but it
served my purpose well. After the main routine has been
called, all frames found in the documents of a "book" are
linked through first_frame_reading_order and
next_reading_order.
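To show what linking frames in reading order amounts to, here is a deliberately naive sketch. It is not the algorithm from FrameGeom.cpp: it simply orders frames by page, then by top edge, then by left edge, and it ignores nested frames and columns. The struct SketchFrame and the function name are hypothetical; only the member name next_reading_order is borrowed from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical frame record with just enough geometry for the sketch.
struct SketchFrame {
    int page;
    double top;
    double left;
    SketchFrame* next_reading_order;
};

// Link the frames in a naive reading order (page, then top, then left)
// and return the first frame, or nullptr when there are no frames.
inline SketchFrame* sketch_link_reading_order(std::vector<SketchFrame>& frames)
{
    std::vector<SketchFrame*> order;
    order.reserve(frames.size());
    for (SketchFrame& f : frames) order.push_back(&f);

    std::stable_sort(order.begin(), order.end(),
        [](const SketchFrame* a, const SketchFrame* b) {
            if (a->page != b->page) return a->page < b->page;
            if (a->top != b->top) return a->top < b->top;
            return a->left < b->left;
        });

    // Chain every frame to its successor; the last one gets nullptr.
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i]->next_reading_order =
            (i + 1 < order.size()) ? order[i + 1] : nullptr;

    return order.empty() ? nullptr : order.front();
}
```

A real page layout with multiple columns needs more than this, because a frame lower in the left column should usually be read before a frame higher in the right column.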
The file DumpQXDoc.cpp
This file contains some routines to dump the information
to a file as either plain text or HTML, but it could be modified
to dump it to any format you want. This is more an example
than a working piece of code. A lot of intelligence is
in the class CTextAccessor from the file
CQXDoc.cls.
For those who want to continue the work on reverse engineering
the Windows file formats, I hereby also give access to the
latest version of the program which can
read some files produced by Quark Xpress 4.1 for Windows. I have
not been able to date the version. It is definitely later than
May 22, 2001. I think it is from
earlier this year, as it makes use of an early implementation
of the class CBuf.
Actually, this version was produced on October 12, 2002, when I
made some last modifications to make it generate an XML file that
can be viewed with IE!
I am not very proud of this program, because the code contains
a lot of rubbish. Please do not look at it, if you are not an
expert programmer. At some points it might even cause more
confusion than it resolves. (I am afraid it does contain some
amount of dead code.) When run, it produces a lot of debugging
output on stdout. I usually redirect this to a file.
A file with the extension .xml will be generated if
the program does not crash, which I am afraid is very likely
if you feed it an arbitrary Quark Xpress 4.1 document.
If you want to contribute to the reverse engineering of the Quark
Xpress file formats, do not develop this program further, but rather
make modifications to the latest source base.
I am not willing to publish any modifications to the qq.cpp
program. You may do so yourself, of course.
Only extracting text
Based on the above sources, sed
developed a program
which simply extracts the raw texts from a Quark Xpress file
for Mac versions 3.3 and 4.0. The texts are extracted in the order in
which they occur in the file, which is not very likely to match the
order in which they occur in the document.