Version 1.2 -
Nov-23-1998 - Last edit: Jan-07-2001
Abstract
OpenTag™ is
a format to encode data (mostly text) extracted from an original
file of any format. Its purpose is to allow the extraction of a
document, processing the text in a standard common format, and
then, if needed, merging the text back into its original format.
The terms ELEMENT,
ATTRIBUTE, VALUE, TAG and CONTENT used in this document and its
collaterals are meant in the sense in which they are used in an XML/SGML
context. Here are some examples:
Element with content:
<ELEM1 ATTR="value">content</ELEM1>
|      |     |      |      |
|      |     |      |      +-----> End tag of the element ELEM1
|      |     |      +------------> Content of the element ELEM1
|      |     +-------------------> Value of the attribute ATTR
|      +-------------------------> Attribute ATTR
+--------------------------------> Start tag of the element ELEM1
Element without content (empty element):
<ELEM2 ATTR="value" />
|      |     |      |
|      |     |      +------------> Closing marker of the empty element ELEM2
|      |     +-------------------> Value of the attribute ATTR
|      +-------------------------> Attribute ATTR
+--------------------------------> Opening marker of the element ELEM2
An OpenTag file can be
encoded as a 7-bit ASCII file, an 8-bit file or a 16-bit UTF-16
file. The XML encoding declaration must be specified if the file
is in an encoding other than UTF-8 or UTF-16.
Special care should be
taken when processing text in a multi-byte code set. For
paragraphs where the ws attribute allows the text to be wrapped,
additional line-breaks and spaces must not split multi-byte characters.
When encoded in 16-bit,
the first two bytes of the file must be the Unicode Byte-Order-Mark
character (0xFEFF).
Like any XML document, OpenTag files use
numeric character references (NCRs) to specify the characters that do not exist in the
encoding used. A numeric character reference can be either in hexadecimal or
decimal notation. The hexadecimal notation is &#xHHHH; where HHHH is the
hexadecimal value of the Unicode code point for the given character. The decimal
notation is &#DDDD; where DDDD is the decimal value of the Unicode code
point for the given character.
Example:
<p id="1">Lowercase "a grave" = &#x00E0; = &#224;</p>
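As a minimal sketch of how a tool might produce such references, the following Python relies on the standard "xmlcharrefreplace" error handler, which emits the decimal form. The helper names to_ncr and hex_ncr are hypothetical and not part of OpenTag.

```python
def to_ncr(text: str, encoding: str = "ascii") -> str:
    """Replace every character missing from `encoding` with a decimal NCR."""
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)


def hex_ncr(ch: str) -> str:
    """Return the hexadecimal NCR for a single character."""
    return f"&#x{ord(ch):04X};"


print(to_ncr('Lowercase "a grave" = \u00e0'))  # Lowercase "a grave" = &#224;
print(hex_ncr("\u00e0"))                       # &#x00E0;
```

When the target encoding already contains the character (for example ISO 8859-1 for "a grave"), to_ncr leaves it untouched.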
Several ASCII
characters must also be coded as entities to avoid confusion
with OpenTag markers:
The character < (ASCII 0x3C) should be coded "&lt;"
(or &#60; or &#x3C;).
The character & (ASCII 0x26) should be coded "&amp;"
(or &#38; or &#x26;).
The character " (ASCII 0x22)
should be coded "&quot;" (or &#34; or &#x22;) in
attribute values enclosed between double-quotes.
The character ' (ASCII 0x27)
should be coded "&apos;" (or &#39; or &#x27;) in
attribute values enclosed between single-quotes.
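The escaping rules above can be sketched with Python's standard xml.sax.saxutils.escape function. The three wrapper names are hypothetical. Note that escape() also rewrites ">" as "&gt;", which is harmless but goes beyond what the rules above require.

```python
from xml.sax.saxutils import escape


def escape_content(text: str) -> str:
    """Escape element content: & and < (escape() also rewrites >)."""
    return escape(text)


def escape_double_quoted(value: str) -> str:
    """Escape a value that will be enclosed between double-quotes."""
    return escape(value, {'"': "&quot;"})


def escape_single_quoted(value: str) -> str:
    """Escape a value that will be enclosed between single-quotes."""
    return escape(value, {"'": "&apos;"})


print(escape_content("R&D <draft>"))     # R&amp;D &lt;draft&gt;
print(escape_double_quoted('say "hi"'))  # say &quot;hi&quot;
```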
In OpenTag, the
attributes of all structural and informative elements and
delimiter elements can be inherited. If an element does not
specify some of its attributes, the values for those attributes
are taken from the closest parent element that specifies them.
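One possible way for a tool to resolve an inherited attribute is to walk up from the element until an ancestor specifies it. The sketch below uses Python's xml.etree.ElementTree; the function name effective_attr and the sample document are illustrative assumptions, not part of the specification.

```python
import xml.etree.ElementTree as ET


def effective_attr(root, target, name):
    """Return `name` for `target`, falling back to the closest ancestor that sets it."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        if name in node.attrib:
            return node.attrib[name]
        node = parent_of.get(node)
    return None


doc = ET.fromstring(
    '<opentag><file lc="EN-US"><grp><p id="1">Hello</p>'
    '<p id="2" lc="FR-FR">Bonjour</p></grp></file></opentag>'
)
first, second = doc.iter("p")
print(effective_attr(doc, first, "lc"))   # EN-US (inherited from <file>)
print(effective_attr(doc, second, "lc"))  # FR-FR (set on the element itself)
```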
After the XML
processing instruction comes the OpenTag document itself,
enclosed within the <opentag> element. An OpenTag document is composed
of zero, one or more sections, each enclosed within a <file> element.
The XML prologue is
mandatory. It sets the defaults for the encoding of the file. If
the encoding declaration is omitted, the file is assumed to be
either in UCS-2 or UTF-8. The first character of the file must be
the Unicode Byte-Order-Mark if the file is in UCS-2.
<?xml version="1.0" encoding="iso-8859-1" ?>
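A reader can check for the Byte-Order-Mark before parsing. The following Python is a rough sketch: the function name sniff_encoding is hypothetical, and it only distinguishes the 16-bit case from the UTF-8 fallback. An explicit encoding declaration, when present, would still have to be honoured by the parser.

```python
import codecs


def sniff_encoding(head: bytes) -> str:
    """Guess the encoding family of an OpenTag file from its first bytes."""
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No Byte-Order-Mark: assume UTF-8 unless the XML declaration says otherwise.
    return "utf-8"


sample = codecs.BOM_UTF16_BE + "<?xml".encode("utf-16-be")
print(sniff_encoding(sample))  # utf-16-be
```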
An ideal minimum
OpenTag document will look something like this:
OpenTag elements can be
divided into three main categories: the structural and
informative elements, the in-line elements and the delimiter
elements. Attributes are shared among them.
The structural elements
specify the frame of an OpenTag document as well as contextual
and processing information. The <p> element contains the extracted
data and, possibly, in-line elements.
<opentag>
OpenTag document - The
<opentag> element encloses all the other elements
of the document.
Here the <prop> element is
used to define a tool-specific property called "WordCount".
You could also use it to specify attached files, project
information, translation memory data, machine translation
processing data, etc.
The tool attribute identifies which tool has generated
the document so each property can be identified even if
two tools use the same property identifiers.
To define tool-specific data at the tag level, you can
use the ts attribute.
<grp>
Group - The <grp>
element specifies a set of elements that should be
processed together. For example: all the items of a menu,
several translations of the same paragraph, etc. A list
of preferred values for the type
attribute in <grp> is available. Note: A <grp> element can
contain other <grp> elements.
Here the <grp> element is
used to group together several <p> elements belonging to
the same dialog box.
<grp> could also be used to group several language
versions of the same <p> element.
<p>
Paragraph - The <p>
element is used to delimit a unit of text. A paragraph in
OpenTag does not necessarily correspond to a "paragraph"
in a word-processor. It's simply a unit of text that
could be a paragraph, a title, a menu item, a caption,
etc. A list of preferred values for the type
attribute in <p> is available.
Here the <csdef> element is
used to declare a user-defined code set called "Latin1Cirth"
which uses the ISO Latin-1 code set as a base (all code-points
not specified in the <csdef> are the same as ISO
8859-1).
<map/>
Character mapping - The
<map/> element specifies the correspondence between
a Unicode value and a code-point of a native code set.
Here the <map> elements
define two user-defined characters. You must use Unicode
values that are within the range of the Private Use Area
(from U+E000 to U+F8FF). See the Unicode Standard 2.0
book, section 6.2 at page 6-119, for more information on
the Private Use Area.
<note>
Note - The <note>
element is used to add document-related comments to the
OpenTag document. XML comments ("<!-- ... -->")
are allowed but are not necessarily kept by processing
tools.
<grp id="4567">
<note>This paragraph must always be in uppercase</note>
<p lc="EN"><g id="1">WARNING:</g> YOU MUST
SETUP YOUR WORKING DIRECTORY BEFORE RUNNING
THE CONFIG tool.</p>
</grp>
The <note> element can be
used to document the extracted text, to provide
information between the different users that deal with
the file, etc.
Tools must keep <note> elements when they process
an OpenTag document. You can link a note to other
elements with the rid attribute.
The in-line elements
are the elements that can appear inside the core structural
element <p>.
<g>
Generic group place-holder
- The <g> element is used to replace any in-line
code of the original document that has a beginning and an
end and can be moved within its parent structural element.
When possible, the type attribute allows you to specify what kind
of code the place-holder represents. A list of preferred
values for the type attribute in <g> is available. Note: A <g> element can contain
another <g> element. In this case, if the embedded
group has an id attribute, it should never be
moved outside of its parent group.
Mandatory attributes:
None. But a <g> element
should at least have an id or type attribute to make sense.
<p><x id="1"/>Text with generic code place-holder.</p>
<ix/>
Index marker - The <ix/>
element specifies a reference to an index entry. The
definition of the entry itself is done in the
corresponding <ixd> element (both are
linked by their rid attribute, for which they have
the same value).
<ixd>
Index definition - The
<ixd> element is used to specify the entry
corresponding to one or more <ix/> elements. It does not
have to be in the same <p> or even the same <grp> element. Markers and
definitions do have to be in the same <file> element.
Here the <ixd> element is used to
define a simple index entry. It defines the text for
all <ix/> markers that also have
the rid attribute set to "INDEX2".
<tx>
Index entry text - The <tx>
element is used to delimit the text of an index entry
level.
Here the <so> element is used to
specify the "reading order" of an entry, so that
the symbol is sorted according to its pronunciation.
<rf/>
Reference marker - The
<rf/> element specifies a reference to any type of
reference text (variable, pre-composed text, footnote,
etc.). The definition of the reference text itself is
done in one or more corresponding <p> elements (linked by
their rid attribute, for which they have
the same value).
<p id="1" rid="1" type="fn">Elephant: Big animal.</p>
<p id="2">The happy elephant<rf rid="1" type="fn"/>.</p>
<grp rid="2">
<p type="alt" id="3">Click here to go to Description</p>
<p type="link" id="4">http://www.xyz.com/desc.htm</p>
</grp>
<p id="5">See <g id="1" rid="2">Description</g>.</p>
The <rf/> element can be
used to reference anything. It simply marks the position
where the text should go. The link between reference
definition and marker is done with the rid attribute.
For example, here the first paragraph is the definition of a
footnote whose marker is located in the second paragraph. Note
that in some cases the reference can be composed of
several <p> elements, as for the <g> element of the fifth
paragraph.
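The link between markers and definitions can be resolved by indexing definitions on their rid value. The Python sketch below covers only the simple case where the definition is a <p> element carrying the rid itself. The function name resolve_references is hypothetical, and the grouped case (a <grp> carrying the rid) is left out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def resolve_references(file_elem):
    """Map the rid of each <rf/> marker to the text of the <p> elements sharing it."""
    definitions = defaultdict(list)
    for p in file_elem.iter("p"):
        if "rid" in p.attrib:
            definitions[p.get("rid")].append("".join(p.itertext()))
    return {
        rf.get("rid"): definitions.get(rf.get("rid"), [])
        for rf in file_elem.iter("rf")
    }


sample = ET.fromstring(
    '<file><p id="1" rid="1" type="fn">Elephant: Big animal.</p>'
    '<p id="2">The happy elephant<rf rid="1" type="fn"/>.</p></file>'
)
print(resolve_references(sample))  # {'1': ['Elephant: Big animal.']}
```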
<ocs>
Original code set - The
<ocs> element is used to indicate the code set of a
part of the text that is different from the default code
set. Note that <ocs> is only informative; in the
OpenTag file the text within an <ocs> element is in
the same code set as the surrounding text.
<p><ocs cs="Symbol">✔</ocs> First
item of the list</p>
Here the <ocs> element
allows you to specify that the first character of the
paragraph is a check mark symbol and should be coded in a
code set different from the rest of the text when merged
back.
Remember that <ocs> does not specify a change of
code set in the OpenTag file itself.
<ct>
Conditional text - The
<ct> element is used to mark specific strings of
the text for a given condition.
OpenTag defines
additional elements to support various types of text processing.
These elements are usually not generated by the extraction module
and are ignored most of the time during merging, but they can be
very powerful with tools such as Machine Translation, glossary
handling, quality assurance, etc.
<s>
Segment - The <s>
element indicates a unit of text such as a sentence,
title, menu item, message, etc. The <s> element is
not part of the tags used to merge the OpenTag file back
into its original format. A list of preferred
values for the type attribute in <s> is available.
<p id="1"><s seg="1">Click OK. </s><s
seg="2">Save the file.</s></p>
The <s> element separates
segments within a paragraph. When the <p> element contains only a
single segment you can avoid using <s> and simply
use the seg attribute.
<mrk>
Marker - The <mrk>
element delimits a section of text that has special
meaning, such as a terminological unit, a proper name, an
item that should not be modified, etc. It can be used for
various processing tasks. For example, it can indicate to a
Machine Translation tool which proper names should not be
translated, support terminology verification, or mark suspect
expressions after grammar checking. The <mrk>
element is usually not generated by the extraction tool
and it is not part of the tags used to merge the OpenTag
file back into its original format. A list of preferred values for the type
attribute in <mrk> is available.
Here the creation tool is
identified as "XYZ v1.0". Usually you want your
tool signature to indicate the version as well as the
tool.
The tool attribute allows you to know how you should
process tool-specific data such as <prop> elements and ts attributes.
datatype
Data type - The datatype
attribute specifies the kind of text contained in the
element. Depending on that type, you may apply different
processes to the data.
Here the date attribute specifies 25
November 1997 at 06:12:00 (6:12 am and zero seconds).
lc
Locale - The lc attribute
specifies the locale of the text of a given element.
Value description:
A 2-letter code corresponding to
one of the language identifiers defined in ISO-639, or a 2+2-letter code
where the first 2 letters are one of the language
identifiers defined in ISO-639 followed by a dash and
one of the country/region identifiers defined in ISO-3166.
Note: The reserved xml:lang attribute defined in XML does
not correspond to OpenTag's definition of a locale, and
its scope rules are not appropriate for attributes in the
OpenTag case. Therefore OpenTag does not use it to
indicate locale/language. However the lc attribute uses
values that are very similar to the values used for xml:lang.
An OpenTag document can contain
multi-lingual data: The lc attribute is used to tag each
specific locale.
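A tool can check the shape of an lc value before using it. This Python sketch validates only the 2-letter or 2+2-letter form described above and does not look the codes up in ISO-639 or ISO-3166. The function name is hypothetical, and accepting both upper and lower case is an assumption based on the values used in this document.

```python
import re

# Two letters (language), optionally followed by a dash and two letters (country/region).
LC_PATTERN = re.compile(r"[A-Za-z]{2}(-[A-Za-z]{2})?")


def is_wellformed_lc(value: str) -> bool:
    """Return True if `value` has the shape of an OpenTag locale."""
    return LC_PATTERN.fullmatch(value) is not None


print(is_wellformed_lc("EN"))     # True
print(is_wellformed_lc("FR-FR"))  # True
print(is_wellformed_lc("en_US"))  # False
```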
cs
Code set - The cs
attribute specifies the code set of the text for a given
element. When the encoding of the file is UCS-2 or ISO-646
the cs attribute is only informative.
Value description:
One of the code set identifiers
defined by the IANA, or a user-defined code
set name declared in a <csdef> element. A sub-set
of the preferred values is available in this document.
<p lc="EN">Text in English</p>
<p lc="cs"><ocs cs="iso-8859-2">Text in Czech</ocs></p>
The text within an <ocs> element is in the same
code set as the rest of the file, but the cs attribute
indicates what was the original code set in the source
document.
name
Name - The name attribute
specifies the user-defined code set name of a <csdef> element.
This example shows how the name
attribute is used to identify a <csdef> element.
The name value can contain any characters; however, white
space characters are not recommended.
type
Type - The type attribute
specifies the context and the type of resource or style
of the data of a given element. For example, it can indicate
whether the data is a label or a menu item in the case of resource-type
data, or give the style in the case of document-related data.
The id attribute can be extracted
from the original file, or generated automatically.
rid
Reference identifier - The
rid attribute is used to link different elements that are
related. For example, it can link a reference to its definition, or
link paragraphs that belong to the same group.
Value description:
Alpha-numeric. It is recommended
not to use spaces.
In this example the attribute rid
links a reference marker with its definition later in the
file.
cond
Condition - The cond
attribute is used to identify an element corresponding to
conditional text in the original format.
You can use the <ct> element to set a
condition for a sub-set of text.
<p id="3" seg="4">Single segment in a
paragraph.</p>
The seg attribute can be used
directly in a paragraph if the paragraph contains only a
single segment. You can also mark segments this way in
each level of an index definition.
ts
Tool-specific data - The
ts attribute allows you to include short data understood
by a specific toolset.
You can also use the <prop> element to define large
properties at the element level.
Here the ts attribute is used to
specify the origin of a leveraged translation.
coord
Coordinates - The coord
attribute specifies the x, y, cx and cy coordinates of
the text for a given <p> or <grp> element. The cx and cy
values must represent the width and the height (like in a
Windows resource file). The extraction and merging tools
must make the right corrections for the original format
that uses a top-left/bottom-right coordinate system.
Value description:
Four decimal (possibly negative)
values, in the order: x,y,cx and cy, separated by semi-colons.
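As an illustration, a coord value can be parsed and converted to the top-left/bottom-right form that some original formats use. The Python below is a sketch: the names Coord, parse_coord and to_corners are hypothetical, and treating the four values as integers is an assumption.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    x: int
    y: int
    cx: int  # width
    cy: int  # height


def parse_coord(value: str) -> Coord:
    """Parse 'x;y;cx;cy' into a Coord; values may be negative."""
    parts = value.split(";")
    if len(parts) != 4:
        raise ValueError(f"expected 4 values, got {len(parts)}: {value!r}")
    return Coord(*(int(part) for part in parts))


def to_corners(c: Coord) -> tuple:
    """Convert to (left, top, right, bottom) for corner-based formats."""
    return (c.x, c.y, c.x + c.cx, c.y + c.cy)


box = parse_coord("8;72;54;10")
print(box)              # Coord(x=8, y=72, cx=54, cy=10)
print(to_corners(box))  # (8, 72, 62, 82)
```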
font
Font - The font attribute
specifies the font name and font size of the text for a
given <p> or <grp> element. The font
attribute would generally be used for resource-type data;
a change of font in document-type data can be marked with
the <g> element.
Value description:
Name of the font and its size
separated by a semi-colon.
Font attribute in a file
extracted from Windows resources. The font information
could be used by resizing tools, to verify maximum length
of a translation, etc.
ws
White spaces - The ws
attribute specifies how white spaces (ASCII spaces, tabs
and line-breaks) should be treated.
Value description:
Its value must be:
- 0 if any consecutive white spaces are reduced to one
space ( ).
- 1 if all white spaces must be preserved (i.e. like in a
<PRE> element in HTML format).
- 2 like case 0, but excluding tab from white-spaces.
<p ws="0">Text
with 4 spaces </p>
<p ws="1">Text with 4 spaces </p>
The white spaces in the first
paragraph will be reduced to one space each. In the
second paragraph each white space should be preserved by
the tools. In this case the end result will be the same
for both paragraphs.
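The three ws values can be sketched as a small normalization function. This is one reading of the rules above, with ws="2" interpreted as collapsing runs of spaces and line-breaks while leaving tabs in place. The function name apply_ws is hypothetical.

```python
import re


def apply_ws(text: str, ws: str) -> str:
    """Normalize white spaces in `text` according to the ws attribute value."""
    if ws == "1":
        return text  # preserve everything
    if ws == "0":
        return re.sub(r"[ \t\r\n]+", " ", text)  # spaces, tabs, line-breaks
    if ws == "2":
        return re.sub(r"[ \r\n]+", " ", text)  # like 0, but tabs are kept
    raise ValueError(f"unknown ws value: {ws!r}")


print(repr(apply_ws("Text\n  with 4   spaces ", "0")))  # 'Text with 4 spaces '
print(repr(apply_ws("a  \t  b", "2")))                  # 'a \t b'
```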
original
Original file - The
original attribute specifies the name of the original
file from which the content of a <file> element has been
extracted.
The original attribute could be
used by tools to locate the various files needed when
merging back the OpenTag document.
reference
Reference
file - The reference attribute specifies the name of the reference
file that should be used to merge back the content of a <file> element into its original format.
The reference attribute could be
used by tools to locate the various files needed when
merging back the OpenTag document in case they are not the same as the
original file. The reference file is often called the
"skeleton" file.
base
Base code set - The base
attribute specifies the code set upon which the re-mapping
of characters defined by a given <csdef> element is based.
ucode
Unicode code - The ucode
attribute specifies the Unicode code-point of a given <map/> element.
Value description:
Its value must be a valid Unicode
code-point value, either in character reference format or
in text. In the latter case, the text must be the decimal
value of the code-point.
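A tool reading such a value has to accept both notations. The Python sketch below also accepts a single literal character, on the assumption that an XML parser has already resolved the character reference. The names parse_code_point and is_private_use are hypothetical, and the Private Use Area bounds are the U+E000 to U+F8FF range quoted earlier in this document.

```python
import re

_NCR = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def parse_code_point(value: str) -> int:
    """Return the code-point given as an NCR, decimal text, or a resolved character."""
    match = _NCR.fullmatch(value)
    if match:
        body = match.group(1)
        return int(body[1:], 16) if body.startswith("x") else int(body)
    if re.fullmatch(r"[0-9]+", value):
        return int(value)  # plain text is read as a decimal value
    if len(value) == 1:
        return ord(value)  # reference already resolved by the parser
    raise ValueError(f"not a code-point value: {value!r}")


def is_private_use(code_point: int) -> bool:
    """True if the code-point lies in the Private Use Area (U+E000 to U+F8FF)."""
    return 0xE000 <= code_point <= 0xF8FF


print(parse_code_point("57558"))                  # 57558
print(hex(parse_code_point("&#xE0D6;")))          # 0xe0d6
print(is_private_use(parse_code_point("57558")))  # True
```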
subst
Substitution text - The
subst attribute specifies the text to substitute for a
character of a given <map/> element, when it does
not exist in the target code set.
comp
Composition codes - The
comp attribute specifies the possible base Unicode
characters used to compose the character of a given <map/> element.
Value description:
Its value must be a list of two,
three, four or five Unicode values (including user-defined
characters). Each value is separated by a semi-colon and
should be either in character reference format or in text.
In the latter case, the text must be the decimal value of
the Unicode code-point.
For example, in Vietnamese the
letter a circumflex can have an additional acute tone
mark. Some fonts may need to have a direct mapping to
this combination.
case
Case - The case attribute
specifies the opposite-case character for the one given in a <map/> element
(e.g. 'A' is the case change for 'a').
Value description:
Its value must be a valid Unicode
code-point value (including user-defined characters),
either character reference format or text format. In the
latter case, the text must be the decimal value of the
code-point.
In this example, the var
attribute is used to specify an additional proposition
for the translation, here coming from a Machine
Translation product. It could be used for marking other
automatic translations (TM, glossary leveraging, etc.),
verified text, even keeping a history list.
This section lists the
recommended values for some of the attributes. Values for these
attributes are not case sensitive. These lists are purely
informative; the goal is to specify a preferred syntax so that tools
can have some level of compatibility.
Values
for the type attribute of the <grp>, <p>, <s> and <rf/> elements. This list is not
exhaustive.
shortcut (Windows accelerators, shortcuts in resource or property files)
button (button in UI)
caption (title in UI, caption in documentation, alternate text, etc.)
checkbox (check box in UI)
cell (text in a table cell)
dialog (dialog box in UI)
file (filename, path)
footer (footer text)
font (font name)
frame (frame or window, or any generic group of components)
header (header text)
heading (title or header-type segment)
keywords (list of keywords, enumeration within a paragraph, etc.)
label (static text, label in UI, etc.)
listitem (paragraph in a list, entry in a list box, etc.)
menu (menu)
menuitem (entry in a UI menu)
message (prompt, error or warning message)
radio (radio button in UI)
string (generic text from source code, string table, etc.)
var (variable)
fn (footnote)
Values for the type attribute of the <mrk> element. This list is not
exhaustive.
abbrev (abbreviation, acronym, etc.)
datetime (date or time information)
name (proper or common name)
phrase (sub-sentence level)
protected (text that should remain untouched during the process)
term (one or more words of a terminology entry)
Values for the type attribute of the <g> element. This list is not
exhaustive.
bold (bold or strong text)
font (text with font size, font face, color changes, etc.)
italic (italicized text)
link (hypertext)
underlined (underlined text)
Values for the type attribute of the <x/> element. This list is not
exhaustive.
pb (paragraph break)
lb (forced line break)
Values for the cs and base attributes. This list is not exhaustive,
but gives you examples from which you can guess additional names.
The indentations
only illustrate the hierarchy of the elements;
they are not required.
The BOLD
elements and attributes are mandatory.
ITALICS
indicates elements and attributes that can be specified
zero or one time.
NORMAL typeface is
used for the elements and attributes that can be
specified zero, one or more times.
UNDERLINED
typeface indicates the actual text and non-structural
codes (the data).
Sample:
<?xml version="1.0" encoding="iso-8859-1" ?>
<opentag version="1.2"
xmlns="urn:OpenTag:Version12">
<!-- First file, from a Java property file. It contains several locales. -->
<file lc="EN-US"
tool="Java_OTF:1.01-004:Java"
datatype="Java"
original="//brazil/recife/devile/data/app.pro"
reference="//brazil/recife/devile/data/extract/app.skl">
<grp rid="id_DLG_STATUS" type="label">
<grp id="IDC_ACTIVITY" coord="8;72;54;10">
<p lc="EN-US">&amp;Activity</p>
<!-- Tool-specific data, e.g. in this case leverage information -->
<p lc="FR-FR" ts="100%,Gandalf3.tm">&amp;Activit&#x00e9;</p>
</grp>
</grp>
<!-- Example of a note generated by a filter -->
<note>Extraction word count = 1</note>
</file>
<!-- Second file, this time from RTF. It contains only the text of the source language. -->
<file lc="EN-US"
tool="Borneo 1.00-017"
datatype="RTF"
original="//brazil/recife/devile/data/help.rtf">
<!-- Definition for two user-defined characters. -->
<csdef name="Latin1Cirth" base="ISO-8859-1">
<note>For more information about the Cirth see
the Web page http://www.indigo.ie/egt/standards/csur/cirth.html
</note>
<map code="130" ucode="" ent="noldorian_o"/>
<map code="S" ucode="57558" ent="noldorian_oo"/>
</csdef>
<p id="1">This is a text in <g type="bold">bold</g>
and <g id="1">all caps red</g></p>
<p id="2">Second paragraph with graphic <x id="1"/>.</p>
</file></opentag>