OpenTag

Format Specifications

Version 1.2 - Nov-23-1998 - Last edit: Jan-07-2001


Abstract

OpenTag™ is a format for encoding data (mostly text) extracted from an original file of any format. Its purpose is to allow a document's text to be extracted, processed in a standard common format, and then, if needed, merged back into the original format.


1. Overview

1.1. Markup

OpenTag is XML compliant. You can find the latest XML specifications at http://www.w3.org/TR/REC-xml.

The terms ELEMENT, ATTRIBUTE, VALUE, TAG and CONTENT are used in this document and its collateral documents in the sense they have in an XML/SGML context. Here are some examples:

Element with content:
<ELEM1 ATTR="value">content</ELEM1>
  |     |     |      |       |
  |     |     |      |       +-----> End tag of the element ELEM1
  |     |     |      +-------------> Content of the element ELEM1
  |     |     +--------------------> Value of the attribute ATTR
  |     +--------------------------> Attribute ATTR
  +--------------------------------> Start tag of the element ELEM1

Element without content (empty element):
<ELEM2 ATTR="value" />
  |     |     |     |
  |     |     |     +--------------> Closing marker of the empty element ELEM2
  |     |     +--------------------> Value of the attribute ATTR
  |     +--------------------------> Attribute ATTR
  +--------------------------------> Opening marker of the element ELEM2

1.2. File Encoding

An OpenTag file can be encoded as a 7-bit ASCII file, an 8-bit file or a 16-bit UTF-16 file. The XML encoding declaration must be specified if the file is in an encoding other than UTF-8 or UTF-16.

Special care should be taken when processing text in a multi-byte code set. For paragraphs where the ws attribute allows the text to be wrapped, additional line-breaks and spaces must not split a multi-byte character.

When the file is encoded in 16-bit, its first two bytes must be the Unicode Byte-Order-Mark character (U+FEFF).
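As a rough illustration of the byte-order-mark rule, the following Python sketch checks whether a buffer starts with the two bytes that encode U+FEFF in either byte order. The function name has_utf16_bom is hypothetical and not part of OpenTag; the two constants come from the Python standard library codecs module.

```python
import codecs


def has_utf16_bom(data: bytes) -> bool:
    """Return True if data starts with a UTF-16 byte order mark (U+FEFF)."""
    # U+FEFF is stored as FE FF (big-endian) or FF FE (little-endian).
    return data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
```

A reader could call this on the first bytes of a file before deciding how to decode the rest.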

1.3. Extended and Escaped Characters

Like any XML document, OpenTag files use numeric character references (NCRs) to specify the characters that do not exist in the encoding used. A numeric character reference can be either in hexadecimal or decimal notation. The hexadecimal notation is &#xHHHH; where HHHH is the hexadecimal value of the Unicode code point for the given character. The decimal notation is &#DDDD; where DDDD is the decimal value of the Unicode code point for the given character.

Example:

<p id="1">Lowercase "a grave" = &#xE0; = &#224;</p>
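The two notations can be produced mechanically. The sketch below, in Python, replaces every character that a chosen target encoding cannot represent with a numeric character reference. The function name to_ncr and its parameters are assumptions made for illustration only, and the sketch does not escape markup characters such as & or <.

```python
def to_ncr(text: str, encoding: str = "ascii", hexadecimal: bool = True) -> str:
    """Replace characters that `encoding` cannot represent with NCRs."""
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)          # representable: keep the character as-is
            pieces.append(ch)
        except UnicodeEncodeError:
            code_point = ord(ch)         # Unicode code point of the character
            if hexadecimal:
                pieces.append(f"&#x{code_point:X};")
            else:
                pieces.append(f"&#{code_point};")
    return "".join(pieces)


print(to_ncr('Lowercase "a grave" = \u00e0'))                     # hexadecimal form
print(to_ncr('Lowercase "a grave" = \u00e0', hexadecimal=False))  # decimal form
```

Passing encoding="iso-8859-1" leaves the same character untouched, since that encoding can represent it directly.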

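Several ASCII characters must also be coded with entities to avoid confusion with OpenTag markup. These are the entities predefined by XML:
&amp; for & (ampersand)
&lt; for < (less-than sign)
&gt; for > (greater-than sign)
&quot; for " (quotation mark)
&apos; for ' (apostrophe)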

1.4. Inheritance

In OpenTag, the attributes of all structural and informative elements and delimiter elements can be inherited. If an element does not specify one of its attributes, the value of that attribute is taken from the closest parent element that specifies it.

For example:

<grp lc="EN" rid="DLG1" id="34">
 <grp id="id_23">
  <p lc="FR">&amp;Chercher...</p>   <!-- inherited: id="id_23" rid="DLG1" -->
  <p>&amp;Find...</p>               <!-- inherited: lc="EN" id="id_23" rid="DLG1" -->
  <p lc="SV">&amp;s&#x00F6;k...</p> <!-- inherited: id="id_23" rid="DLG1" -->
 </grp>
</grp>

This rule applies to structural and delimiter elements but not to in-line elements.
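A minimal sketch of how a tool might resolve inherited values, written in Python with the standard xml.etree.ElementTree module. The function name effective_attributes and the INLINE set are hypothetical helpers, and treating every attribute (including id) as inheritable is an assumption taken from the example above rather than a rule stated elsewhere in this document.

```python
import xml.etree.ElementTree as ET

# In-line elements, which do not take part in inheritance.
INLINE = {"ct", "g", "ix", "ixd", "lvl", "ocs", "rf", "so", "tx", "x"}

SAMPLE = """<grp lc="EN" rid="DLG1" id="34">
 <grp id="id_23">
  <p lc="FR">&amp;Chercher...</p>
  <p>&amp;Find...</p>
 </grp>
</grp>"""


def effective_attributes(root):
    """Map each non-in-line element to its own attributes merged over inherited ones."""
    resolved = {}

    def walk(elem, inherited):
        if elem.tag in INLINE:
            # Assumption: in-line elements neither inherit nor pass on their own values.
            current = inherited
        else:
            # Values set on the element itself override inherited ones.
            current = {**inherited, **elem.attrib}
            resolved[elem] = current
        for child in elem:
            walk(child, current)

    walk(root, {})
    return resolved


root = ET.fromstring(SAMPLE)
attrs = effective_attributes(root)
for p in root.iter("p"):
    print(p.text, attrs[p])
```

Because values are merged from the outermost element inward, the closest parent that specifies an attribute always wins.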

1.5. Casing

XML is a case-sensitive markup. The names of elements and attributes in OpenTag are always lowercase.

1.6. Namespace

In case OpenTag markup is mixed with other content types and you need to use a namespace identifier, the URI for OpenTag is: urn:OpenTag:Version12.


2. Detailed Specifications

After the XML declaration comes the OpenTag document itself, enclosed within the <opentag> element. An OpenTag document is composed of zero, one or more sections, each enclosed within a <file> element.

2.1. XML Prolog

The XML prolog is mandatory. It sets the default encoding of the file. If the encoding declaration is omitted, the file is assumed to be in either UCS-2 or UTF-8. If the file is in UCS-2, its first character must be the Unicode Byte-Order-Mark.

<?xml version="1.0" encoding="iso-8859-1" ?>

An ideal minimum OpenTag document will look something like this:

<?xml version="1.0" ?>
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN" datatype="PlainText" original="file.ext">
  <p>Hello World!</p>
 </file>
</opentag>

2.2. Elements and Attributes

OpenTag elements can be divided into three main categories: the structural and informative elements, the in-line elements and the delimiter elements. Attributes are shared among them.

Structural and informative elements: <csdef>, <file>, <grp>, <map/>, <note>, <opentag>, <p>, and <prop>.
In-line elements: <ct>, <g>, <ix/>, <ixd>, <lvl>, <ocs>, <rf/>, <so>, <tx>, and <x/>.
Delimiter elements: <mrk> and <s>.
Attributes: base, case, cs, code, comp, cond, coord, datatype, date, ent, font, id, lc, name, original, reference, rid, seg, subst, tool, ts, type, ucode, var, version, and ws.

2.2.1. Structural and Informative Elements

The structural elements specify the frame of an OpenTag document as well as contextual and processing information. The <p> element contains the extracted data and, possibly, in-line elements.

<opentag>
OpenTag document - The <opentag> element encloses all the other elements of the document.
Mandatory attributes: version.
Optional attributes: None.
Contents: One or more <file> elements.
<?xml version="1.0"?>
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN"
  datatype="PlainText"
  original="file.ext">
  <p>Hello World!</p>
 </file>
</opentag>
OpenTag document with the minimal structure.

<file>
File - The <file> element corresponds to a single extracted original document.
Mandatory attributes: tool, datatype, original, lc.
Optional attributes: reference, date, type, ws, ts.
Contents: Zero, one or more <csdef> elements, followed by zero, one or more of the following elements: <prop>, <note>, <grp>, <p>.
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN" datatype="JavaText"
  original="Test1.java">
  ...
 </file>
 <file tool="XYZ v1.0" lc="EN" datatype="rtf"
  original="\\brazil\recife\data.rtf">
  ...
 </file>
</opentag>
An OpenTag document with two <file> elements of different data types.
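The mandatory attributes listed for <opentag> and <file> lend themselves to a simple automated check. The following Python sketch reports which of them are missing. The names MANDATORY and missing_mandatory are hypothetical, and the sample document is made up for the illustration (its second <file> element is deliberately incomplete).

```python
import xml.etree.ElementTree as ET

# Mandatory attributes as listed in this section for <opentag> and <file>.
MANDATORY = {
    "opentag": ("version",),
    "file": ("tool", "datatype", "original", "lc"),
}

DOC = """<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN" datatype="PlainText" original="file.ext">
  <p>Hello</p>
 </file>
 <file tool="XYZ v1.0" datatype="rtf">
  <p>Incomplete</p>
 </file>
</opentag>"""


def missing_mandatory(document: str):
    """Return (element name, attribute name) pairs for each missing mandatory attribute."""
    root = ET.fromstring(document)
    problems = []
    for elem in root.iter():
        for name in MANDATORY.get(elem.tag, ()):
            if name not in elem.attrib:
                problems.append((elem.tag, name))
    return problems


print(missing_mandatory(DOC))
```

The table could be extended with the mandatory attributes of the other elements described below.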

<prop>
Property - The <prop> element allows the tools to specify non-standard information in the OpenTag document.
Mandatory attributes: type.
Optional attributes: lc, rid.
Contents: Tool-specific data or text, no standard elements.
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN"
  datatype="JavaText"
  original="Test1.java">
  <p id="23" type="caption">Input</p>
  <p id="24" type="label">File name:</p>
  <prop type="WordCount">3</prop>
 </file>
</opentag>
Here the <prop> element is used to define a tool-specific property called "WordCount". You could also use it to specify attached files, project information, translation memory data, machine translation processing data, etc.
The tool attribute identifies which tool has generated the document so each property can be identified even if two tools use the same property identifiers.
To define tool-specific data at the tag level, you can use the ts attribute.

<grp>
Group - The <grp> element specifies a set of elements that should be processed together. For example: all the items of a menu, several translations of the same paragraph, etc. A list of preferred values for the type attribute in <grp> is available.
Note: A <grp> element can contain other <grp> elements.
Mandatory attributes: None.
Optional attributes: tool, datatype, id, rid, seg, coord, font, type, lc, ws, ts, cond, var.
Contents: Zero, one or more of the following elements: <p>, <grp>, <note>, <prop>.
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN"
  datatype="JavaText"
  original="Test1.java">
  <grp rid="DLG_INPUT">
   <p id="23" type="caption">Input</p>
   <p id="24" type="label">File name:</p>
  </grp>
 </file>
</opentag>
Here the <grp> element is used to group together several <p> elements belonging to the same dialog box.
<grp> could also be used to group several language versions of the same <p> element.

<p>
Paragraph - The <p> element is used to delimit a unit of text. A paragraph in OpenTag does not necessarily correspond to a "paragraph" in a word-processor. It's simply a unit of text that could be a paragraph, a title, a menu item, a caption, etc. A list of preferred values for the type attribute in <p> is available.
Mandatory attributes: None.
Optional attributes: tool, datatype, id, rid, seg, coord, font, type, lc, ws, ts, cond, var.
Contents: Text, zero, one or more of the following elements: <s>, <mrk>, <g>, <ixd>, <ocs>, <ct>, <x/>, <ix/> and <rf/>.
<grp id="STR_item23">
 <p lc="EN">Monday</p>
 <p lc="fr-fr">Lundi</p>
 <p lc="TR">Ptesi</p>
 <p lc="cs">pond&#x011b;l&#x00ed;</p>
</grp>
A set of different translations of the same <p> element. In this example, the term "Monday" in English, French, Turkish and Czech.

<csdef>
Code set definition - The <csdef> element specifies user-defined code sets and characters.
Mandatory attributes: name, base.
Optional attributes: None.
Contents: Zero, one or more <note> elements followed by one or more <map/> elements.
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN"
  datatype="rtf"
  original="c:\proj34\doc\Hobbit.rtf">
  <csdef name="Latin1Cirth" base="iso-8859-1">
   <map code="130" ucode="&#xE0D5;" ent="noldorian_o"/>
   <map code="&#83;" ucode="57558" ent="noldorian_oo"/>
  </csdef>
 </file>
</opentag>
Here the <csdef> element is used to declare a user-defined code set called "Latin1Cirth" which uses the ISO Latin-1 code set as a base (all code-points not specified in the <csdef> are the same as ISO 8859-1).

<map/>
Character mapping - The <map/> element specifies the correspondence between a Unicode value and a code-point of a native code set.
Mandatory attributes: code, ucode.
Optional attributes: ent, comp, case, subst.
Contents: Empty.
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN"
  datatype="rtf"
  original="c:\proj34\doc\Hobbit.rtf">
  <csdef name="Latin1Cirth"
   base="iso-8859-1">
   <map code="130" ucode="&#xE0D5;" ent="noldorian_o"/>
   <map code="&#83;" ucode="57558" ent="noldorian_oo"/>
  </csdef>
 </file>
</opentag>
Here the <map/> elements define two user-defined characters. You must use Unicode values that are within the range of the Private Use Area (from U+E000 to U+F8FF). See the Unicode Standard 2.0 book, section 6.2 at page 6-119, for more information on the Private Use Area.
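The Private Use Area constraint can be checked once the attribute values are read. The Python sketch below interprets a code or ucode value in either of the two forms used in the example (a character reference, which the XML parser has already resolved to one character, or decimal text) and tests the range. The names code_point and in_private_use_area are hypothetical. Note one limitation of this approach: after parsing, a character reference that resolves to a single ASCII digit cannot be told apart from the text form.

```python
import xml.etree.ElementTree as ET

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # Private Use Area bounds quoted above


def code_point(value: str) -> int:
    """Interpret a parsed code/ucode attribute value as a code point."""
    if value.isascii() and value.isdigit():
        return int(value)   # text form: decimal value of the code point
    if len(value) == 1:
        return ord(value)   # character-reference form, already resolved by the parser
    raise ValueError(f"cannot interpret {value!r} as a code point")


def in_private_use_area(cp: int) -> bool:
    """Return True if cp lies between U+E000 and U+F8FF inclusive."""
    return PUA_FIRST <= cp <= PUA_LAST


elem = ET.fromstring('<map code="130" ucode="&#xE0D5;" ent="noldorian_o"/>')
ucode = code_point(elem.get("ucode"))
print(hex(ucode), in_private_use_area(ucode))
```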

<note>
Note - The <note> element is used to add document-related comments to the OpenTag document. XML comments ("<!-- ... -->") are allowed but are not necessarily kept by processing tools.
Mandatory attributes: None.
Optional attributes: lc, rid.
Contents: Text, no standard elements.
<grp id="4567">
 <note>This paragraph must always be in uppercase</note>
 <p lc="EN"><g id="1">WARNING:</g> YOU MUST
SETUP YOUR WORKING DIRECTORY BEFORE RUNNING
THE CONFIG tool.</p>
</grp>
The <note> element can be used to document the extracted text, to provide information between the different users that deal with the file, etc.
Tools must keep <note> elements when they process an OpenTag document. You can link a note to other elements with the rid attribute.

2.2.2. In-Line Elements

The in-line elements are the elements that can appear inside the core structural element <p>.

<g>
Generic group place-holder - The <g> element is used to replace any in-line code of the original document that has a beginning and an end and can be moved within its parent structural element. When possible, the type attribute allows you to specify what kind of attribute the place-holder represents. A list of preferred values for the type attribute in <g> is available.
Note: A <g> element can contain another <g> element. In this case, if the embedded group has an id attribute, it should never be moved outside of its parent group.
Mandatory attributes: None. But a <g> element should at least have an id or type attribute to make sense.
Optional attributes: id, type, rid, ts.
Contents: Text. Zero, one or more of the following elements: <s>, <mrk>, <g>, <ixd>, <ocs>, <ct>, <x/>, <ix/>, <rf/>.
<p>Text with some <g id="1"><g id="2">formatting</g> and some other.</g></p>
 

<x/>
Generic place-holder - The <x/> element is used to replace any code of the original document.
Mandatory attributes: None. But a <x/> element should at least have an id or type attribute to make sense.
Optional attributes: id, type, rid, ts.
Contents: Empty.
<p><x id="1"/>Text with generic code place-holder.</p>
 

<ix/>
Index marker - The <ix/> element specifies a reference to an index entry. The definition of the entry itself is done in the corresponding <ixd> element (both are linked by their rid attribute, for which they have the same value).
Mandatory attributes: rid.
Optional attributes: id, ts.
Contents: Empty.
<p>Term<ix rid="INDEX2"/> to index.</p>
 

<ixd>
Index definition - The <ixd> element is used to specify the entry corresponding to one or more <ix/> elements. It does not have to be in the same <p> or even the same <grp> element. Markers and definitions do have to be in the same <file> element.
Mandatory attributes: rid.
Optional attributes: id, ts.
Contents: One or more <lvl> elements.
<p><ixd rid="INDEX2" id="34">
<lvl><tx>$ command</tx></lvl></ixd></p>
The <ixd> element used to define a simple index entry. Here it defines the text for all <ix/> markers that also have the rid attribute set to "INDEX2".

<tx>
Index entry text - The <tx> element is used to delimit the text of an index entry level.
Mandatory attributes: None.
Optional attributes: id, seg, ts.
Contents: Text. Zero, one or more of the following elements: <s>, <mrk>, <g>, <ocs>, <ct>, <x/>, <rf/>, and zero or one <so> element.
<p><ixd rid="INDEX2" id="34">
<lvl><tx>$ command</tx></lvl></ixd></p>
The <tx> element used to define a simple index entry.

<lvl>
Level - The <lvl> element is used to delimit the different levels of an index entry.
Mandatory attributes: None.
Optional attributes: id, ts.
Contents: One <tx> element and zero or one <so> element.
<p><ixd rid="INDEX2" id="34">
<lvl><tx>$ command</tx></lvl></ixd></p>
The <lvl> element used to define a simple index entry.

<so>
Sort order - The <so> element indicates the text that should be used to sort an index entry in an <lvl> element.
Mandatory attributes: None.
Optional attributes: id, seg.
Contents: Text, no elements.
<p><ixd rid="INDEX2" id="34">
<lvl><tx>$ command</tx><so>dollar command</so>
</lvl></ixd></p>
The <so> element used to specify the "reading order" of an entry, so that the symbol is sorted according to its pronunciation.

<rf/>
Reference marker - The <rf/> element specifies a reference to any type of reference text (variable, pre-composed text, footnote, etc.). The definition of the reference text itself is done in one or more corresponding <p> elements (linked by their rid attribute, for which they have the same value).
Mandatory attributes: rid.
Optional attributes: id, ts, type.
Contents: Empty.
<p id="1" rid="1" type="fn">Elephant: Big animal.</p>
<p id="2">The happy elephant<rf rid="1" type="fn"/>.</p>
<grp rid="2">
 <p type="alt" id="3">Click here to go to Description</p>
 <p type="link" id="4">http://www.xyz.com/desc.htm</p>
</grp>
<p id="5">See <g id="1" rid="2">Description</g>.</p>
The <rf/> element can be used to reference anything. It simply marks the position where the text should go. The link between reference definition and marker is done with the rid attribute.

For example, here the first paragraph is the definition of a footnote referenced in the second paragraph. Note that in some cases the reference can be composed of several <p> elements, as for the <g> element of the fifth paragraph.

<ocs>
Original code set - The <ocs> element is used to indicate the code set of a part of the text that is different from the default code set. Note that <ocs> is only informative; in the OpenTag file the text within an <ocs> element is in the same code set as the surrounding text.
Mandatory attributes: cs.
Optional attributes: id, ts.
Contents: Text, zero, one or more of the following elements: <s>, <mrk>, <g>, <ixd>, <ct>, <x/>, <ix/>, <rf/>.
<p><ocs cs="Symbol">&#x2714;</ocs> First
item of the list</p>
Here the <ocs> element allows you to specify that the first character of the paragraph is a check mark symbol and should be coded in a code set different from the rest of the text when merged back.
Remember that <ocs> does not specify a change of code set in the OpenTag file itself.

<ct>
Conditional text - The <ct> element is used to mark specific strings of the text for a given condition.
Mandatory attributes: cond.
Optional attributes: id, ts.
Contents: Text. Zero, one or more of the following elements: <s>, <mrk>, <g>, <ixd>, <ocs>, <x/>, <ix/>, <rf/>.
<p id="2">See <ct cond="doc">page 34</ct><ct
cond="hlp">screen 7</ct>
for more information.</p>
The <ct> element used to mark two different texts corresponding to two different outputs.

2.2.3. Delimiter Elements

OpenTag defines additional elements to support various types of text processing. These elements are usually not generated by the extraction module and are ignored most of the time during merging, but they can be very powerful with tools such as Machine Translation, glossary handling, quality assurance, etc.

<s>
Segment - The <s> element indicates a unit of text such as a sentence, title, menu item, message, etc. The <s> element is not part of the tags used to merge the OpenTag file back into its original format. A list of preferred values for the type attribute in <s> is available.
Mandatory attributes: None.
Optional attributes: seg, ts, id.
Contents: Text. Zero, one or more of the following elements: <mrk>, <g>, <ixd>, <ocs>, <ct>, <x/>, <ix/>, <rf/>.
<p id="1"><s seg="1">Click OK. </s><s
seg="2">Save the file.</s></p>
The <s> element separates segments within a paragraph. When the <p> element contains only a single segment you can avoid using <s> and simply use the seg attribute.

<mrk>
Marker - The <mrk> element delimits a section of text that has special meaning, such as a terminological unit, a proper name, an item that should not be modified, etc. It can be used for various processing tasks: for example, to indicate to a Machine Translation tool which proper names should not be translated, to support terminology verification, or to mark suspect expressions after a grammar check. The <mrk> element is usually not generated by the extraction tool and it is not part of the tags used to merge the OpenTag file back into its original format. A list of preferred values for the type attribute in <mrk> is available.
Mandatory attributes: type.
Optional attributes: id, ts.
Contents: Text. Zero, one or more of the following elements: <s>, <g>, <ixd>, <ocs>, <ct>, <x/>, <ix/>, <rf/>.
<p lc="EN-US">Use a <mrk type="term">regular expression</mrk>
to search for <mrk type="name">Hobbit</mrk> item marker.
</p>
In this example the <mrk> element is used to tag a glossary term as well as a proper name.

2.2.4. Attributes

This section lists the various attributes used in the OpenTag elements. An attribute is never specified more than once for each element.

version
OpenTag version - The version attribute is used to specify the format version of the OpenTag document.
Value description: A number.
Default value: Empty string.
Element using it: <opentag>.
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN"
  datatype="rtf"
  original="c:\simaril6\gandalf.rtf">
  ...
 </file>
</opentag>
This example shows an OpenTag document corresponding to the specifications of version 1.2.

tool
Creation tool - The tool attribute is used to specify the signature and version of the tool that created or modified the document.
Value description: Not defined by the standard.
Default value: Empty string.
Elements using it: <file>, <grp>, <p>.
<file tool="XYZ v1.0" lc="EN"
 datatype="rtf"
 original="c:\simaril6\gandalf.rtf">
 ...
</file>
Here the creation tool is identified as "XYZ v1.0". Usually you want your tool signature to indicate the version as well as the tool.
The tool attribute allows you to know how you should process tool-specific data such as <prop> elements and ts attributes.

datatype
Data type - The datatype attribute specifies the kind of text contained in the element. Depending on that type, you may apply different processes to the data.
Value description: Not defined by the standard. However, a list of recommended values is provided.
Default value: Empty string.
Elements using it: <file>, <grp>, <p>.
<file lc="EN-US"
 tool="LXString 1.01-004"
 datatype="JavaString"
 original="//brazil/adm/tmp/app.pro">
 ...
</file>
The datatype attribute here specifies that the text in the file has been extracted from a Java property or source code file.

date
Date - The date attribute indicates when a given element was created or modified.
Value description: CCYY-MM-DDThh:mm:ss (for local time) or CCYY-MM-DDThh:mm:ssZ (for UTC time).
Default value: Empty string.
Element using it: <file>.
<file lc="EN-US"
 tool="Java_OTF 1.01-004"
 datatype="JavaString"
 original="//brazil/adm/tmp/app.pro"
 date="1997-11-25T06:12:00">
 ...
</file>
The date attribute specifies 25 November 1997 at 06:12:00, local time.
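The two date notations can be read with the Python standard library. In this sketch, a value ending in Z is treated as UTC and any other value is returned as a naive local time. The function name parse_opentag_date and the constant DATE_FORMAT are hypothetical.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # CCYY-MM-DDThh:mm:ss


def parse_opentag_date(value: str) -> datetime:
    """Parse a date attribute value; a trailing 'Z' marks UTC time."""
    if value.endswith("Z"):
        parsed = datetime.strptime(value[:-1], DATE_FORMAT)
        return parsed.replace(tzinfo=timezone.utc)
    # No 'Z': local time, returned without time zone information.
    return datetime.strptime(value, DATE_FORMAT)


print(parse_opentag_date("1997-11-25T06:12:00"))
print(parse_opentag_date("1997-11-25T06:12:00Z"))
```

Values that do not follow the notation, such as one written without the dashes and colons, raise ValueError.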

lc
Locale - The lc attribute specifies the locale of the text of a given element.
Value description: A 2-letter code corresponding to one of the language identifiers defined in ISO-639, or a 2+2-letter code where the first 2 letters are one of the language identifiers defined in ISO-639 followed by a dash and one of the country/region identifiers defined in ISO-3166.
Note: The reserved xml:lang attribute defined in XML does not correspond to OpenTag's definition of a locale, and its scope rules are not appropriate for attributes in the OpenTag case. Therefore OpenTag does not use it to indicate locale/language. However the lc attribute uses values that are very similar to the values used for xml:lang.
Default value: Empty string.
Elements using it: <file>, <grp>, <p>, <note>, <prop>.
<grp id="id_SEARCH">
 <p lc="EN-US">&amp;Search...</p>
 <p lc="FR-FR">&amp;Recherche...</p>
</grp>
An OpenTag document can contain multi-lingual data: The lc attribute is used to tag each specific locale.
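A quick shape check for lc values can be sketched in Python as follows. It only verifies the 2-letter or 2+2-letter form described above and does not confirm that the codes actually appear in ISO-639 or ISO-3166. Accepting both upper and lower case is an assumption based on the examples in this document, and the names LC_PATTERN and looks_like_locale are hypothetical.

```python
import re

# Two letters, optionally followed by a dash and two more letters.
LC_PATTERN = re.compile(r"[A-Za-z]{2}(-[A-Za-z]{2})?")


def looks_like_locale(value: str) -> bool:
    """Return True if value has the 2-letter or 2+2-letter shape of an lc value."""
    return LC_PATTERN.fullmatch(value) is not None


for candidate in ("EN-US", "fr-fr", "cs", "english"):
    print(candidate, looks_like_locale(candidate))
```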

cs
Code set - The cs attribute specifies the code set of the text for a given element. When the encoding of the file is UCS-2 or ISO-646 the cs attribute is only informative.
Value description: One of the code set identifiers defined by the IANA, or a user-defined code set name declared in a <csdef> element. A sub-set of the preferred values is available in this document.
Default value: Empty string.
Elements using it: <ocs>.
<p lc="EN">Text in English</p>
<p lc="cs"><ocs cs="iso-8859-2">Text in Czech</ocs></p>
The text within an <ocs> element is in the same code set as the rest of the file, but the cs attribute indicates what was the original code set in the source document.

name
Name - The name attribute specifies the user-defined code set name of a <csdef> element.
Value description: Not specified by the standard.
Default value: Empty string.
Elements using it: <csdef>.
<csdef name="Latin1Cirth" base="iso-8859-1">
 <map code="130" ucode="&#xE0D5;" ent="noldorian_o"/>
</csdef>
This example shows how the name attribute is used to identify a <csdef> element.
The name value can contain any characters; however, white space characters are not recommended.

type
Type - The type attribute specifies the context and the type of resource or style of the data of a given element. For example, to define if it is a label, or a menu item in the case of resource-type data, or the style in the case of document-related data.
Value description: The value will depend on each element. A recommended list of values is provided by the standard.
Default value: Empty string.
Elements using it: <prop>, <file>, <grp>, <p>, <g>, <rf/>, <x/>.
<p type="message">Cannot find %s.</p>
<p type="label">List:</p>
The type attribute used to give context information about a paragraph.

id
Identifier - The id attribute is used in many elements, usually as a unique reference to the original corresponding format for the given element.
Value description: Alpha-numeric. It is recommended to not use spaces.
Default value: Empty string.
Elements using it: <grp>, <p>, <g>, <x/>, <ix/>, <ixd>, <lvl>, <so>, <rf/>, <ocs>, <ct>, <s>, <mrk>.
<p id="34">Extracted text</p>
<p id="IDC_file_OPEN">&amp;Open...</p>
The id attribute can be extracted from the original file, or generated automatically.

rid
Reference identifier - The rid attribute is used to link different elements that are related. For example, a reference to its definition, or paragraphs belonging to the same group, etc.
Value description: Alpha-numeric. It is recommended to not use spaces.
Default value: Empty string.
Elements using it: <grp>, <p>, <note>, <ix/>, <ixd>, <g>, <rf/>, <x/>, <prop>.
<p id="23">Start <rf rid="1"/>.</p>
<p id="24" rid="1">YZApplication</p>
In this example the attribute rid links a reference marker with its definition later in the file.
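A tool could verify that every reference marker has a definition. The Python sketch below collects the rid values carried by <p> and <grp> elements inside one <file> element and lists the <rf/> markers that match none of them. The function name unresolved_references is hypothetical, the sample is invented for the illustration, and counting a <grp> rid as a definition is an assumption based on the <rf/> example earlier in this document.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<file tool="XYZ v1.0" lc="EN" datatype="PlainText" original="file.ext">
 <p id="23">Start <rf rid="1"/>.</p>
 <p id="24" rid="1">YZApplication</p>
 <p id="25">See <rf rid="9"/>.</p>
</file>"""


def unresolved_references(file_elem):
    """Return the rid of each <rf/> marker that has no matching definition."""
    # Definitions: any <p> or <grp> carrying a rid attribute.
    defined = {
        elem.get("rid")
        for elem in file_elem.iter()
        if elem.tag in ("p", "grp") and elem.get("rid") is not None
    }
    return [rf.get("rid") for rf in file_elem.iter("rf") if rf.get("rid") not in defined]


print(unresolved_references(ET.fromstring(SAMPLE)))
```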

cond
Condition - The cond attribute is used to identify an element corresponding to conditional text in the original format.
You can use the <ct> element to set a condition for a sub-set of text.
Value description: Alpha-numeric.
Default value: Empty string.
Elements using it: <grp>, <p>, <ct>.
<p id="12" cond="Common">Text for
<ct cond="DocOnly">the documentation</ct>
<ct cond="HlpOnly">the On-line help</ct>
only.</p>
This paragraph has some common text and two variations; one for documentation, the other for on-line help.

seg
Segment identifier - The seg attribute is used to mark an element as a segment or specific translation unit.
Value description: Alpha-numeric. It is recommended to not use spaces.
Default value: Empty string.
Elements using it: <grp>, <p>, <lvl>, <so>, <s>.
<p id="3" seg="4">Single segment in a 
paragraph.</p>
The seg attribute can be used directly in a paragraph if the paragraph contains only a single segment. You can also mark segments this way in each level of an index definition.

ts
Tool-specific data - The ts attribute allows you to include short data understood by a specific toolset.
You can also use the <prop> element to define large properties at the element level.
Value description: Not defined by the standard.
Default value: Empty string.
Elements using it: <file>, <grp>, <p>, <ix/>, <lvl>, <rf/>, <ocs>, <ct>, <s>, <mrk>, <x/>, <g>.
<grp seg="9">
 <p lc="EN-US">XYZ printer Setup Dialog</p>
 <p lc="FR-FR" ts="98%,hobbit.tm"
  >Installation de l'imprimante XYZ</p>
</grp>
Here the ts attribute is used to specify the origin of a leveraged translation.

coord
Coordinates - The coord attribute specifies the x, y, cx and cy coordinates of the text for a given <p> or <grp> element. The cx and cy values must represent the width and the height (as in a Windows resource file). The extraction and merging tools must make the necessary conversions for original formats that use a top-left/bottom-right coordinate system.
Value description: Four decimal (possibly negative) values, in the order: x,y,cx and cy, separated by semi-colons.
Default value: Empty string.
Elements using it: <grp>, <p>.
<grp type="button" coord="8;8;50;14;">
 <p lc="EN">&amp;Help...</p>
 <p lc="IT">&amp;Aiuto...</p>
</grp>
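The conversion between the x, y, cx, cy form and a top-left/bottom-right form is simple arithmetic. The Python sketch below parses a coord value and converts it. It assumes the four values are integers and tolerates the trailing semi-colon seen in the example. The function names parse_coord and to_corners are hypothetical.

```python
def parse_coord(value: str):
    """Split a coord value such as "8;8;50;14;" into (x, y, cx, cy)."""
    fields = [field.strip() for field in value.split(";")]
    if fields and fields[-1] == "":
        fields.pop()  # tolerate a trailing semi-colon
    if len(fields) != 4:
        raise ValueError(f"expected four values, got {len(fields)}")
    x, y, cx, cy = (int(field) for field in fields)  # values may be negative
    return x, y, cx, cy


def to_corners(x: int, y: int, cx: int, cy: int):
    """Convert position plus width/height into (left, top, right, bottom)."""
    return x, y, x + cx, y + cy


print(to_corners(*parse_coord("8;8;50;14;")))
```

A merging tool for a format that stores corners would apply to_corners; an extraction tool would apply the inverse (cx = right - left, cy = bottom - top).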
 

font
Font - The font attribute specifies the font name and font size of the text for a given <p> or <grp> element. The font attribute would generally be used for resource-type data: change of font in document-type data can be marked with the <g> element.
Value description: Name of the font and its size separated by a semi-colon.
Default value: Empty string.
Elements using it: <grp>, <p>.
<grp type="dialog" coord="0;0;100;150;"
 font="MS Sans Serif;8">
 <p type="caption">Settings</p>
 <p type="button">OK</p>
 <p type="button">Cancel</p>
</grp>
Font attribute in a file extracted from Windows resources. The font information could be used by resizing tools, to verify maximum length of a translation, etc.

ws
White spaces - The ws attribute specifies how white spaces (ASCII spaces, tabs and line-breaks) should be treated.
Value description: Its value must be:
- 0 if any consecutive white spaces are reduced to one space (&#x20;).
- 1 if all white spaces must be preserved (i.e. like in a <PRE> element in HTML format).
- 2 like case 0, but excluding tab from white-spaces.
Default value: 0
Elements using it: <file>, <grp>, <p>.
<p ws="0">Text 

with  4  spaces   </p>
<p ws="1">Text with 4 spaces </p>
Each run of white space in the first paragraph is reduced to a single space. In the second paragraph every white space must be preserved by the tools. In this case the end result is the same for both paragraphs.
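The three ws values can be sketched as a small normalization function in Python. White space is taken here to mean ASCII space, tab, carriage return and line feed. The function name apply_ws is hypothetical, and reading value 2 as "tabs are left untouched" is an interpretation of the description above.

```python
import re


def apply_ws(text: str, ws: int = 0) -> str:
    """Normalize white space in text according to a ws attribute value."""
    if ws == 0:
        # Any run of spaces, tabs and line-breaks becomes one space.
        return re.sub(r"[ \t\r\n]+", " ", text)
    if ws == 1:
        # All white space is preserved.
        return text
    if ws == 2:
        # Like 0, but tabs are not treated as white space.
        return re.sub(r"[ \r\n]+", " ", text)
    raise ValueError(f"unsupported ws value: {ws}")


print(repr(apply_ws("Text \n\nwith  4  spaces   ", ws=0)))
print(repr(apply_ws("Text with 4 spaces ", ws=1)))
```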

original
Original file - The original attribute specifies the name of the original file from which the contents of a <file> element have been extracted.
Value description: Alpha-numeric.
Default value: Empty string.
Element using it: <file>.
<file lc="EN-US"
 tool="Java_OTF 1.01-004"
 datatype="JavaString"
 original="//brazil/adm/tmp/app.pro"
 date="1997-11-25T06:12:00">
 ...
</file>
The original attribute could be used by tools to locate the various files needed when merging back the OpenTag document.

reference
Reference file - The reference attribute specifies the name of the reference file that should be used to merge back the content of a <file> element into its original format.
Value description: Alpha-numeric.
Default value: Empty string.
Element using it: <file>.
<file lc="EN-US"
 tool="Java_OTF 1.01-004"
 datatype="JavaString"
 original="//brazil/adm/tmp/app.pro"
 reference="//brazil/adm/tmp/app.skl"
 date="1997-11-25T06:12:00">
 ...
</file>
The reference attribute could be used by tools to locate the various files needed when merging back the OpenTag document in case they are not the same as the original file. The reference file is often called the "skeleton" file.

base
Base code set - The base attribute specifies the code set upon which the re-mapping of characters defined by a given <csdef> element is based.
Value description: One of the code set identifiers defined by IANA.
Default value: Empty string.
Element using it: <csdef>.
<csdef name="Latin1Cirth" base="iso-8859-1">
 <map code="130" ucode="&#xE0D5;" ent="noldorian_o"/>
</csdef>
In this example, the attribute base indicates that the code set upon which the user-defined characters are specified is ISO Latin-1.

code
Native code - The code attribute specifies the code-point of the given <map/> element in a native non-Unicode code set.
Value description: Its value must be either in character reference format or in text. In the latter case, the text must be the decimal value of the code-point.
Default value: Empty string.
Element using it: <map/>.
<map code="131"
 ucode="&#xE0D6;"
 ent="noldorian_oo"/>
 

ucode
Unicode code - The ucode attribute specifies the Unicode code-point of a given <map/> element.
Value description: Its value must be a valid Unicode code-point value, either in character reference format or in text. In the latter case, the text must be the decimal value of the code-point.
Default value: Empty string.
Element using it: <map/>.
<map code="131"
 ucode="57558"
 ent="noldorian_oo"/>
 

ent
Entity name - The ent attribute specifies the name of the character for a given <map/> element.
Value description: Its value must be a valid ASCII name (e.g. "amp" for '&').
Default value: Empty string.
Element using it: <map/>.
<map code="131"
 ucode="57558"
 ent="noldorian_oo"/>
 

subst
Substitution text - The subst attribute specifies the text to substitute for a character of a given <map/> element, when it does not exist in the target code set.
Value description: Its value must be a valid ASCII character or string (e.g. "(c)" for '©').
Default value: Empty string.
Element using it: <map/>.
<map code="131"
 ucode="57558"
 ent="noldorian_oo"
 subst="oo"/>
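The copyright sign mentioned in the value description could be mapped as follows. This is an illustrative sketch only: the entity name "copy" is an assumption, while 169 is the ISO Latin-1 code-point of '©' and U+00A9 is its Unicode code-point.
<map code="169"
 ucode="&#x00A9;"
 ent="copy"
 subst="(c)"/>
A tool converting to a code set that lacks '©', such as 7-bit ASCII, could then write "(c)" in its place.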
 

comp
Composition codes - The comp attribute specifies the possible base Unicode characters used to compose the character of a given <map/> element.
Value description: Its value must be a list of two, three, four or five Unicode values (including user-defined characters). Each value is separated by a semi-colon and should be either in character reference format or in text. In the latter case, the text must be the decimal value of the Unicode code-point.
Default value: Empty string.
Element using it: <map/>.
<map code="167"
 ucode="57000"
 ent="acircacutetone"
 comp="&#x00e2;&#x0341;"/>
For example, in Vietnamese the letter a circumflex can have an additional acute tone mark. Some fonts may need to have a direct mapping to this combination.
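As a sketch of the text notation described in the value description, and assuming that decimal values are separated by semi-colons, the same composition could be written with decimal code-points (0x00E2 is 226 and 0x0341 is 833):
<map code="167"
 ucode="57000"
 ent="acircacutetone"
 comp="226;833"/>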

case
Case - The case attribute specifies the opposite-case character for the one given in a <map/> element (e.g. 'A' is the case change for 'a').
Value description: Its value must be a valid Unicode code-point value (including user-defined characters), either in character reference format or in text. In the latter case, the text must be the decimal value of the code-point.
Default value: Empty string.
Element using it: <map/>.
<csdef name="Deseret" base="iso-8859-1">
 <map code="63" ucode="&#xE842;" ent="deseret_BEE" case="&#xE872;"/>
 <map code="111" ucode="&#xE872;" ent="deseret_bee" case="&#xE842;"/>
</csdef>
 

var
Variant - The var attribute allows you to identify different elements according to the way they have been generated.
Value description: Its value is not specified by the standard.
Default value: Empty string.
Elements using it: <grp>, <p>, <s>.
<grp id="45">
 <p lc="EN-US">&amp;Search...</p>
 <p lc="FR-FR">&amp;Chercher...</p>
 <p lc="FR-FR" var="MT">&amp;Recherche...</p>
</grp>
In this example, the var attribute is used to mark an additional proposed translation, here coming from a Machine Translation product. It could also be used to mark other automatic translations (TM, glossary leveraging, etc.) or verified text, or even to keep a history list.
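Since the standard does not specify the values of var, a tool is free to choose its own. As a hypothetical sketch, a translation-memory match and a reviewed translation could be kept side by side. The values "TM" and "verified", the id and the text are assumptions made for illustration:
<grp id="46">
 <p lc="EN-US">&amp;Replace...</p>
 <p lc="FR-FR" var="TM">&amp;Remplacer...</p>
 <p lc="FR-FR" var="verified">&amp;Remplacer...</p>
</grp>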

3. Recommended Attribute Values

This section lists the recommended values for some of the attributes. Values for these attributes are not case sensitive. These lists are purely informative; the goal is to specify a preferred syntax so that tools can have some level of compatibility.

Values for the type attribute of the <grp>, <p>, <s> and <rf/> elements. This list is not exhaustive.


Values for the type attribute of the <mrk> element. This list is not exhaustive.


Values for the type attribute of the <g> element. This list is not exhaustive.


Values for the type attribute of the <x/> element. This list is not exhaustive.


Values for the cs and base attributes. This list is not exhaustive, but it gives examples from which additional names can be inferred.


Values for the datatype attribute. This list is not exhaustive.


4. Sample File

This section shows a short sample of an OTF file.

Notation conventions:

Sample:

<?xml version="1.0" encoding="iso-8859-1"?>
<opentag version="1.2"
 xmlns="urn:OpenTag:Version12">
 <!-- First file, from a Java property file. It contains several locales. -->
 <file
  lc="EN-US"
  tool="Java_OTF:1.01-004:Java" 
  datatype="Java"
  original="//brazil/recife/devile/data/app.pro"
  reference="//brazil/recife/devile/data/extract/app.skl">  
  <grp rid="id_DLG_STATUS" type="label">
   <grp id="IDC_ACTIVITY" coord="8;72;54;10">
    <p lc="EN-US">&amp;Activity</p>
    <!-- Tools specific data, e.g. in this case leverage information -->
    <p lc="FR-FR" ts="100%,Gandalf3.tm">&amp;Activit&#x00e9;</p>
   </grp> 
  </grp>
  <!-- Example of a note generated by a filter -->
  <note>Extraction word count = 1</note>  
 </file>
 <!-- Second file, this time from RTF. It contains only the text of the source language. -->
 <file
  lc="EN-US"  
  tool="Borneo 1.00-017"  
  datatype="RTF"    
  original="//brazil/recife/devile/data/help.rtf">
  <!-- Definition for two user-defined characters. -->
  <csdef name="Latin1Cirth" base="ISO-8859-1">
   <note>For more information about the Cirth see
    the Web page http://www.indigo.ie/egt/standards/csur/cirth.html
   </note>
   <map code="130" ucode="&#xE0D5;" ent="noldorian_o"/>
   <map code="&#x83;" ucode="57558" ent="noldorian_oo"/>
  </csdef>
  <p id="1">This is a text in <g type="bold">bold</g>
and <g id="1">all caps red</g></p>
  <p id="2">Second paragraph with graphic <x id="1"/>.</p>
 </file>
</opentag>

5. History of Modifications

From 1.1b to 1.2 (Nov-06-1998)

From 1.1 to 1.1b (Apr-22-1998)

From 1.0 to 1.1 (Mar-18-1998)


6. References

OpenTag provisions from other publications:

ISO 639:1988
Code for the representation of names of languages.
See
http://www.unicode.org/unicode/onlinedat/languages.html
ISO 3166:1993
Code for the representation of names of countries.
See
http://www.unicode.org/unicode/onlinedat/countries.html
ISO 646:1991
Information Technology -- ISO 7-bit coded character set for information interchange (ASCII).
ISO 8601:1988
Data elements and interchange formats - Information interchange - Representation of dates and times.
ISO 8879:1986
Information Processing - Text and Office Systems - Standard Generalized Markup Language.
See
http://www.sgmlopen.org/
ISO 10646-1:1993
Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - All parts.
See
http://www.unicode.org/
IANA Code set names
Code set naming conventions.
See
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Extensible Markup Language
Extensible Markup Language specifications.
See
http://www.w3.org/TR/REC-xml