The CSV format
==============
by Paul Hsieh

..............................................................................

Introduction
============

This specification is based on a partial specification by Robert J. Lynch and
a description by Creativyst Software. This specification expands on these and
attempts to nail the format down specifically and completely, in a way that is
compatible with existing practice for CSV file generation. This description
should provide a way to make CSV a deterministically portable format.

What is the CSV file format?
============================

The Comma Separated Values, or CSV, format is an encoding format for a finite
two dimensional collection of raw data (originally developed by Microsoft for
use with its Excel product). The individual datum encodings may be any
arbitrary contents that can be encoded into a sequence of octets (this
includes arbitrary binary data, Unicode's UTF-8, or, in fact, CSV encodings
themselves). The CSV format encodes a sequence of Fields, each of which is a
sequence of Entries. CSV imposes no limits on Entry length, Field length or
Field count.

Generally, the format is merely a sequence of octets with the comma (,),
quote (") and carriage return (\r) characters taking on special meaning that
provides the underlying structure of the data. All octets appearing between a
pair of bracketing quotes have no special meaning besides being raw data,
except for quotes themselves, which must be doubled up to guarantee correct
parsing. Otherwise, the comma octet separates Entries in a Field, and the
carriage return octet separates Fields. Each entry is either a quoted or an
unquoted octet data sequence -- surrounding white space is considered
irrelevant and is trimmed upon consumption.

Example 1:
----------

A CSV file with the following contents:

    1234,The Big Ol' Bear

would be parsed as a single Field with two data Entries, with the contents
"1234" and "The Big Ol' Bear". Note that the CSV file format does not dictate
any meta-information, so the fact that the first entry seems to be a 4-digit
number is not something that the CSV file format implicitly encodes.

The next example:

    "1234 Harrington St, Northwest",Suite 17 Stop 3

shows a quoted entry. The quotes are required because of the embedded comma
(,) in the first entry.

In the following example:

    "1234 West ""Q"" St.", 0

the first entry (1234 West "Q" St.) contains quote characters that must be
doubled to be rendered properly. The second entry (0) has a leading white
space which is ignored.

The next example:

    "1234 West "Q" St.", 0

is an example of legacy-formed CSV data resulting from a bug in the program
Paradox. A sufficiently forgiving parser should infer that the quote
characters before and after the Q are octet data in sequence with the rest of
the octet data for the first entry, rather than a quoting escape sequence.
This makes this CSV data identical to the example before it. The way the CSV
parser is able to deduce that these quotes are interior data is that they are
not followed by a comma, carriage return or EOF (optionally prefixed by white
space). These improper encodings are strongly discouraged, since they provide
no way of encoding an entry which contains a quote octet followed by either a
comma or carriage return octet.

The next example:

    "Thomas Aquinus, Esq.
    Prosecutor for the Pope
    St. Luke's Dungeon
    Somewhere in Italy"

shows a single entry which contains carriage return octets.
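To make the quoting rules above concrete, here is a minimal C++ sketch of an
Entry encoder. It is only an illustration of the rules described in this
section, not a reference implementation, and the function name encodeEntry is
invented for the example:

    #include <cctype>
    #include <iostream>
    #include <string>

    // Sketch: encode one Entry per the rules above. The entry is wrapped in
    // quotes whenever it contains a comma, quote, CR or LF octet, or has
    // leading/trailing white space (which a reader would otherwise trim);
    // embedded quote octets are doubled.
    static std::string encodeEntry(const std::string &raw)
    {
        bool needQuotes =
            raw.find_first_of(",\"\r\n") != std::string::npos ||
            (!raw.empty() && (std::isspace((unsigned char) raw.front()) ||
                              std::isspace((unsigned char) raw.back())));
        if (!needQuotes)
            return raw;

        std::string out = "\"";
        for (char c : raw) {
            out += c;
            if (c == '"')
                out += c;           // double up embedded quotes
        }
        out += '"';
        return out;
    }

    int main()
    {
        std::cout << encodeEntry("1234") << ','
                  << encodeEntry("1234 West \"Q\" St.") << "\r";
        // emits: 1234,"1234 West ""Q"" St."
    }

An encoder written this way always produces the properly doubled form and
never emits the legacy Paradox-style quoting shown above.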
The following example:

    ,Thos.,"",Aquinus,Esq,Pros.forPope,,"Somewhere..."

shows that an empty entry may be expressed either as 0 octets between
delimiters, or as an empty quote pair.

Grammar
=======

Note that the CSV file format is described in terms of octets, which embed
7-bit ASCII as a subset of the values expressible by each octet. Such octets
will generally be referred to as ascii-8 format. So CSV does not have a
direct encoding in incompatible formats such as EBCDIC, UTF-16 or Base-64.

The correctly parsable CSV grammar is described as follows:

    WSChar        = [ \v\f\t]
    NoWSChar      = <RawChar - WSChar>
    RawChar       = <ascii-8 - [,"\r\n]>
    NoQuoteChar   = <ascii-8 - ["]>
    ParadoxBug    = ["] WSChar* NoWSChar NoQuoteChar*
    QuotableText  = NoQuoteChar* (["] ["] NoQuoteChar* | ParadoxBug)*
    NonBlankEntry = NoWSChar (RawChar* NoWSChar)?       # Not bracketed by WS
                  | ["] QuotableText ["]
    Entry         = [] | NonBlankEntry
    Field         = WSChar*                             # 0 entries
                  | WSChar* NonBlankEntry WSChar*       # 1 entry
                  | WSChar* Entry WSChar* [,] Field     # 2 or more entries
    CSVFile       = Field
                  | Field [\r] [\n]? CSVFile

Legend:

    - []          Represents the empty string.
    - <ascii-8>   Represents one arbitrary octet.
    - [ ... ]     Represents one character selected from the sequence of
                  characters between the brackets. Non-text characters are
                  described in a C language character compatible way. The [
                  character is represented as \[.
    - <A - B>     Represents one octet from the set A that is not in the
                  set B.
    - A B         Represents the concatenation of rule A with the rule B.
    - A*          Represents the rule A concatenated with itself 0 or more
                  times.
    - A?          Represents either the rule A or [].
    - A | B       Represents either the rule A or the rule B.
    - (A)         Represents the rule A taken as a whole.

Words which start with a capital letter are non-terminal rules.

Notes:

- To accommodate improperly formatted quoted entries (such as those output by
  Borland's Paradox), any enclosed lone quote which is not followed by some
  amount of white space and then either a "," or a carriage return should be
  considered a stand-alone double quote, i.e. a literal quote octet in the
  entry data. Going forward, implementations should not output CSV data using
  the ParadoxBug rule.

- An input line which is just white space is 0 entries, not one single blank
  entry. A single blank entry must be explicitly quoted.

- The LF (\n) character is ignored as a Field separator (i.e., it is
  considered white space) but NOT ignored as data within a quoted entry. It
  is not legal to include this character as non-quoted data.

Carriage Return versus Line Feed
--------------------------------

On some systems, CSV file data may be read in text mode (this is a data
losing operation), in which case carriage returns (\r) are never read but
instead are removed and replaced by a linefeed (\n) if one is not already
immediately following. In this case, the grammar would have to be modified in
the CSVFile production rule:

    CSVFile = Field
            | Field [\n] CSVFile

Note that this does not solve all CSV structure problems when moving CSV
files between platforms which use different conventions for text file
delimiters (note that Windows/DOS, Unix and Mac OS are all different). In
particular, on some systems a carriage return followed by a linefeed may be
interpreted as two linefeeds. Because of these anomalies, and because it is
inherently data losing, it is highly recommended that where possible CSV data
not be consumed as files opened in text mode, but only as binary data.

Discussion
==========

Implementations
---------------

An incremental CSV parser can be written which is nothing more than a state
machine parser. (This means that it is possible to encode the CSV grammar
into a regular expression.) So it is not necessary to perform backwards or
bidirectional parsing, nor is an arbitrary amount of storage required to
perform the parsing operations. This also means that CSV file parsing can be
done with very high performance.
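For illustration, here is a minimal C++ sketch of such a state machine
parser. It is only a sketch of the approach (the class name CsvParser and its
methods are invented for the example): it omits white space trimming and the
forgiving ParadoxBug recovery rule described above, but it consumes octets
one at a time and never needs to look backwards or buffer more than the
current entry.

    #include <iostream>
    #include <string>
    #include <vector>

    // Sketch of an incremental, single pass CSV parser as a state machine.
    class CsvParser {
    public:
        // Feed one octet; completed Fields accumulate in fields().
        void feed(char c) {
            switch (state) {
            case Unquoted:
                if (c == '"' && entry.empty())  state = Quoted;
                else if (c == ',')              endEntry();
                else if (c == '\r')             endField();
                else if (c != '\n')             entry += c;
                break;
            case Quoted:
                if (c == '"') state = QuoteInQuoted;
                else          entry += c;
                break;
            case QuoteInQuoted:            // doubled quote or closing quote?
                if (c == '"')       { entry += '"'; state = Quoted; }
                else if (c == ',')  { endEntry(); state = Unquoted; }
                else if (c == '\r') { endField(); state = Unquoted; }
                // other octets after a closing quote are dropped in this sketch
                break;
            }
        }
        // Flush the final Field once end of input is reached.
        void finish() { if (!entry.empty() || !field.empty()) endField(); }

        const std::vector<std::vector<std::string>> &fields() const { return rows; }

    private:
        void endEntry() { field.push_back(entry); entry.clear(); }
        void endField() { endEntry(); rows.push_back(field); field.clear(); }

        enum State { Unquoted, Quoted, QuoteInQuoted };
        State state = Unquoted;
        std::string entry;                          // entry being accumulated
        std::vector<std::string> field;             // entries of current Field
        std::vector<std::vector<std::string>> rows; // completed Fields
    };

    int main() {
        CsvParser p;
        const std::string data = "1234,\"1234 West \"\"Q\"\" St.\"\r";
        for (char c : data) p.feed(c);
        p.finish();
        for (const auto &f : p.fields()) {
            for (const auto &e : f) std::cout << '[' << e << ']';
            std::cout << '\n';      // prints: [1234][1234 West "Q" St.]
        }
    }

Note that a parser of this shape already "unpacks" the data as it goes:
doubled quotes are compressed back to single quote octets while each entry is
accumulated.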
Many naive implementations assume that Fields correspond exactly to lines
read from the CSV file. This is not the case, because of the possibility of
quoting the CR octet. Trying to read a CSV file in a line by line manner will
generally not be an effective way of parsing it.

The C/C++ programming language is a notorious cesspool of "fixed array
length" kinds of solutions. Programmers should not be trapped by such ideas
in an attempt to parse generic CSV files. There are some well known libraries
that can be used in various ways to assist in the parsing of CSV files and
that escape the fixed array length problem for the C programming language.
These include:

    The Better String Library - http://bstring.sf.net/
    SGLIB - http://www.xref-tech.com/sglib/main.html

C++ programmers should familiarize themselves with the Standard Template
Library, in particular the std::string class and the std::vector<> template,
rather than using fixed length arrays.

The binary data in each entry needs to be properly "unpacked" from its CSV
encoding (to compress the doubled quotes back to single quotes). This binary
data is also clearly location independent and thus intrinsically
pre-serialized. This means that CSV can contain data in any arbitrary format
so long as it can be encoded in a sequence of octets. This includes Unicode
UTF-16 mode data. The problem is that octet-encoded UTF-16 may introduce
extra comma or quote octets even when they do not represent those characters
in the actual text. If, instead, UTF-8 encoding is used, then the appearance
of comma or quote octets in the encoding corresponds exactly one to one with
their appearance as actual code points in the Unicode text itself. Thus UTF-8
encoding has the advantage that quote usage detection and pre-doubling can be
performed either at the UTF-8 level or at the CSV encoding level, whereas
with UTF-16 this can only be done at the CSV encoding level. UTF-8 enabled
text editors can also be used on CSV files directly.

The original implementor of CSV output files is, of course, Microsoft.
However, Microsoft has chosen not to openly document the format, thus this
specification has been largely deduced. There is some confusion about how
Microsoft Excel's behavior is related to the CSV format. In particular, Excel
will remove white space and leading 0s from numeric entries. The position
taken in this specification is that this is not an inherent attribute of the
CSV format, but rather just how Excel chooses to deal with data after it
reads it from CSV files. Some CSV file users insist that CSV files have a
uniform column structure dictated by the first Field. Like Microsoft Excel's
interpretation, such layout constraints are entirely application specific and
not dictated by the CSV file format itself.

Possible Future Growth of the CSV standard
------------------------------------------

CSV is a very old historical standard. There have not been any known efforts
to clearly extend the capabilities of the CSV format. The syntax for CSV is
nearly complete; however, one clear opportunity for expansion lies in
unquoted entries which contain quote characters.
For example, if an entry were encoded as something like:

    ext"Extension-Name:extension-data

this would normally simply fall outside of the CSV grammar. But detecting it
still falls within the realm of a state machine implementation, so such a
strategy may represent a way to extend the standard. This specification does
not propose any extensions, nor does it take a position on the possible use
of this or other mechanisms for extending the CSV format beyond the grammar
described.

Advantages
----------

- The format is fairly straightforward to parse (compared to XML), and is
  universally portable to just about any system.

- The CSV format can encode recursively (a single data entry may itself be a
  quoted CSV encoding), meaning that it can be used to encode arbitrarily
  convoluted structures.

- The CSV format can encode both ASCII and UTF-8 directly, meaning that it
  can be read and manipulated directly with any sufficiently enabled text
  editor (though some text editors may unify CR and LF characters, which can
  result in lost data).

- No schema or template is assumed, which allows it to be freeform, and thus
  general enough to map to any deterministic static data format.

- CSV files and their parsers have very low overhead versus other more
  general formats such as XML.

Disadvantages
-------------

- Because it is freeform, the data format essentially only supports
  sequential processing (as opposed to random access).

- Historically, CSV files have been output by programs that deviated from the
  specification described above, making parsing them non-deterministic.

- The format does not contain any provisions for annotation or meta-data
  (compare to XML).

- The CSV file format was created by Microsoft, and thus is technically
  proprietary despite the existence of this specification -- this can
  represent a real problem if, in the future, Microsoft chooses to extend the
  format in ways that are not easy to deduce.