==================================================
ArcView(R) Version 2
Shapefile Technical Description
An ESRI White Paper
==================================================

CONTENTS
========

Why Shapefiles?
Shapefile Technical Description
Organization of the Main File
Main File Record Contents
Organization of the Index File
Organization of the dBASE File
Glossary

----------------------------------------------------------------------------------------------------

Copyright © 1994 Environmental Systems Research Institute, Inc. All rights reserved. Printed in the United States of America.

The information contained in this document is the exclusive property of Environmental Systems Research Institute, Inc. This work is protected under United States copyright law and other international copyright treaties and conventions. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage or retrieval system, except as expressly permitted in writing by Environmental Systems Research Institute, Inc. All requests should be sent to Environmental Systems Research Institute, Inc., 380 New York Street, Redlands, CA 92373 USA, Attention: Contracts Manager.

The information contained in this document is subject to change without notice.

RESTRICTED RIGHTS LEGEND

Use, duplication, and disclosure by the government are subject to restrictions as set forth in FAR §52.227-14 Alternate III (g)(3) (JUN 1987), FAR §52.227-19 (JUN 1987), or DFARS §252.227-7013 (c)(1)(ii) (OCT 1988), as applicable. Contractor/Manufacturer is Environmental Systems Research Institute, Inc., 380 New York Street, Redlands, CA 92373 USA.

ESRI, ARC/INFO, PC ARC/INFO, ArcView, and ArcCAD are registered trademarks; Avenue, the ESRI corporate logo, the ESRI globe logo, and the ArcView logo are trademarks of Environmental Systems Research Institute, Inc.
The names of other companies and products herein are trademarks or registered trademarks of their respective trademark owners.

--------------------------------------------------------------------------------

ArcView Version 2 Shapefile Technical Description

ArcView(R) Version 2 software introduces a new data file format for ArcView called a shapefile. This document defines this new spatial data format and describes why shapefiles are important. It lists the tools available in ESRI software for creating shapefiles directly or converting data into shapefiles from other formats. For organizations that want to write their own data translators, this document also provides all the technical information necessary for writing a computer program to create shapefiles without the use of ArcView or other ESRI software.

Why Shapefiles?
===============

A shapefile stores nontopological geometry and attribute information for the spatial features in a data set. The geometry for a feature is stored as a shape comprising a set of vector coordinates.

Because shapefiles do not have the processing overhead of a topological data structure, they have advantages over other data sources, such as faster drawing speed and editability. Shapefiles handle single features that overlap or that are noncontiguous. They also typically require less disk space and are easier to read and write.

ArcView uses shapefiles just as it uses coverages--as a data source for a feature theme. The rest of the software functionality is identical for both shapefiles and coverages. Shapefiles support data editing functions in ArcView Version 2.

Shapefiles can support point, line, and area features. Area features are represented as closed loop, double-digitized polygons. Attributes are held in a dBASE(R) format file. Each attribute record has a one-to-one relationship with the associated shape record.

How Shapefiles Can Be Created
--------------------------------------

Shapefiles can be created with the following four general methods:

1. Export from ArcView Version 2. Shapefiles can be created in ArcView Version 2 by exporting any theme to a shapefile.

2. Digitize in ArcView Version 2. Shapefiles can be created directly by digitizing shapes using ArcView Version 2 feature-creation tools.

3. Use the Avenue API. The Avenue Applications Programming Interface (API) provides tools to write shapefiles from another data source. For example, the script GPS2SHP reads coordinates from an ASCII file to create shapefiles with points, lines, or polygons. A more complex example is the script MIF2SHP, which uses the Avenue API to create shapefiles from data in MapInfo format. (These scripts, and others, are distributed with Avenue in the sample script library.)

4. Write to the file specifications. Write directly to the shapefile specifications by creating a program.

ARC/INFO, PC ARC/INFO(R), and ArcCAD(R) software provide shape-to-coverage data translators, and ARC/INFO also provides a coverage-to-shape translator. For exchange with other data formats, the shapefile specifications are published in this paper. Other data streams, such as those from global positioning system (GPS) receivers, can also be stored as shapefiles or XY event tables.

Shapefile Technical Description
==============================

Computer programs can be created to read or write shapefiles using the technical specification in this section.

An ArcView shapefile consists of a main file, an index file, and a dBASE table. The main file is a direct access, variable-record-length file in which each record describes a shape with a list of its vertices. In the index file, each record contains the offset of the corresponding main file record from the beginning of the main file. The dBASE table contains feature attributes with one record per feature.
The one-to-one relationship between geometry and attributes is based on record number. Attribute records in the dBASE file must be in the same order as records in the main file.

Naming Conventions
-------------------------

All file names adhere to the 8.3 naming convention. The main file, the index file, and the dBASE file have the same prefix. The suffix for the main file is ".shp". The suffix for the index file is ".shx". The suffix for the dBASE table is ".dbf".

Examples
    main file:    counties.shp
    index file:   counties.shx
    dBASE table:  counties.dbf

Numeric Types
------------------

A shapefile stores integer and double-precision numbers. The remainder of this document will refer to the following types:

    Integer:  Signed 32-bit integer (4 bytes)
    Double:   Signed 64-bit IEEE double-precision floating point number (8 bytes)

The first section below describes the general structure and organization of the shapefile. The second section describes the record contents for each type of shape supported in the shapefile.

Organization of the Main File
============================

The main file contains a fixed-length file header followed by variable-length records. Each variable-length record is made up of a fixed-length record header followed by variable-length record contents. Figure 1 illustrates the main file organization.

Figure 1
Organization of the Main File

+-------------------------------------+
|                                     |
|             File Header             |
|                                     |
+---------------+---------------------+-------------+
| Record Header | Record Contents                   |
+---------------+------------------------+----------+
| Record Header | Record Contents        |
+---------------+------------------------+---+
| Record Header | Record Contents            |
+---------------+----------------------------+
       ..
       ..
+---------------+---------------------------+
| Record Header | Record Contents           |
+---------------+---------------------------+

Byte Order
-------------

The integers and double-precision floating point numbers that make up the data description fields in the file header (identified below) and record contents in the main file are in little endian (PC or Intel(R)) byte order. The integers and double-precision floating point numbers that make up the rest of the file are in big endian (Sun(R) or Motorola(R)) byte order.

The Main File Header
--------------------------

The main file header is 100 bytes long. Table 1 shows the fields in the file header with their byte position, value, type, and byte order. In the table, position is with respect to the start of the file.

Table 1
Description of the Main File Header

Position   Field          Value         Type      Byte Order
Byte 0     File Code      9994          Integer   Big
Byte 4     Unused         0             Integer   Big
Byte 8     Unused         0             Integer   Big
Byte 12    Unused         0             Integer   Big
Byte 16    Unused         0             Integer   Big
Byte 20    Unused         0             Integer   Big
Byte 24    File Length    File Length   Integer   Big
Byte 28    Version        1000          Integer   Little
Byte 32    Shape Type     Shape Type    Integer   Little
Byte 36    Bounding Box   Xmin          Double    Little
Byte 44    Bounding Box   Ymin          Double    Little
Byte 52    Bounding Box   Xmax          Double    Little
Byte 60    Bounding Box   Ymax          Double    Little
Byte 68    Unused         0             Integer   Big
  .          .            .             .         .
  .          .            .             .         .
  .          .            .             .         .
Byte 96    Unused         0             Integer   Big

The value for file length is the total length of the file in 16-bit words (including the fifty 16-bit words that make up the header).

All the shapes in a shapefile are required to be of the same shape type. The values for shape type are as follows:

Value   Shape Type
1       Point
3       Arc
5       Polygon
8       MultiPoint

Record Headers
-------------------

The header for each record stores the record number and content length for the record. Record headers have a fixed length of 8 bytes. Table 2 shows the fields in the record header with their byte position, value, type, and byte order. In the table, position is with respect to the start of the record.

Table 2
Description of Main File Record Headers

Position   Field            Value            Type      Byte Order
Byte 0     Record Number    Record Number    Integer   Big
Byte 4     Content Length   Content Length   Integer   Big

Record numbers begin at 1.

The content length for a record is the length of the record contents section measured in 16-bit words. Each record, therefore, contributes (4 + content length) 16-bit words toward the total length of the file, as stored at Byte 24 in the File Header.

Main File Record Contents
=========================

Shapefile record contents consist of a shape type followed by the geometric data for the shape. The length of the record contents depends on the number of parts and vertices in a shape. For each shape type, we first describe the shape and then its mapping to record contents on disk. In Tables 3 through 6, position is with respect to the start of the record contents.

Point
-------

A point consists of a pair of double-precision coordinates in the order X, Y.

Point
{
    Double    X    // X coordinate
    Double    Y    // Y coordinate
}

Table 3
Point Record Contents

Position   Field        Value   Type      Number   Byte Order
Byte 0     Shape Type   1       Integer   1        Little
Byte 4     X            X       Double    1        Little
Byte 12    Y            Y       Double    1        Little

MultiPoint
--------------

A MultiPoint represents a set of points, as follows:

MultiPoint
{
    Double[4]          Box          // Bounding Box
    Integer            NumPoints    // Number of Points
    Point[NumPoints]   Points       // The Points in the set
}

The bounding box is stored in the order Xmin, Ymin, Xmax, Ymax.

Table 4
MultiPoint Record Contents

Position   Field        Value       Type      Number      Byte Order
Byte 0     Shape Type   8           Integer   1           Little
Byte 4     Box          Box         Double    4           Little
Byte 36    NumPoints    NumPoints   Integer   1           Little
Byte 40    Points       Points      Point     NumPoints   Little

Arc
----

A shapefile arc can consist of multiple PolyLines that are not necessarily connected to each other. A PolyLine is an ordered set of vertices. Each PolyLine is referred to as a part of the arc.

Arc
{
    Double[4]           Box          // Bounding Box
    Integer             NumParts     // Number of Parts
    Integer             NumPoints    // Total Number of Points
    Integer[NumParts]   Parts        // Index to first Point in Part
    Point[NumPoints]    Points       // Points for all parts
}

The fields for an arc are described in detail below:

Box:        The bounding box for the arc stored in the order Xmin, Ymin, Xmax, Ymax.

NumParts:   The number of PolyLines in the arc.

NumPoints:  The total number of points for all PolyLines.

Parts:      An array of length NumParts. Stores, for each PolyLine, the index of its first point in the points array. Array indexes are with respect to 0.

Points:     An array of length NumPoints. The points for each PolyLine in the arc are stored end to end. The points for PolyLine 2 follow the points for PolyLine 1, and so on. The parts array holds the array index of the starting point for each PolyLine. There is no delimiter in the points array between PolyLines.

Table 5
Arc Record Contents

Position   Field        Value       Type      Number      Byte Order
Byte 0     Shape Type   3           Integer   1           Little
Byte 4     Box          Box         Double    4           Little
Byte 36    NumParts     NumParts    Integer   1           Little
Byte 40    NumPoints    NumPoints   Integer   1           Little
Byte 44    Parts        Parts       Integer   NumParts    Little
Byte X     Points       Points      Point     NumPoints   Little

Note: X = 44 + 4 * NumParts.

Polygon
---------

A polygon consists of a number of rings. A ring is a closed, non-self-intersecting loop. The order of vertices or orientation for a ring indicates which side of the ring is within the polygon. The neighborhood to the right of an observer walking along the ring in vertex order is the neighborhood inside the polygon. Vertices for a single-ringed polygon are, therefore, always in clockwise order. The rings of a polygon are referred to as its parts.
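
As an illustration only (it is not part of the specification), the Python sketch below decodes arc record contents laid out as in Table 5 and splits the points array into its parts. It also includes a signed-area test for the clockwise vertex order described above, assuming a coordinate system in which Y increases upward. The function names decode_arc_contents and is_clockwise are hypothetical names chosen for this example.

```python
import struct


def decode_arc_contents(buf):
    """Decode arc record contents (Table 5 layout, little endian).

    Returns (shape_type, box, polylines), where polylines is a list
    of parts and each part is a list of (x, y) tuples.
    """
    (shape_type,) = struct.unpack_from("<i", buf, 0)
    box = struct.unpack_from("<4d", buf, 4)            # Xmin, Ymin, Xmax, Ymax
    num_parts, num_points = struct.unpack_from("<2i", buf, 36)
    parts = struct.unpack_from("<%di" % num_parts, buf, 44)
    x = 44 + 4 * num_parts                             # "Byte X" in Table 5
    flat = struct.unpack_from("<%dd" % (2 * num_points), buf, x)
    points = list(zip(flat[0::2], flat[1::2]))
    # There is no delimiter between parts: each part runs from its own
    # start index up to the start index of the next part.
    bounds = list(parts) + [num_points]
    polylines = [points[bounds[i]:bounds[i + 1]] for i in range(num_parts)]
    return shape_type, box, polylines


def is_clockwise(ring):
    """Return True if a closed ring is in clockwise order (Y axis up).

    Uses the shoelace formula: twice the signed area is negative for
    clockwise rings.
    """
    area2 = sum(x0 * y1 - x1 * y0
                for (x0, y0), (x1, y1) in zip(ring, ring[1:]))
    return area2 < 0
```

Under these assumptions, a single-ringed polygon such as (0,0), (0,1), (1,1), (1,0), (0,0) is reported as clockwise, and the same ring with its vertices reversed is not.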

The polygon structure is identical to the arc structure, as follows:

Polygon
{
    Double[4]           Box          // Bounding Box
    Integer             NumParts     // Number of Parts
    Integer             NumPoints    // Total Number of Points
    Integer[NumParts]   Parts        // Index to first Point in Part
    Point[NumPoints]    Points       // Points for all Parts
}

The fields for a polygon are described in detail below:

Box:        The bounding box for the polygon stored in the order Xmin, Ymin, Xmax, Ymax.

NumParts:   The number of rings in the polygon.

NumPoints:  The total number of points for all rings.

Parts:      An array of length NumParts. Stores, for each ring, the index of its first point in the points array. Array indexes are with respect to 0.

Points:     An array of length NumPoints. The points for each ring in the polygon are stored end to end. The points for Ring 2 follow the points for Ring 1, and so on. The parts array holds the array index of the starting point for each ring. There is no delimiter in the points array between rings.

The instance diagram in Figure 2 illustrates the representation of polygons.

The following are important notes about Polygon shapes.

The rings are closed (the first and last vertex of a ring MUST be the same).

The order of rings in the points array is not significant.

Polygons stored in a shapefile must be clean. The rings of a polygon cannot have segments that intersect each other. In other words, a segment belonging to one ring may not intersect a segment belonging to another ring. The rings of a polygon can touch each other at vertices, but not along segments.

Figure 2
An Example Polygon Instance

This figure shows a polygon with one hole and a total of eight vertices.

            v1
           /  \
          /    \
         /      \
        /   v5   \
       /   /  \   \
      /   /    \   \
    v4  v8      v6  v2
      \   \    /   /
       \   \  /   /
        \   v7   /
         \      /
          \    /
           \  /
            v3

For this example, NumParts equals 2 and NumPoints equals 10.

            0    1
         +----+----+
Parts:   |  0 |  5 |
         +----+----+
            |    |
            |    |
            |    +-------------------+
            |                        |
            0    1    2    3    4    5    6    7    8    9
         +----+----+----+----+----+----+----+----+----+----+
Points:  | v1 | v2 | v3 | v4 | v1 | v5 | v8 | v7 | v6 | v5 |
         +----+----+----+----+----+----+----+----+----+----+

Table 6
Polygon Record Contents

Position   Field        Value       Type      Number      Byte Order
Byte 0     Shape Type   5           Integer   1           Little
Byte 4     Box          Box         Double    4           Little
Byte 36    NumParts     NumParts    Integer   1           Little
Byte 40    NumPoints    NumPoints   Integer   1           Little
Byte 44    Parts        Parts       Integer   NumParts    Little
Byte X     Points       Points      Point     NumPoints   Little

Note: X = 44 + 4 * NumParts.

Organization of the Index File
=============================

The index file contains a 100-byte header followed by 8-byte, fixed-length records. Figure 3 illustrates the index file organization.

Figure 3
Organization of the Index File

    File Header
    Record
    Record
    Record
    Record
    . . .
    . . .
    Record

The Index File Header
---------------------------

The index file header is identical in organization to the main file header described above.

The file length stored in the index file header is the total length of the index file in 16-bit words (the fifty 16-bit words of the header plus 4 times the number of records).

Index Records
-----------------

The I'th record in the index file stores the offset and content length for the I'th record in the main file. Table 7 shows the fields in an index record with their byte position, value, type, and byte order. In the table, position is with respect to the start of the index file record.

Table 7
Description of Index Records

Position   Field            Value            Type      Byte Order
Byte 0     Offset           Offset           Integer   Big
Byte 4     Content Length   Content Length   Integer   Big

The offset of a record in the main file is the number of 16-bit words from the start of the main file to the first byte of the record header for the record. Thus, the offset for the first record in the main file is 50, given the 100-byte header.
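
The offset arithmetic can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: it builds the big endian index records for a list of content lengths (in 16-bit words) and reads them back. The names build_index_records and read_index_records are hypothetical, and the 100-byte index file header is deliberately left out.

```python
import struct

HEADER_WORDS = 50         # 100-byte file header, in 16-bit words
RECORD_HEADER_WORDS = 4   # 8-byte record header, in 16-bit words


def build_index_records(content_lengths):
    """Build index records for main file records with the given
    content lengths (each measured in 16-bit words).

    Returns (records, main_file_words): the packed 8-byte index
    records and the total main file length in 16-bit words.
    """
    offset = HEADER_WORDS                      # first record follows the header
    out = bytearray()
    for length in content_lengths:
        out += struct.pack(">2i", offset, length)   # big endian, per Table 7
        offset += RECORD_HEADER_WORDS + length      # advance past this record
    return bytes(out), offset


def read_index_records(data):
    """Return a list of (offset, content_length) tuples, in 16-bit words."""
    return [struct.unpack_from(">2i", data, pos)
            for pos in range(0, len(data), 8)]
```

For example, three Point records (each with a content length of 10 words, that is, 20 bytes) yield offsets 50, 64, and 78, and a main file length of 92 words. Multiplying an offset by 2 gives the byte position to seek to in the main file.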
The content length stored in the index record is the same as the value stored in the main file record header.

Organization of the dBASE File
=============================

The dBASE file contains any desired feature attributes or attribute keys to which other tables can be joined. Its format is a standard DBF file used by many table-based applications in Windows(TM) and DOS. Any set of fields can be present in the table. There are three requirements, as follows:

The file name must have the same prefix as the shape and index file. Its suffix must be ".dbf". (See the example in Naming Conventions.)

The table must contain one record per shape feature.

The record order must be the same as the order of shape features in the main (*.shp) file.

Glossary
=========

Key terms are defined below that will help you understand the concepts discussed in this document.

ARC/INFO
--------------
ARC/INFO software is designed for users who require a complete set of tools for processing and manipulating spatial data, including digitizing, editing, coordinate management, network analysis, surface modeling, and grid cell based modeling. ARC/INFO operates on a large variety of workstations and minicomputers. Using open standards and client/server architecture, ARC/INFO can act as a GIS server for ArcView clients.

ArcCAD
-----------
ArcCAD software brings the powerful functionality of ARC/INFO, the leading geographic information system (GIS) software, to AutoCAD, the leading Computer-Aided Design (CAD) software. ArcCAD adds GIS functionality such as geographic data entry and editing, selection and query, and spatial analysis and modeling. Data can be stored in either ARC/INFO or AutoCAD format. ArcView is included with ArcCAD.

ArcView
-----------
ArcView software is a powerful, easy-to-use desktop GIS that gives you the power to visualize, explore, query, and analyze data spatially. ArcView operates in Windows desktop environments as well as on a large variety of workstations.
In addition to acting as a stand-alone GIS, ArcView can intercommunicate with other applications. Two examples include interfacing with a global positioning system (GPS) receiver to collect map data interactively or using ARC/INFO as a server to execute sophisticated analysis.

Avenue
---------
Avenue software is an object-oriented programming language and development environment created for use with ArcView software. Avenue can be used to extend ArcView software's basic capabilities and customize ArcView for specific applications. Avenue is available separately or bundled with ArcView at a special price.

big endian byte order
-------------------------
Left-to-right byte ordering of an integer word. This byte-ordering method is used on many UNIX systems including Sun, Hewlett-Packard(R), IBM(R), and Data General AViiON(R).

bounding box
----------------
A bounding box is a rectangle surrounding each shape (e.g., arc) that is just large enough to contain the entire shape. It is defined as Xmin, Ymin, Xmax, Ymax.

coverage
-----------
1. A digital version of a map forming the basic unit of vector data storage in ARC/INFO software. A coverage stores geographic features as primary features (such as arcs, nodes, polygons, and label points) and secondary features (such as tics, map extent, links, and annotation). Associated feature attribute tables describe and store attributes of the geographic features.

2. A set of thematically associated data considered as a unit. A coverage usually represents a single theme such as soils, streams, roads, or land use.

feature
--------
A representation of a geographic feature which has both a spatial representation referred to as a "shape" and a set of attributes.

index file
-----------
An ArcView shapefile index file is a file that allows direct access to records in the corresponding main file.

little endian byte order
--------------------------
Right-to-left byte ordering of an integer word.
This byte-ordering method is used on many operating systems including DEC OSF/1(TM), DEC OpenVMS(TM), MS-DOS(R), and Windows NT(TM).

MultiPoint
------------
A single feature composed of a cluster of point locations and a single attribute record. The group of points represents the geographic feature.

NumPoints
-------------
The count of the number of x,y vertices contained in a shape.

PC ARC/INFO
-------------------
PC ARC/INFO is a full-featured GIS for PC compatibles. Like ARC/INFO software, PC ARC/INFO is used by organizations around the world for automating, managing, and analyzing geographic information. Attributes describing geographic features are stored as tabular files in dBASE format.

PolyLine
-----------
An ordered set of x,y vertices representing a line or boundary.

ring
-----
An ordered set of x,y vertices where the first vertex is the same location as the last vertex; a closed PolyLine or a polygon.

shapefile
-----------
An ArcView data set used to represent a set of geographic features such as streets, hospital locations, trade areas, and ZIP Code boundaries. Shapefiles can represent point, line, or area features. Each feature in a shapefile represents a single geographic feature and its attributes.

theme
-------
A user-defined set of geographic features. Data sources for themes in ArcView include coverages, grids, images, and shapefiles. Theme properties include the data source name, attributes of interest, a data classification scheme, and drawing methodology.

topology
-----------
The spatial relationships between connecting or adjacent coverage features (e.g., arcs, nodes, polygons, and points). For example, the topology of an arc includes its from- and to-nodes and its left and right polygons. Topological relationships are built from simple elements into complex elements: points (simplest elements) and arcs (sets of connected points) are used to represent more complex features such as areas (sets of connected arcs).
Shapefiles do not explicitly record topology. Coverages represent geographic features as topological line graphs. Topology can be useful for many GIS modeling operations that do not require coordinates. For example, to find an optimal path between two points requires a list of the arcs that connect to each other and the cost to traverse each arc in each direction. Coordinates are only needed for drawing the path after it is calculated.

vector
--------
A Cartesian (i.e., x,y) coordinate-based data structure commonly used to represent geographic features. Each feature is represented as one or more vertices. Attributes are associated with the feature. Other data structures include raster (which associates attributes with a grid cell) and triangulated irregular networks (TINs) for surface representation.

vertex
--------
One of a set of ordered x,y coordinates that constitutes a line.