diff --git a/specification/ORCv2.md b/specification/ORCv2.md
index 73daf6e..9aaeb91 100644
--- a/specification/ORCv2.md
+++ b/specification/ORCv2.md
@@ -261,6 +261,8 @@ message Type {
     VARCHAR = 16;
     CHAR = 17;
     TIMESTAMP_INSTANT = 18;
+    GEOMETRY = 19;
+    GEOGRAPHY = 20;
   }
   // the kind of this type
   required Kind kind = 1;
@@ -273,9 +275,84 @@ message Type {
   // the precision and scale for decimal
   optional uint32 precision = 5;
   optional uint32 scale = 6;
+  repeated StringPair attributes = 7;
+  // the attributes associated with the geometry type
+  optional GeometryType geometry = 8;
+  // Coordinate Reference System (CRS) for Geometry and Geography types
+  optional string crs = 8;
+  // Edge interpolation algorithm for Geography type
+  enum EdgeInterpolationAlgorithm {
+    SPHERICAL = 0;
+    VINCENTY = 1;
+    THOMAS = 2;
+    ANDOYER = 3;
+    KARNEY = 4;
+  }
+  optional EdgeInterpolationAlgorithm algorithm = 9;
 }
 ```
 
+#### Geometry & Geography Types
+
+##### Background
+
+The Geometry and Geography class hierarchy and its Well-Known Text (WKT) and
+Well-Known Binary (WKB) serializations (ISO variant supporting XY, XYZ, XYM,
+XYZM) are defined by [OpenGIS Implementation Specification for Geographic
+information - Simple feature access - Part 1: Common architecture][sfa-part1],
+from [OGC (Open Geospatial Consortium)][ogc].
+
+The version of the OGC standard first used here is 1.2.1, but future versions
+may also be used if the WKB representation remains wire-compatible.
+
+[sfa-part1]: https://portal.ogc.org/files/?artifact_id=25355
+[ogc]: https://www.ogc.org/standard/sfa/
+
+###### Coordinate Reference System
+
+A Coordinate Reference System (CRS) defines how coordinates map to
+locations on Earth.
+
+The default CRS `OGC:CRS84` means that the geospatial features must be stored
+in the order of longitude/latitude based on the WGS84 datum.
+
+A custom CRS can be specified by a string value. It is recommended to use an
+identifier-based approach like [Spatial reference identifier][srid].
+
+For geographic CRS, longitudes are bound by [-180, 180] and latitudes are bound
+by [-90, 90].
+
+[srid]: https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier
+
+###### Edge Interpolation Algorithm
+
+The algorithm used to interpolate edges. It is one of the following values:
+
+* `spherical`: edges are interpolated as geodesics on a sphere.
+* `vincenty`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae)
+* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970.
+* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
+* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/)
+
+###### CRS Customization
+
+CRS is represented as a string value. Writer and reader implementations are
+responsible for serializing and deserializing the CRS, respectively.
+
+As a convention to maximize interoperability, custom CRS values can be
+specified by a string of the format `type:identifier`, where `type` is one of
+the following values:
+
+* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself.
+* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored.
+
+###### Coordinate Axis Order
+
+The axis order of the coordinates in WKB and bounding box stored here
+follows the de facto standard for axis order in WKB and is therefore always
+(x, y) where x is easting or longitude and y is northing or latitude. This
+ordering explicitly overrides the axis order as specified in the CRS.
+
 ### Column Statistics
 
 The goal of the column statistics is that for each column, the writer
@@ -303,6 +380,7 @@ message ColumnStatistics {
   optional bool hasNull = 10;
   optional uint64 bytes_on_disk = 11;
   optional CollectionStatistics collection_statistics = 12;
+  optional GeospatialStatistics geospatial_statistics = 13;
 }
 ```
 
@@ -397,6 +475,88 @@ message BinaryStatistics {
 }
 ```
 
+Geometry and Geography columns store optional bounding boxes and a list of
+geospatial type codes from all values.
+
+**Bounding Box**
+
+A geospatial instance has at least two coordinate dimensions: X and Y for 2D
+coordinates of each point. Please note that X is longitude/easting and Y is
+latitude/northing. A geospatial instance can optionally have Z and/or M values
+associated with each point.
+
+The Z values introduce the third dimension coordinate. Usually they are used to
+indicate the height, or elevation.
+
+M values are an opportunity for a geospatial instance to express a fourth
+dimension as a coordinate value. These values can be used as a linear reference
+value (e.g., highway milepost value), a timestamp, or some other value as defined
+by the CRS.
+
+The bounding box is defined as the protobuf message below, represented as a
+min/max value pair of coordinates from each axis. Note that X and Y values are
+always present. Z and M are omitted for 2D geospatial instances.
+
+For the X values only, xmin may be greater than xmax. In this case, an object
+in this bounding box may match if it contains an X such that `x >= xmin` OR
+`x <= xmax`. This wraparound occurs only when the corresponding bounding box
+crosses the antimeridian line. In geographic terminology, the concepts of `xmin`,
+`xmax`, `ymin`, and `ymax` are also known as `westernmost`, `easternmost`,
+`southernmost` and `northernmost`, respectively.
+
+For Geography type, X and Y values are restricted to the canonical ranges of
+[-180, 180] for X and [-90, 90] for Y.
+
+**Geospatial Types**
+
+A list of geospatial types from all instances in the Geometry or Geography
+column, or an empty list if they are not known.
+
+This is borrowed from [geometry_types of GeoParquet][geometry-types] except that
+values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
+The table below shows the most common geospatial types and their codes:
+
+| Type               | XY   | XYZ  | XYM  | XYZM |
+| :----------------- | :--- | :--- | :--- | :--: |
+| Point              | 0001 | 1001 | 2001 | 3001 |
+| LineString         | 0002 | 1002 | 2002 | 3002 |
+| Polygon            | 0003 | 1003 | 2003 | 3003 |
+| MultiPoint         | 0004 | 1004 | 2004 | 3004 |
+| MultiLineString    | 0005 | 1005 | 2005 | 3005 |
+| MultiPolygon       | 0006 | 1006 | 2006 | 3006 |
+| GeometryCollection | 0007 | 1007 | 2007 | 3007 |
+
+In addition, the following rules are applied:
+- A list of multiple values indicates that multiple geospatial types are present (e.g. `[0003, 0006]`).
+- An empty array explicitly signals that the geospatial types are not known.
+- The geospatial types in the list must be unique (e.g. `[0001, 0001]` is not valid).
+
+[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
+[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
+
+```protobuf
+// Bounding box for Geometry or Geography type in the representation of min/max
+// value pair of coordinates from each axis.
+message BoundingBox {
+  optional double xmin = 1;
+  optional double xmax = 2;
+  optional double ymin = 3;
+  optional double ymax = 4;
+  optional double zmin = 5;
+  optional double zmax = 6;
+  optional double mmin = 7;
+  optional double mmax = 8;
+}
+
+// Statistics specific to Geometry or Geography type
+message GeospatialStatistics {
+  // A bounding box of geospatial instances
+  optional BoundingBox bbox = 1;
+  // Geospatial type codes of all instances, or an empty list if not known
+  repeated int32 geospatial_types = 2;
+}
+```
+
 ### User Metadata
 
 The user can add arbitrary key/value pairs to an ORC file as it is
@@ -1235,6 +1395,21 @@ Encoding | Stream Kind | Optional | Contents
 DIRECT        | PRESENT         | Yes      | Boolean RLE
               | DIRECT          | No       | Byte RLE
 
+## Geometry & Geography Columns
+
+Geometry and Geography data are encoded with a PRESENT stream, a DATA stream that records
+the WKB-encoded geometry/geography data as binary, and a LENGTH stream that records
+the number of bytes per value.
+
+Encoding      | Stream Kind     | Optional | Contents
+:------------ | :-------------- | :------- | :-------
+DIRECT        | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Binary contents
+              | LENGTH          | No       | Unsigned Integer RLE v1
+DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
+              | DATA            | No       | Binary contents
+              | LENGTH          | No       | Unsigned Integer RLE v2
+
 # Indexes
 
 ## Row Group Index
diff --git a/src/main/proto/orc/proto/orc_proto.proto b/src/main/proto/orc/proto/orc_proto.proto
index 16c5523..1c38fc7 100644
--- a/src/main/proto/orc/proto/orc_proto.proto
+++ b/src/main/proto/orc/proto/orc_proto.proto
@@ -84,6 +84,27 @@ message CollectionStatistics {
   optional uint64 total_children = 3;
 }
 
+// Bounding box for Geometry or Geography type in the representation of min/max
+// value pair of coordinates from each axis.
+message BoundingBox {
+  optional double xmin = 1;
+  optional double xmax = 2;
+  optional double ymin = 3;
+  optional double ymax = 4;
+  optional double zmin = 5;
+  optional double zmax = 6;
+  optional double mmin = 7;
+  optional double mmax = 8;
+}
+
+// Statistics specific to Geometry or Geography type
+message GeospatialStatistics {
+  // A bounding box of geospatial instances
+  optional BoundingBox bbox = 1;
+  // Geospatial type codes of all instances, or an empty list if not known
+  repeated int32 geospatial_types = 2;
+}
+
 message ColumnStatistics {
   optional uint64 number_of_values = 1;
   optional IntegerStatistics int_statistics = 2;
@@ -97,6 +118,7 @@ message ColumnStatistics {
   optional bool has_null = 10;
   optional uint64 bytes_on_disk = 11;
   optional CollectionStatistics collection_statistics = 12;
+  optional GeospatialStatistics geospatial_statistics = 13;
 }
 
 message RowIndexEntry {
@@ -216,6 +238,8 @@ message Type {
     VARCHAR = 16;
     CHAR = 17;
     TIMESTAMP_INSTANT = 18;
+    GEOMETRY = 19;
+    GEOGRAPHY = 20;
   }
   optional Kind kind = 1;
   repeated uint32 subtypes = 2 [packed=true];
@@ -224,6 +248,18 @@ message Type {
   optional uint32 precision = 5;
   optional uint32 scale = 6;
   repeated StringPair attributes = 7;
+
+  // Coordinate Reference System (CRS) for Geometry and Geography types
+  optional string crs = 8;
+  // Edge interpolation algorithm for Geography type
+  enum EdgeInterpolationAlgorithm {
+    SPHERICAL = 0;
+    VINCENTY = 1;
+    THOMAS = 2;
+    ANDOYER = 3;
+    KARNEY = 4;
+  }
+  optional EdgeInterpolationAlgorithm algorithm = 9;
 }
 
 message StripeInformation {
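
Reviewer note: two rules from the spec text in this diff may be easier to check with a small sketch. The first is the X-axis wraparound test (`xmin` greater than `xmax` when a box crosses the antimeridian). The second is how an ISO WKB integer code from the type table splits into a dimension and a base type. The Python below is only an illustration, not ORC reader code: the `BoundingBox` dataclass and the helper names `x_may_match`, `point_may_match`, and `split_type_code` are hypothetical and merely mirror the X/Y fields of the protobuf message.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BoundingBox:
    """Hypothetical mirror of the X/Y fields of the BoundingBox message."""
    xmin: float
    xmax: float
    ymin: float
    ymax: float


def x_may_match(bbox: BoundingBox, x: float) -> bool:
    """Apply the X rule from the spec text, including antimeridian wraparound."""
    if bbox.xmin <= bbox.xmax:
        # Ordinary box: X must lie between the two bounds.
        return bbox.xmin <= x <= bbox.xmax
    # Wraparound box (xmin > xmax): match if x >= xmin OR x <= xmax.
    return x >= bbox.xmin or x <= bbox.xmax


def point_may_match(bbox: BoundingBox, x: float, y: float) -> bool:
    """Y never wraps, so it is always a plain range check."""
    return x_may_match(bbox, x) and bbox.ymin <= y <= bbox.ymax


def split_type_code(code: int) -> Tuple[str, int]:
    """Split an ISO WKB code such as 3006 into (dimension, base type code).

    Assumes codes in the range covered by the table (0001 to 3007).
    """
    dimensions = ("XY", "XYZ", "XYM", "XYZM")
    return dimensions[code // 1000], code % 1000


# A box spanning 170 degrees east to 170 degrees west across the antimeridian.
wrapped = BoundingBox(xmin=170.0, xmax=-170.0, ymin=-10.0, ymax=10.0)
print(point_may_match(wrapped, 175.0, 0.0))   # inside the wrapped range
print(point_may_match(wrapped, 0.0, 0.0))     # outside the wrapped range
print(split_type_code(3006))                  # MultiPolygon with Z and M
```

A reader doing predicate pushdown would presumably use a check of this shape to decide whether a stripe or row group can be skipped. Since the statistics are optional, a missing `bbox` has to be treated as "may match".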