From 8186f56801e44aef1bd8784010f4a679fe228a43 Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Tue, 7 Jun 2022 15:40:36 +0200 Subject: [PATCH 01/12] current state of RLE doc --- docs/source/format/Columnar.rst | 79 +++++++++++++++++++++++++++++++++ 1 file changed, 79 insertions(+) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 62bb922afdd..e2bbb7628b8 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -765,6 +765,85 @@ application. We discuss dictionary encoding as it relates to serialization further below. +.. _run-length-encoded-layout: + +Run-Length-encoded Layout +------------------------- + +Run-Length is a data representation that represents data as sequences of the +same a, called runs. Each run is represented as a value, and an integer +describing how often this value is repeated. + +Any array can be run-length-encoded. A run-length encoded array has a single +buffer holding as many 32-bit integers, as there are runs. The actual values are +hold in a child array, which is just a regular array + +The dictionary is stored as an optional +property of an array. When a field is dictionary encoded, the values are +represented by an array of non-negative integers representing the index of the +value in the dictionary. The memory layout for a dictionary-encoded array is +the same as that of a primitive integer layout. The dictionary is handled as a +separate columnar array with its own respective layout. 
+ +As an example, you could have the following data: :: + + type: Float32 + + [1.0, 1.0, 1.0, 1.0, null, 'null', 2.0] + +In Run-length-encoded form, this could appear as: + +:: + + * Length: 3, Null count: 0 + * Accumulated run lengths buffer: + + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | + |-------------|-------------|-------------|-----------------------| + | 4 | 6 | 7 | unspecified (padding) | + + * Children arrays: + + * values (Float32): + * Length: 3, Null count: 1 + * Validity bitmap buffer: + + |Byte 0 (validity bitmap) | Bytes 1-63 | + |-------------------------|-----------------------| + |00000101 | 0 (padding) | + + * Values buffer + + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | + |-------------|-------------|-------------|-----------------------| + | 1.0 | unspecified | 2.0 | unspecified (padding) | + + +Note that a dictionary is permitted to contain duplicate values or +nulls: + +:: + + data VarBinary (dictionary-encoded) + index_type: Int32 + values: [0, 1, 3, 1, 4, 2] + + dictionary + type: VarBinary + values: ['foo', 'bar', 'baz', 'foo', null] + +The null count of such arrays is dictated only by the validity bitmap +of its indices, irrespective of any null values in the dictionary. + +Since unsigned integers can be more difficult to work with in some cases +(e.g. in the JVM), we recommend preferring signed integers over unsigned +integers for representing dictionary indices. Additionally, we recommend +avoiding using 64-bit unsigned integer indices unless they are required by an +application. + +We discuss dictionary encoding as it relates to serialization further +below. 
+ Buffer Listing for Each Layout ------------------------------ From 70ea2fabdb05c88cf30eaa570235b96309908cd9 Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Tue, 7 Jun 2022 19:39:16 +0200 Subject: [PATCH 02/12] formatting --- docs/source/format/Columnar.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index e2bbb7628b8..919ec021564 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -808,15 +808,15 @@ In Run-length-encoded form, this could appear as: * Length: 3, Null count: 1 * Validity bitmap buffer: - |Byte 0 (validity bitmap) | Bytes 1-63 | - |-------------------------|-----------------------| - |00000101 | 0 (padding) | + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|-----------------------| + | 00000101 | 0 (padding) | - * Values buffer + * Values buffer - | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | - |-------------|-------------|-------------|-----------------------| - | 1.0 | unspecified | 2.0 | unspecified (padding) | + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | + |-------------|-------------|-------------|-----------------------| + | 1.0 | unspecified | 2.0 | unspecified (padding) | Note that a dictionary is permitted to contain duplicate values or From 2ef0a24210d0c959d64a8f0ecc769c7e1dae6087 Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Tue, 7 Jun 2022 19:42:30 +0200 Subject: [PATCH 03/12] minor fixes --- docs/source/format/Columnar.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 919ec021564..c74724fa0c9 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -789,13 +789,13 @@ As an example, you could have the following data: :: type: Float32 - [1.0, 1.0, 1.0, 1.0, null, 'null', 2.0] + [1.0, 1.0, 1.0, 1.0, null, null, 2.0] In 
Run-length-encoded form, this could appear as: :: - * Length: 3, Null count: 0 + * Length: 3, Null count: 2 * Accumulated run lengths buffer: | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | From 348600954fdc0e408637923c01ff01c8219b43e9 Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Thu, 16 Jun 2022 15:35:08 +0200 Subject: [PATCH 04/12] replace copy-paste mistake with actual rle description --- docs/source/format/Columnar.rst | 50 ++++++++++----------------------- 1 file changed, 15 insertions(+), 35 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index c74724fa0c9..39de4f93682 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -775,20 +775,25 @@ same a, called runs. Each run is represented as a value, and an integer describing how often this value is repeated. Any array can be run-length-encoded. A run-length encoded array has a single -buffer holding as many 32-bit integers, as there are runs. The actual values are -hold in a child array, which is just a regular array +buffer holding as many 32-bit integers, as there are runs. The actual values +are hold in a child array, which is just a regular array. -The dictionary is stored as an optional -property of an array. When a field is dictionary encoded, the values are -represented by an array of non-negative integers representing the index of the -value in the dictionary. The memory layout for a dictionary-encoded array is -the same as that of a primitive integer layout. The dictionary is handled as a -separate columnar array with its own respective layout. +The values in the parent array buffer represent the length of each run. They do +not hold the length of the respective run directly, but the accumulated length +of all runs from the first to the current one. This allows relatively efficient +random access from a logical index using binary search. 
The length of an +individual run can be determined by subtracting two adjacent values. + +A run has to have a length of at least 1. This means the values in the +accumulated run lengths buffer are all positive and in strictly ascending +order. + +An accumulated run length cannot be null, therefore the parent array has no +validity buffer. As an example, you could have the following data: :: type: Float32 - [1.0, 1.0, 1.0, 1.0, null, null, 2.0] In Run-length-encoded form, this could appear as: @@ -819,31 +824,6 @@ In Run-length-encoded form, this could appear as: | 1.0 | unspecified | 2.0 | unspecified (padding) | -Note that a dictionary is permitted to contain duplicate values or -nulls: - -:: - - data VarBinary (dictionary-encoded) - index_type: Int32 - values: [0, 1, 3, 1, 4, 2] - - dictionary - type: VarBinary - values: ['foo', 'bar', 'baz', 'foo', null] - -The null count of such arrays is dictated only by the validity bitmap -of its indices, irrespective of any null values in the dictionary. - -Since unsigned integers can be more difficult to work with in some cases -(e.g. in the JVM), we recommend preferring signed integers over unsigned -integers for representing dictionary indices. Additionally, we recommend -avoiding using 64-bit unsigned integer indices unless they are required by an -application. - -We discuss dictionary encoding as it relates to serialization further -below. - Buffer Listing for Each Layout ------------------------------ @@ -1036,7 +1016,7 @@ The ``Buffer`` Flatbuffers value describes the location and size of a piece of memory. Generally these are interpreted relative to the **encapsulated message format** defined below. -The ``size`` field of ``Buffer`` is not required to account for padding +The ``size`` field of ``Buffer`` is not required to account for paddingeng-career-mgmt bytes. 
Since this metadata can be used to communicate in-memory pointer addresses between libraries, it is recommended to set ``size`` to the actual memory size rather than the padded size. From 7c348addbda42a46176ffab23a8c2150f8283828 Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Thu, 16 Jun 2022 23:40:52 +0200 Subject: [PATCH 05/12] small fixes from PR comments --- docs/source/format/Columnar.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 39de4f93682..a7022df09e2 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -771,12 +771,12 @@ Run-Length-encoded Layout ------------------------- Run-Length is a data representation that represents data as sequences of the -same a, called runs. Each run is represented as a value, and an integer +same value, called runs. Each run is represented as a value, and an integer describing how often this value is repeated. Any array can be run-length-encoded. A run-length encoded array has a single -buffer holding as many 32-bit integers, as there are runs. The actual values -are hold in a child array, which is just a regular array. +buffer holding as many signed 32-bit integers, as there are runs. The actual +values are hold in a child array, which is just a regular array. The values in the parent array buffer represent the length of each run. 
They do not hold the length of the respective run directly, but the accumulated length From cba824e7bf0ff71d0b33b884628b356cfd616812 Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Thu, 16 Jun 2022 23:41:41 +0200 Subject: [PATCH 06/12] hold -> held --- docs/source/format/Columnar.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index a7022df09e2..b325e86eb8b 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -776,7 +776,7 @@ describing how often this value is repeated. Any array can be run-length-encoded. A run-length encoded array has a single buffer holding as many signed 32-bit integers, as there are runs. The actual -values are hold in a child array, which is just a regular array. +values are held in a child array, which is just a regular array. The values in the parent array buffer represent the length of each run. They do not hold the length of the respective run directly, but the accumulated length From f1e1a16c03254dbe56de7ae83b35cd70d2214ebf Mon Sep 17 00:00:00 2001 From: zagto Date: Mon, 27 Jun 2022 17:40:35 +0200 Subject: [PATCH 07/12] Apply suggestions from code review Co-authored-by: Weston Pace --- docs/source/format/Columnar.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index b325e86eb8b..732ac3dbc64 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -767,15 +767,15 @@ below. .. _run-length-encoded-layout: -Run-Length-encoded Layout +Run-Length Encoded Layout ------------------------- Run-Length is a data representation that represents data as sequences of the same value, called runs. Each run is represented as a value, and an integer describing how often this value is repeated. -Any array can be run-length-encoded. 
A run-length encoded array has a single -buffer holding as many signed 32-bit integers, as there are runs. The actual +Any array can be run-length encoded. A run-length encoded array has a single +buffer holding a signed 32-bit integer for each run. The actual values are held in a child array, which is just a regular array. The values in the parent array buffer represent the length of each run. They do @@ -1016,7 +1016,7 @@ The ``Buffer`` Flatbuffers value describes the location and size of a piece of memory. Generally these are interpreted relative to the **encapsulated message format** defined below. -The ``size`` field of ``Buffer`` is not required to account for paddingeng-career-mgmt +The ``size`` field of ``Buffer`` is not required to account for padding bytes. Since this metadata can be used to communicate in-memory pointer addresses between libraries, it is recommended to set ``size`` to the actual memory size rather than the padded size. From e3d3e9b3812783f1826919cce788e785943c3682 Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Mon, 27 Jun 2022 17:50:22 +0200 Subject: [PATCH 08/12] make rle parent length the logical length (code already works like this) --- docs/source/format/Columnar.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 732ac3dbc64..11fa88191b7 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -800,7 +800,7 @@ In Run-length-encoded form, this could appear as: :: - * Length: 3, Null count: 2 + * Length: 7, Null count: 2 * Accumulated run lengths buffer: | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | From 77bb500871711ba0f8d861bc50ed53168b42b06f Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Thu, 25 Aug 2022 18:21:35 +0200 Subject: [PATCH 09/12] update columnar format doc --- docs/source/format/Columnar.rst | 38 +++++++++++++++++---------------- 1 file changed, 20 insertions(+), 18 deletions(-) diff 
--git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 11fa88191b7..063bbfb90e0 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -774,22 +774,20 @@ Run-Length is a data representation that represents data as sequences of the same value, called runs. Each run is represented as a value, and an integer describing how often this value is repeated. -Any array can be run-length encoded. A run-length encoded array has a single -buffer holding a signed 32-bit integer for each run. The actual -values are held in a child array, which is just a regular array. +Any array can be run-length encoded. A run-length encoded array has no buffers +by itself, but has two child arrays. The first one holds a signed 32-bit integer +for each run. The actual values of each run are held the second child array. -The values in the parent array buffer represent the length of each run. They do +The values in the first child array represent the length of each run. They do not hold the length of the respective run directly, but the accumulated length -of all runs from the first to the current one. This allows relatively efficient -random access from a logical index using binary search. The length of an -individual run can be determined by subtracting two adjacent values. +of all runs from the first to the current one, i.e. the logical index where the +current run ends. This allows relatively efficient random access from a logical +index using binary search. The length of an individual run can be determined by +subtracting two adjacent values. A run has to have a length of at least 1. This means the values in the -accumulated run lengths buffer are all positive and in strictly ascending -order. - -An accumulated run length cannot be null, therefore the parent array has no -validity buffer. +run ends array all positive and in strictly ascending order. A run end cannot be +null. 
As an example, you could have the following data: :: @@ -801,15 +799,18 @@ In Run-length-encoded form, this could appear as: :: * Length: 7, Null count: 2 - * Accumulated run lengths buffer: + * Children arrays: - | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | - |-------------|-------------|-------------|-----------------------| - | 4 | 6 | 7 | unspecified (padding) | + * run ends (Int32): + * Length: 3, Null count: 0 + * Validity bitmap buffer: Not required + * Values buffer - * Children arrays: + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | + |-------------|-------------|-------------|-----------------------| + | 4 | 6 | 7 | unspecified (padding) | - * values (Float32): + * values (Float32): * Length: 3, Null count: 1 * Validity bitmap buffer: @@ -843,6 +844,7 @@ of memory buffers for each layout. "Dense Union",type ids,offsets, "Null",,, "Dictionary-encoded",validity,data (indices), + "Run-length encoded",,, Logical Types ============= From b5016746457924a8b1f05bd93fc677c656be78f7 Mon Sep 17 00:00:00 2001 From: zagto Date: Tue, 20 Sep 2022 18:48:22 +0200 Subject: [PATCH 10/12] Update docs/source/format/Columnar.rst Co-authored-by: Andrew Lamb --- docs/source/format/Columnar.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 063bbfb90e0..71493b3b8ad 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -775,7 +775,7 @@ same value, called runs. Each run is represented as a value, and an integer describing how often this value is repeated. Any array can be run-length encoded. A run-length encoded array has no buffers -by itself, but has two child arrays. The first one holds a signed 32-bit integer +by itself, but has two child arrays. The first one holds a signed 32-bit integer called a "run end" for each run. The actual values of each run are held the second child array. 
The values in the first child array represent the length of each run. They do From 7211923ded8f1164105a1fc2233175230cb16922 Mon Sep 17 00:00:00 2001 From: zagto Date: Tue, 20 Sep 2022 18:48:55 +0200 Subject: [PATCH 11/12] Update docs/source/format/Columnar.rst Co-authored-by: Andrew Lamb --- docs/source/format/Columnar.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 71493b3b8ad..e98f4075e77 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -785,7 +785,7 @@ current run ends. This allows relatively efficient random access from a logical index using binary search. The length of an individual run can be determined by subtracting two adjacent values. -A run has to have a length of at least 1. This means the values in the +A run must have a length of at least 1. This means the values in the run ends array all positive and in strictly ascending order. A run end cannot be null. From 86e12d18d6fb020e1e69e669974642e13a6cbc92 Mon Sep 17 00:00:00 2001 From: Tobias Zagorni Date: Tue, 29 Nov 2022 19:19:45 +0100 Subject: [PATCH 12/12] Columnar doc: mention different bit-widths --- docs/source/format/Columnar.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index e98f4075e77..73a34e57873 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -775,8 +775,10 @@ same value, called runs. Each run is represented as a value, and an integer describing how often this value is repeated. Any array can be run-length encoded. A run-length encoded array has no buffers -by itself, but has two child arrays. The first one holds a signed 32-bit integer called a "run end" -for each run. The actual values of each run are held the second child array. +by itself, but has two child arrays. 
The first one holds a signed integer +called a "run end" for each run. The run ends array can hold either 16, 32, or +64-bit integers. The actual values of each run are held +the second child array. The values in the first child array represent the length of each run. They do not hold the length of the respective run directly, but the accumulated length