From d8d49c3f103d5f8103a964afd86480d7ed266447 Mon Sep 17 00:00:00 2001
From: Weston Pace
Date: Thu, 10 Jun 2021 11:47:49 -1000
Subject: [PATCH 1/5] ARROW-13036: Added recommendations on file extension. Fixed typo (build -> built).

---
 docs/source/format/Columnar.rst | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 102c3a73317..c9d9ceb864e 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1006,19 +1006,21 @@ message flatbuffer is read, you can then read the message body.
 
 The stream writer can signal end-of-stream (EOS) either by writing 8 bytes
 containing the 4-byte continuation indicator (``0xFFFFFFFF``) followed by 0
-metadata length (``0x00000000``) or closing the stream interface.
+metadata length (``0x00000000``) or closing the stream interface. We
+recommend the ".arrows" file extension for the streaming format although
+in many cases these streams will not ever be stored as files.
 
 IPC File Format
 ---------------
 
-We define a "file format" supporting random access that is build with
-the stream format. The file starts and ends with a magic string
-``ARROW1`` (plus padding). What follows in the file is identical to
-the stream format. At the end of the file, we write a *footer*
-containing a redundant copy of the schema (which is a part of the
-streaming format) plus memory offsets and sizes for each of the data
-blocks in the file. This enables random access any record batch in the
-file. See `File.fbs`_ for the precise details of the file footer.
+We define a "file format" supporting random access that is built with
+the stream format. We recommend the ".arrow" extension for files. The
+file starts and ends with a magic string ``ARROW1`` (plus padding). What
+follows in the file is identical to the stream format. At the end of the
+file, we write a *footer* containing a redundant copy of the schema (which
+is a part of the streaming format) plus memory offsets and sizes for each
+of the data blocks in the file. This enables random access any record batch
+in the file. See `File.fbs`_ for the precise details of the file footer.
 
 Schematically we have:
 ::

From 43ebde7c6c26f34eaf4549ea5a23a95cbb6bc5ec Mon Sep 17 00:00:00 2001
From: Weston Pace
Date: Tue, 15 Jun 2021 09:33:17 -1000
Subject: [PATCH 2/5] ARROW-13036: Addressing PR comments.

---
 docs/source/format/Columnar.rst | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index c9d9ceb864e..745dab07501 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1013,14 +1013,15 @@ in many cases these streams will not ever be stored as files.
 IPC File Format
 ---------------
 
-We define a "file format" supporting random access that is built with
-the stream format. We recommend the ".arrow" extension for files. The
-file starts and ends with a magic string ``ARROW1`` (plus padding). What
-follows in the file is identical to the stream format. At the end of the
-file, we write a *footer* containing a redundant copy of the schema (which
-is a part of the streaming format) plus memory offsets and sizes for each
-of the data blocks in the file. This enables random access any record batch
-in the file. See `File.fbs`_ for the precise details of the file footer.
+We define a "file format" supporting random access that is an extension of
+the stream format. The file starts and ends with a magic string ``ARROW1``
+(plus padding). What follows in the file is identical to the stream format.
+At the end of the file, we write a *footer* containing a redundant copy of
+the schema (which is a part of the streaming format) plus memory offsets and
+sizes for each of the data blocks in the file. This enables random access any
+record batch in the file. We recommend the ".arrow" extension for files
+created with this format. See `File.fbs`_ for the precise details of the file
+footer.
 
 Schematically we have:
 ::

From c4f8461326235f459d5f160a9d23d95b08ed5fbb Mon Sep 17 00:00:00 2001
From: Weston Pace
Date: Tue, 15 Jun 2021 09:35:18 -1000
Subject: [PATCH 3/5] ARROW-13036: One last small tweak.

---
 docs/source/format/Columnar.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 745dab07501..ad897a8c20b 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1018,10 +1018,10 @@ the stream format. The file starts and ends with a magic string ``ARROW1``
 (plus padding). What follows in the file is identical to the stream format.
 At the end of the file, we write a *footer* containing a redundant copy of
 the schema (which is a part of the streaming format) plus memory offsets and
-sizes for each of the data blocks in the file. This enables random access any
-record batch in the file. We recommend the ".arrow" extension for files
-created with this format. See `File.fbs`_ for the precise details of the file
-footer.
+sizes for each of the data blocks in the file. This enables random access to
+any record batch in the file. See `File.fbs`_ for the precise details of the
+file footer. We recommend the ".arrow" extension for files created with this
+format.
 
 Schematically we have:
 ::

From 3b5aa44cb9a4db23afe8f002e89b5d772294aa8c Mon Sep 17 00:00:00 2001
From: Weston Pace
Date: Tue, 15 Jun 2021 09:38:00 -1000
Subject: [PATCH 4/5] ARROW-13036: Still trying to get the right spot for the new sentence.

---
 docs/source/format/Columnar.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index ad897a8c20b..7db11ea7be5 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1020,8 +1020,7 @@ At the end of the file, we write a *footer* containing a redundant copy of
 the schema (which is a part of the streaming format) plus memory offsets and
 sizes for each of the data blocks in the file. This enables random access to
 any record batch in the file. See `File.fbs`_ for the precise details of the
-file footer. We recommend the ".arrow" extension for files created with this
-format.
+file footer.
 
 Schematically we have:
 ::
@@ -1038,7 +1037,8 @@ should be defined in a ``DictionaryBatch`` before they are used in a
 file. Further more, it is invalid to have more than one **non-delta**
 dictionary batch per dictionary ID (i.e. dictionary replacement is not
 supported). Delta dictionaries are applied in the order they appear in
-the file footer.
+the file footer. We recommend the ".arrow" extension for files created with
+this format.
 
 Dictionary Messages
 -------------------

From c85adff2a431fb8390addecc938ea53115fdd9f9 Mon Sep 17 00:00:00 2001
From: Weston Pace
Date: Tue, 15 Jun 2021 09:39:22 -1000
Subject: [PATCH 5/5] ARROW-13036: Changed to one ' ' after the '.' for consistency with the rest of the file

---
 docs/source/format/Columnar.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 7db11ea7be5..52920a49b35 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1006,7 +1006,7 @@ message flatbuffer is read, you can then read the message body.
 
 The stream writer can signal end-of-stream (EOS) either by writing 8 bytes
 containing the 4-byte continuation indicator (``0xFFFFFFFF``) followed by 0
-metadata length (``0x00000000``) or closing the stream interface. We
+metadata length (``0x00000000``) or closing the stream interface. We
 recommend the ".arrows" file extension for the streaming format although
 in many cases these streams will not ever be stored as files.
 
@@ -1014,7 +1014,7 @@ IPC File Format
 ---------------
 
 We define a "file format" supporting random access that is an extension of
-the stream format. The file starts and ends with a magic string ``ARROW1``
+the stream format. The file starts and ends with a magic string ``ARROW1``
 (plus padding). What follows in the file is identical to the stream format.
 At the end of the file, we write a *footer* containing a redundant copy of
 the schema (which is a part of the streaming format) plus memory offsets and
@@ -1036,8 +1036,8 @@ should be defined in a ``DictionaryBatch`` before they are used in a
 ``RecordBatch``, as long as the keys are defined somewhere in the
 file. Further more, it is invalid to have more than one **non-delta**
 dictionary batch per dictionary ID (i.e. dictionary replacement is not
-supported). Delta dictionaries are applied in the order they appear in
-the file footer. We recommend the ".arrow" extension for files created with
+supported). Delta dictionaries are applied in the order they appear in
+the file footer. We recommend the ".arrow" extension for files created with
 this format.
 
 Dictionary Messages