diff --git a/.gitignore b/.gitignore index aa67d3d37a..2ef152e04a 100644 --- a/.gitignore +++ b/.gitignore @@ -13,7 +13,6 @@ target *.orig *.rej dependency-reduced-pom.xml -parquet-scrooge/.cache .idea/* target/ .cache diff --git a/README.md b/README.md index ab8bd7b0d5..7d9046b0d1 100644 --- a/README.md +++ b/README.md @@ -66,10 +66,10 @@ Parquet is a very active project, and new features are being added quickly. Here * Type-specific encoding * Hive integration (deprecated) * Pig integration -* Cascading integration +* Cascading integration (deprecated) * Crunch integration * Apache Arrow integration -* Apache Scrooge integration +* Scrooge integration (deprecated) * Impala integration (non-nested) * Java Map/Reduce API * Native Avro support @@ -92,7 +92,7 @@ Note that to use an Input or Output format, you need to implement a WriteSupport We've implemented this for 2 popular data formats to provide a clean migration path as well: ### Thrift -Thrift integration is provided by the [parquet-thrift](https://github.com/apache/parquet-mr/tree/master/parquet-thrift) sub-project. If you are using Thrift through Scala, you may be using Twitter's [Scrooge](https://github.com/twitter/scrooge). If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the [parquet-scrooge](https://github.com/apache/parquet-mr/tree/master/parquet-scrooge) sub-project. +Thrift integration is provided by the [parquet-thrift](https://github.com/apache/parquet-mr/tree/master/parquet-thrift) sub-project. ### Avro Avro conversion is implemented via the [parquet-avro](https://github.com/apache/parquet-mr/tree/master/parquet-avro) sub-project. diff --git a/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java b/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java index 6bad970dec..2375a6df6c 100644 --- a/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java +++ b/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java @@ -67,8 +67,7 @@ public class ThriftReadSupport extends ReadSupport { /** * A {@link ThriftRecordConverter} builds an object by working with {@link TProtocol}. The default * implementation creates standard Apache Thrift {@link TBase} objects; to support alternatives, such - * as Twiter's Scrooge, a custom converter can be specified using this key - * (for example, ScroogeRecordConverter from parquet-scrooge). + * as Twiter's Scrooge, a custom converter can be specified using this key. */ private static final String RECORD_CONVERTER_CLASS_KEY = "parquet.thrift.converter.class"; @@ -77,8 +76,7 @@ public class ThriftReadSupport extends ReadSupport { /** * A {@link ThriftRecordConverter} builds an object by working with {@link TProtocol}. The default * implementation creates standard Apache Thrift {@link TBase} objects; to support alternatives, such - * as Twiter's Scrooge, a custom converter can be specified - * (for example, ScroogeRecordConverter from parquet-scrooge). + * as Twiter's Scrooge, a custom converter can be specified. * * @param conf a mapred jobconf * @param klass a thrift class @@ -93,8 +91,7 @@ public static void setRecordConverterClass(JobConf conf, /** * A {@link ThriftRecordConverter} builds an object by working with {@link TProtocol}. The default * implementation creates standard Apache Thrift {@link TBase} objects; to support alternatives, such - * as Twiter's Scrooge, a custom converter can be specified - * (for example, ScroogeRecordConverter from parquet-scrooge). + * as Twiter's Scrooge, a custom converter can be specified. * * @param conf a configuration * @param klass a thrift class diff --git a/parquet_cascading.md b/parquet_cascading.md deleted file mode 100644 index 0eeaceb135..0000000000 --- a/parquet_cascading.md +++ /dev/null @@ -1,163 +0,0 @@ - - -Parquet Cascading Integration -============================= - -This document details the support of reading and writing parquet format from cascading. - -1. Read and Write -============== - -In [parquet-cascading](https://github.com/apache/parquet-mr/tree/master/parquet-cascading) sub-module, we provide support for reading/writing records of various data structures including Thrift(TBase), Scrooge and Tuples. Please refer to following sections for each data structures. - -1.1 Thrift/TBase ------------- -### Read Thrift Records from Parquet -[ParquetTbaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTBaseScheme.java) is the interface for reading thrift records in Parquet format. Providing a ParquetTbaseScheme as a parameter to the constructor of a source enables the program to read Thrift object(TBase), eg. - -` -Scheme sourceScheme = new ParquetTBaseScheme(Name.class) -Tap source = new Hfs(sourceScheme, parquetInputPath); -` - -In the above example Name is a thrift class that extends TBase. Under the hood parquet will generate a schema from the thrift class to decode the data. - -The thrift class is actually *optional* to initialize a ParquetTBaseScheme when the data is written as Thrift records in Parquet. When writing thrift records to parquet format, the Thrift class of the records is stored as meta-data in the footer of the parquet file. Therefore when reading the file, if a thrift class is not explicitly provided, Parquet will use the class name stored in the footer as the thrift class. - -### Write Thrift Records to Parquet -[ParquetTbaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTBaseScheme.java) can also be used by a sink. When used as a sink, the Thrift class of the records being written must be *explicitly* provided. - -` -Scheme sinkScheme = new ParquetTBaseScheme(Name.class); -Tap sink = new Hfs(sinkScheme, parquetOutputPath); -` - -For more concrete examples please refer to [TestParquetTBaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/test/java/org/apache/parquet/cascading/TestParquetTBaseScheme.java) - -1.2 Scrooge ------------ -### Read Scrooge records from Parquet -Scrooge support is defined in a separate module called [parquet-scrooge](https://github.com/apache/parquet-mr/tree/master/parquet-scrooge). With [ParquetScroogeScheme](https://github.com/apache/parquet-mr/blob/master/parquet-scrooge/src/main/java/org/apache/parquet/scrooge/ParquetScroogeScheme.java), data can be read in the form of Scrooge objects which are more scala friendly. - -` -Scheme sinkScheme = new ParquetScroogeScheme(Name.class); -Tap sink = new Hfs(sinkScheme, parquetOutputPath); -` - -### Write Scrooge Records to Parquet(Not supported yet) - -1.3 Tuples ----------- -### Read Cascading Tuples from Parquet -Currently, the support for reading tuples is mainly(but not limited) for data written from pig scripts as pig tuples. More comprehensive support will be added, but in the mean time, there are some limitations to notice: Nested structures are not supported. If the data is written as thrift objects which have nested structure, it can not be read at current time. *Data to read must be in flat structure*. To read data as tuples, simply use [ParquetTupleScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTupleScheme.java): - -` -Scheme sourceScheme = new ParquetTupleScheme(new Fields("last_name")); -Tap source = new Hfs(sourceScheme, parquetInputPath); -` - -### Write Cascading Tuples to Parquet(coming soon) - -For more examples please refer to [TestParquetTupleScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/test/java/org/apache/parquet/cascading/TestParquetTupleScheme.java) - -2. Projection Pushdown -====================== -One of the big benefit of using columnar format is to be able to read only a subset of columns when the full schema is huge. It saves times by not reading unused columns. - -Parquet support projection pushdown for Thrift records and tuples. - -### 2.1 Projection Pushdown with Thrift/Scrooge Records -To read only a subset of columns in a Thrift/Scrooge class, the columns of interest should be specified using a glob syntax. - -For example, imagine a Person struct defined as: - - struct Person { - 1: required string name - 2: optional int16 age - 3: optional Address primaryAddress - 4: required map otherAddresses - } - - struct Address { - 1: required string street - 2: required string zip - 3: required PhoneNumber primaryPhone - 4: required PhoneNumber secondaryPhone - 4: required list otherPhones - } - - struct PhoneNumber { - 1: required i32 areaCode - 2: required i32 number - 3: required bool doNotCall - } - -A column is specified as the path from the root of the schema down to the field of interest, separated by `.`, just as you would access the field -in java or scala code. For example: `primaryAddress.primaryPhone.doNotCall`. -This applies for repeated fields as well, for example `primaryAddress.otherPhones.number` selects all the `number`s from all the elements of `otherPhones`. -Maps are a special case -- the map is split into two columns, the key and the value. All the columns in the key are required, but you can select a subset of the -columns in the value (or skip the value entirely), for example: `otherAddresses.{key,value.street}` will select only the streets from the -values of the map, but the entire key will be kept. To select an entire map, you can do: `otherAddresses.{key,value}`, -and to select only the keys: `otherAddresses.key`. Similar to map keys, the values in a set cannot be partially projected, -you must select all the columns of the items in the set, or none of them. This is because materializing the set wouldn't make much sense if the item's -hashcode is dependent on the dropped columns (as with the key of a map). - -When selecting a field that is a struct, for example `primaryAddress.primaryPhone`, -it will select the entire struct. So `primaryAddress.primaryPhone.*` is redundant. - -Columns can be specified concretely (like `primaryAddress.primaryPhone.doNotCall`), or a restricted glob syntax can be used. -The glob syntax supports only wildcards (`*`) and glob expressions (`{}`). - -For example: - - * `name` will select just the `name` from the Person - * `{name,age}` will select both the `name` and `age` from the Person - * `primaryAddress` will select the entire `primaryAddress` struct, including all of its children (recursively) - * `primaryAddress.*Phone` will select all of `primaryAddress.primaryPhone` and `primaryAddress.secondaryPhone` - * `primaryAddress.*Phone*` will select all of `primaryAddress.primaryPhone` and `primaryAddress.secondaryPhone` and `primaryAddress.otherPhones` - * `{name,age,primaryAddress.{*Phone,street}}` will select `name`, `age`, `primaryAddress.primaryPhone`, `primaryAddress.secondaryPhone`, and `primaryAddress.street` - -Multiple Patterns: -Multiple glob expression can be joined together separated by ";". eg. `name;primaryAddress.street` will match only name and street in Address. -This is useful if you want to combine a list of patterns without making a giant `{}` group. - -Note: all possible glob patterns must match at least one column. For example, if you provide the glob: `a.b.{c,d,e}` but only columns `a.b.c` and `a.b.d` exist, an -exception will be thrown. - -You can provide your projection globs to parquet by setting `parquet.thrift.column.projection.globs` in the hadoop config, or using the methods in the -scheme builder classes. - -### 2.2 Projection Pushdown with Tuples -When using ParquetTupleScheme, specifying projection pushdown is as simple as specifying fields as the parameter of the constructor of ParquetTupleScheme: - - -3. Cascading 2.0 & Cascading 3.0 -================================ -Cascading 3.0 introduced a breaking interface change in the Scheme abstract class, which causes a breaking change in all scheme implementations. -The parquet-cascading3 directory contains a separate library for use with Cascading 3.0 - -A significant part of the code remains identical; this shared part is in the parquet-cascading-common23 directory, which is not a Maven module. - -You cannot use both parquet-cascading and parquet-cascading3 in the same Classloader, which should be fine as you cannot use both cascading-core 2.x and cascading-core 3.x in the same Classloader either. - - - - -`Scheme sourceScheme = new ParquetTupleScheme(new Fields("age"));` diff --git a/pom.xml b/pom.xml index 8c3cf66081..6ce8520aa5 100644 --- a/pom.xml +++ b/pom.xml @@ -77,8 +77,6 @@ 0.14.2 shaded.parquet 2.10.1 - 2.7.1 - 3.1.2 2.9.0 1.12.0 thrift @@ -461,7 +459,6 @@ **/*.parquet **/*.avro **/*.json - **/names.txt **/*.avsc **/*.iml **/*.log