Enhance orc-extensions - use orc file schema #7282

Closed
es1220 wants to merge 1 commit into apache:master from es1220:feature-enhance-orc-extensions

@es1220
Contributor

@es1220 es1220 commented Mar 18, 2019

orc-extensions uses a custom struct typeString, which is either set in the user configuration or generated automatically by the Druid parser.

typeString is fragile and error-prone (for example, a wrong column order or column type).

So I created DruidOrcNewInputFormat and a druid_orc parser type.
Now, by changing only the inputFormat and the parser type, you can ingest ORC files as easily as with parquet-extensions, without any typeString errors.

  • DruidOrcNewInputFormat
    • holds an OrcNewInputFormat
    • creates a DruidOrcRecordReader and stores the file schema
  • DruidOrcRecordReader
    • converts an OrcStruct to a Map<String, Object> using the stored file schema.
      (This moves the existing logic out of OrcHadoopInputRowParser.)
  • DruidOrcHadoopInputRowParser
    • converts the Map to a MapBasedInputRow.
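
The data flow in the list above can be sketched in plain Java. This is only an illustration, not the code in this PR: it does not use the Hive or Druid classes, and `SchemaDrivenRowSketch`, `toMap`, `toRow`, and `Row` are hypothetical names. It assumes the file schema can be reduced to an ordered list of field names and that the timestamp column holds epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaDrivenRowSketch {

    /** Hypothetical stand-in for MapBasedInputRow: timestamp, dimension names, raw event. */
    public static final class Row {
        public final long timestampMillis;
        public final List<String> dimensions;
        public final Map<String, Object> event;

        Row(long timestampMillis, List<String> dimensions, Map<String, Object> event) {
            this.timestampMillis = timestampMillis;
            this.dimensions = dimensions;
            this.event = event;
        }
    }

    /**
     * Record-reader step: pair each value of a struct with the field name taken
     * from the schema stored by the input format, so no typeString is needed.
     */
    public static Map<String, Object> toMap(List<String> fieldNames, List<?> values) {
        if (fieldNames.size() != values.size()) {
            throw new IllegalArgumentException("schema and struct have different field counts");
        }
        Map<String, Object> event = new LinkedHashMap<>(); // keeps the file's column order
        for (int i = 0; i < fieldNames.size(); i++) {
            event.put(fieldNames.get(i), values.get(i));
        }
        return event;
    }

    /** Parser step: turn the map into a row, treating every non-timestamp column as a dimension. */
    public static Row toRow(Map<String, Object> event, String timestampColumn) {
        Object ts = event.get(timestampColumn);
        if (!(ts instanceof Number)) {
            throw new IllegalArgumentException("missing or non-numeric timestamp: " + timestampColumn);
        }
        List<String> dimensions = new ArrayList<>(event.keySet());
        dimensions.remove(timestampColumn);
        return new Row(((Number) ts).longValue(), dimensions, event);
    }

    public static void main(String[] args) {
        List<String> schema = List.of("ts", "page", "added");      // assumed file schema
        List<Object> struct = List.of(1552867200000L, "home", 3);  // assumed struct values
        Row row = toRow(toMap(schema, struct), "ts");
        System.out.println(row.dimensions + " @ " + row.timestampMillis);
    }
}
```

Because the field names come from the file itself, a column-order or type mismatch between a hand-written typeString and the data cannot occur in this flow.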

@clintropolis
Member

Oops, it looks like we have duplicated some effort (see proposal #7134 and associated PR #7138), but at least it looks like we have the same goal in mind, eliminating usage of typeString 😅.

I am no doubt 100% biased on this, but I like my approach a bit better because it supports flattenSpec like JSON, Avro, and Parquet, and uses the org.apache ORC libraries instead of the Hive library, which bundles a lot of unnecessary dependencies. My approach might also make it easier to replicate the column renaming functionality that typeString provided, by using flattening expressions, which could make it easier to maintain a Druid ingestion schema when migrating from the current version of the ORC extension to the one provided in my PR. It would appear that my PR would also be compatible with what you are trying to achieve (I also support the timeAndDims format parseSpec), but could you have a look and see if my approach would work for your use case?

@es1220
Contributor Author

es1220 commented Mar 26, 2019

Your approach works well in my case. Thanks for the reply.

I'll close my PR #7282.

@es1220 es1220 closed this Mar 26, 2019
