-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-613: WIP TypeScript Implementation #358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Currently only supports reading files with node
| class NullableInt32Vector extends NullableSimpleVector<Uint32Array> { constructor(name: string) { super(Uint32Array, name); }; } | ||
| class NullableFloat32Vector extends NullableSimpleVector<Float32Array> { constructor(name: string) { super(Float32Array, name); }; } | ||
| class NullableFloat64Vector extends NullableSimpleVector<Float64Array> { constructor(name: string) { super(Float64Array, name); }; } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really hate creating Nullable classes this way - but I couldn't figure out a better way. I'm definitely open to suggestions for improving this.
| apache: { | ||
| arrow: any | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a WIP PR for adding TypeScript support to flatbuffers (#4065), which should make this definition unnecessary.
xhochy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to see a first approach. To better review it, I need to get a bit more familiar with TypeScript but also left some initial comments.
The main issue (which also Travis) has is that we are missing Apache license header in all files. That's a small policy requirements that needs to be taken care of.
Otherwise if you have the time, it might make sense to push this change to gerrit to better review it as it is a larger one. @wesm should be able to help you there.
javascript/index.html
Outdated
| <title>arrow.js browser test</title> | ||
| <meta charset="utf-8"> | ||
| <style> | ||
| /* adapted from http://codepen.io/brennowilliam/pen/PwJoPq */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the license on the original code snippet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to review more closely on GitHub, but only after resolving the IP lineage problem. It seems like this link only involves CSS, is there anything else? If that's it then I recommend removing the offending code or replacing it with something with a known ASL 2.0-compatible license
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is only referring to the CSS, I'll just take it out for now
| @@ -0,0 +1,84 @@ | |||
| <!DOCTYPE html> | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this into an examples/ folder?
| import { vectorFromField, Vector } from './types'; | ||
|
|
||
| export function loadVectors(buf) { | ||
| var fileLength = buf.length, bb, footerLengthOffset, footerLength, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This expects an Arrow file in the "random access" format (the other one currently available is the streaming format, see http://wesmckinney.com/blog/arrow-streaming-columnar/). It should be at least documented here that this is expected, better would be if we could include it in the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could move both of these functions into an arrow.file module, and similar functions could be implemented in arrow.stream in the future. Or perhaps it should be arrow.io.[file,stream] to allow for other modules.
javascript/lib/arrow.ts
Outdated
| return footer; | ||
| } | ||
|
|
||
| function _arrayToInt(bb, offset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this Int64FromBytebuffer or similar. First I got confused and thought this was little to big-endian conversion.
|
sorry that I've been a bit delayed in review, I should get to this this weekend. I'm actually pretty excited to contribute to this implementation and learn TypeScript in the process |
|
Can you open a JIRA and prefix the PR title with |
|
Sounds good @wesm - I've actually got more time to look at this on weekends anyway |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK as a starting point. Suffice to say that we cannot proceed any further without beginning a unit test suite. If we do any more refactoring without unit tests we'll just be chasing our tails
I would be fine with merging this in its current state and beginning to enumerate work to build out the implementation and test suite. @xhochy?
| class NullableInt32Vector extends NullableSimpleVector<Uint32Array> { constructor(name: string) { super(Uint32Array, name); }; } | ||
| class NullableFloat32Vector extends NullableSimpleVector<Float32Array> { constructor(name: string) { super(Float32Array, name); }; } | ||
| class NullableFloat64Vector extends NullableSimpleVector<Float64Array> { constructor(name: string) { super(Float64Array, name); }; } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's definitely get rid of these, doesn't need to be done in this patch though. Whether a field is not nullable or has null count zero, we can skip interacting with the validity bitmap. In C++, for non-nullable fields the bitmap is null so any calls to IsNull always return false (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L92).
We're going to want to implement the visitor pattern for loading and unloading the arrays (or vectors as they're called here and in Java) in TypeScript. see https://github.com/torokmark/design_patterns_in_typescript/blob/master/visitor/visitor.ts for an example of visitor pattern in TS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to avoid doing an IsNull check on every access for Non-nullable types. I have a (possibly superstitious) belief that any extra work performed on an access will harm performance in Javascript. A benchmark could convince me otherwise though.
👍 for the visitor pattern. Looks like a good approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be right about the performance, but OTTH non-nullable data occurs less frequently in the wild, so the performance benefits will be more seldom reaped. It will be useful to have benchmarks eventually vs. iterating through JSON data
|
|
||
| class SimpleVector<T extends ArrayView> extends Vector { | ||
| protected dataView: T; | ||
| private TypedArray: {new(buffer: any, offset: number, length: number) : T, BYTES_PER_ELEMENT: number}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on TypedArray (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray) in general is going to be very limiting. I don't have a complete view of all the tools available in JS for dealing with binary data. TypedArray classes can be used internally as a convenience to access scalar values of primitive types.
Does the Node / JS Buffer type have the same semantics as the ArrowBuf and arrow::Buffer objects in Java and C++ respectively (reference-counted, zero-copy)? That would be really convenient if so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know TypedArray is the only game in town for dealing with binary data in the browser efficiently. We could certainly use some kind of library that wraps TypedArray though - to get the same interface as the Node Buffer type we could investigate something like this: https://github.com/feross/buffer
TypedArray and Buffer are both zero-copy (in that you can construct a new one with just a reference to an ArrayBuffer and an offset), but I dont think either one implements reference-counting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheNeuralBit sorry, when I said "relying on TypedArray" I meant exposing it in the public API (e.g. how will strings work)? I agree that it's the tool to use for interacting with binary buffers
|
@wesm looks fine but we'll definitely need tests in the next step. Go ahead with merging. |
|
Sounds good. thanks! |
|
While I hate to make a bikeshed comment, we might also want to rename the top level directory to "js" from "javascript" (seems to be the norm in some other projects) |
This implementation only supports reading arrow files at the moment - but it works in node and in the browser. I've implemented basic numeric types as well as Utf8, Date, List, and Struct. I included a couple of example node scripts, one that just dumps an arrow file's schema, and another that converts it to a (very poorly formatted) CSV. I also included an example of reading an arrow file in the browser, currently hosted here: https://theneuralbit.github.io/arrow/ after selecting an arrow file, that page _should_ display the file contents in a table. So far I've only tested it with this file: https://keybase.pub/hulettbh/example-csv.arrow, generated by my colleague @elahrvivaz Author: Brian Hulette <hulettbh@gmail.com> Author: Brian Hulette <brian.hulette@ccri.com> Closes apache#358 from TheNeuralBit/javascript and squashes the following commits: 74e8520 [Brian Hulette] added a few more license headers ce81034 [Brian Hulette] Cleaned up TextDecoder/Utf8Vector 3f0d9f0 [Brian Hulette] Added missing runtime dependency a5800d9 [Brian Hulette] Added docstrings for Vector functions 3839485 [Brian Hulette] Removed tabs 1f6dcf3 [Brian Hulette] Renamed _arrayToInt 3a92bdd [Brian Hulette] Added ASF Licence headers 8092810 [Brian Hulette] Moved index.html to examples/ directory 0f43270 [Brian Hulette] Replaced table style with an original, basic one c221f74 [Brian Hulette] Create README.md 71e72df [Brian Hulette] Added support for the browser via webpack 00bb974 [Brian Hulette] Initial typescript implementation
Closes apache#358. This only includes an Arrow version bump to pick up ARROW-1143 Author: Uwe L. Korn <uwe@apache.org> Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#360 from wesm/PARQUET-1041 and squashes the following commits: 3f8f0bc [Wes McKinney] Bump Arrow version to master c134bd6 [Uwe L. Korn] Fix int conversion 1def8a4 [Uwe L. Korn] PARQUET-1041: Support Arrow's NullArray Change-Id: I4ea8b9055e5b625bc1e65125a0d4e24b1cb06b7a
This implementation only supports reading arrow files at the moment - but it works in node and in the browser.
I've implemented basic numeric types as well as Utf8, Date, List, and Struct. I included a couple of example node scripts, one that just dumps an arrow file's schema, and another that converts it to a (very poorly formatted) CSV. I also included an example of reading an arrow file in the browser, currently hosted here: https://theneuralbit.github.io/arrow/ after selecting an arrow file, that page should display the file contents in a table.
So far I've only tested it with this file: https://keybase.pub/hulettbh/example-csv.arrow, generated by my colleague @elahrvivaz