Skip to content

Conversation

@TheNeuralBit
Copy link
Member

This implementation only supports reading arrow files at the moment - but it works in node and in the browser.

I've implemented basic numeric types as well as Utf8, Date, List, and Struct. I included a couple of example node scripts, one that just dumps an arrow file's schema, and another that converts it to a (very poorly formatted) CSV. I also included an example of reading an arrow file in the browser, currently hosted here: https://theneuralbit.github.io/arrow/ after selecting an arrow file, that page should display the file contents in a table.

So far I've only tested it with this file: https://keybase.pub/hulettbh/example-csv.arrow, generated by my colleague @elahrvivaz

class NullableInt32Vector extends NullableSimpleVector<Uint32Array> { constructor(name: string) { super(Uint32Array, name); }; }
class NullableFloat32Vector extends NullableSimpleVector<Float32Array> { constructor(name: string) { super(Float32Array, name); }; }
class NullableFloat64Vector extends NullableSimpleVector<Float64Array> { constructor(name: string) { super(Float64Array, name); }; }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really hate creating Nullable classes this way - but I couldn't figure out a better way. I'm definitely open to suggestions for improving this.

apache: {
arrow: any
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a WIP PR for adding TypeScript support to flatbuffers (#4065), which should make this definition unnecessary.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see a first approach. To better review it, I need to get a bit more familiar with TypeScript but also left some initial comments.

The main issue (which also Travis) has is that we are missing Apache license header in all files. That's a small policy requirements that needs to be taken care of.

Otherwise if you have the time, it might make sense to push this change to gerrit to better review it as it is a larger one. @wesm should be able to help you there.

<title>arrow.js browser test</title>
<meta charset="utf-8">
<style>
/* adapted from http://codepen.io/brennowilliam/pen/PwJoPq */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the license on the original code snippet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to review more closely on GitHub, but only after resolving the IP lineage problem. It seems like this link only involves CSS, is there anything else? If that's it then I recommend removing the offending code or replacing it with something with a known ASL 2.0-compatible license

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is only referring to the CSS, I'll just take it out for now

@@ -0,0 +1,84 @@
<!DOCTYPE html>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this into an examples/ folder?

import { vectorFromField, Vector } from './types';

export function loadVectors(buf) {
var fileLength = buf.length, bb, footerLengthOffset, footerLength,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expects an Arrow file in the "random access" format (the other one currently available is the streaming format, see http://wesmckinney.com/blog/arrow-streaming-columnar/). It should be at least documented here that this is expected, better would be if we could include it in the name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could move both of these functions into an arrow.file module, and similar functions could be implemented in arrow.stream in the future. Or perhaps it should be arrow.io.[file,stream] to allow for other modules.

return footer;
}

function _arrayToInt(bb, offset) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this Int64FromBytebuffer or similar. First I got confused and thought this was little to big-endian conversion.

@wesm
Copy link
Member

wesm commented Mar 10, 2017

sorry that I've been a bit delayed in review, I should get to this this weekend. I'm actually pretty excited to contribute to this implementation and learn TypeScript in the process

@wesm
Copy link
Member

wesm commented Mar 10, 2017

Can you open a JIRA and prefix the PR title with ARROW-XXX:?

@TheNeuralBit TheNeuralBit changed the title [WIP] Basic TypeScript Implementation ARROW-613: WIP TypeScript Implementation Mar 10, 2017
@TheNeuralBit
Copy link
Member Author

Sounds good @wesm - I've actually got more time to look at this on weekends anyway

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK as a starting point. Suffice to say that we cannot proceed any further without beginning a unit test suite. If we do any more refactoring without unit tests we'll just be chasing our tails

I would be fine with merging this in its current state and beginning to enumerate work to build out the implementation and test suite. @xhochy?

class NullableInt32Vector extends NullableSimpleVector<Uint32Array> { constructor(name: string) { super(Uint32Array, name); }; }
class NullableFloat32Vector extends NullableSimpleVector<Float32Array> { constructor(name: string) { super(Float32Array, name); }; }
class NullableFloat64Vector extends NullableSimpleVector<Float64Array> { constructor(name: string) { super(Float64Array, name); }; }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's definitely get rid of these, doesn't need to be done in this patch though. Whether a field is not nullable or has null count zero, we can skip interacting with the validity bitmap. In C++, for non-nullable fields the bitmap is null so any calls to IsNull always return false (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L92).

We're going to want to implement the visitor pattern for loading and unloading the arrays (or vectors as they're called here and in Java) in TypeScript. see https://github.com/torokmark/design_patterns_in_typescript/blob/master/visitor/visitor.ts for an example of visitor pattern in TS

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid doing an IsNull check on every access for Non-nullable types. I have a (possibly superstitious) belief that any extra work performed on an access will harm performance in Javascript. A benchmark could convince me otherwise though.

👍 for the visitor pattern. Looks like a good approach

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be right about the performance, but OTTH non-nullable data occurs less frequently in the wild, so the performance benefits will be more seldom reaped. It will be useful to have benchmarks eventually vs. iterating through JSON data


class SimpleVector<T extends ArrayView> extends Vector {
protected dataView: T;
private TypedArray: {new(buffer: any, offset: number, length: number) : T, BYTES_PER_ELEMENT: number};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on TypedArray (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray) in general is going to be very limiting. I don't have a complete view of all the tools available in JS for dealing with binary data. TypedArray classes can be used internally as a convenience to access scalar values of primitive types.

Does the Node / JS Buffer type have the same semantics as the ArrowBuf and arrow::Buffer objects in Java and C++ respectively (reference-counted, zero-copy)? That would be really convenient if so

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know TypedArray is the only game in town for dealing with binary data in the browser efficiently. We could certainly use some kind of library that wraps TypedArray though - to get the same interface as the Node Buffer type we could investigate something like this: https://github.com/feross/buffer

TypedArray and Buffer are both zero-copy (in that you can construct a new one with just a reference to an ArrayBuffer and an offset), but I dont think either one implements reference-counting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheNeuralBit sorry, when I said "relying on TypedArray" I meant exposing it in the public API (e.g. how will strings work)? I agree that it's the tool to use for interacting with binary buffers

@xhochy
Copy link
Member

xhochy commented Mar 12, 2017

@wesm looks fine but we'll definitely need tests in the next step. Go ahead with merging.

@wesm
Copy link
Member

wesm commented Mar 12, 2017

Sounds good. thanks!

@asfgit asfgit closed this in 492b3d4 Mar 12, 2017
@wesm
Copy link
Member

wesm commented Mar 12, 2017

While I hate to make a bikeshed comment, we might also want to rename the top level directory to "js" from "javascript" (seems to be the norm in some other projects)

jeffknupp pushed a commit to jeffknupp/arrow that referenced this pull request Mar 15, 2017
This implementation only supports reading arrow files at the moment - but it works in node and in the browser.

I've implemented basic numeric types as well as Utf8, Date, List, and Struct. I included a couple of example node scripts, one that just dumps an arrow file's schema, and another that converts it to a (very poorly formatted) CSV. I also included an example of reading an arrow file in the browser, currently hosted here: https://theneuralbit.github.io/arrow/ after selecting an arrow file, that page _should_ display the file contents in a table.

So far I've only tested it with this file: https://keybase.pub/hulettbh/example-csv.arrow, generated by my colleague @elahrvivaz

Author: Brian Hulette <hulettbh@gmail.com>
Author: Brian Hulette <brian.hulette@ccri.com>

Closes apache#358 from TheNeuralBit/javascript and squashes the following commits:

74e8520 [Brian Hulette] added a few more license headers
ce81034 [Brian Hulette] Cleaned up TextDecoder/Utf8Vector
3f0d9f0 [Brian Hulette] Added missing runtime dependency
a5800d9 [Brian Hulette] Added docstrings for Vector functions
3839485 [Brian Hulette] Removed tabs
1f6dcf3 [Brian Hulette] Renamed _arrayToInt
3a92bdd [Brian Hulette] Added ASF Licence headers
8092810 [Brian Hulette] Moved index.html to examples/ directory
0f43270 [Brian Hulette] Replaced table style with an original, basic one
c221f74 [Brian Hulette] Create README.md
71e72df [Brian Hulette] Added support for the browser via webpack
00bb974 [Brian Hulette] Initial typescript implementation
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 8, 2018
Closes apache#358. This only includes an Arrow version bump to pick up ARROW-1143

Author: Uwe L. Korn <uwe@apache.org>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#360 from wesm/PARQUET-1041 and squashes the following commits:

3f8f0bc [Wes McKinney] Bump Arrow version to master
c134bd6 [Uwe L. Korn] Fix int conversion
1def8a4 [Uwe L. Korn] PARQUET-1041: Support Arrow's NullArray

Change-Id: I4ea8b9055e5b625bc1e65125a0d4e24b1cb06b7a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants