-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1463: [JAVA - WIP] refactoring #1164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e90ccd9 to
cfa07e5
Compare
|
Pushed new changes based on comments on design spec. |
|
I like the general direction. would it make sense to have some of the basic getter/setting methods pulled out to the base class with type parameters? or would that be bad for performance... i.e. One of the pain points I've experienced is no consistent get/set methods, so when working with different vectors they all have to be cast to the exact type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can go in base class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get to only one specific transfer pair construction method (possibly protected/internal only)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's avoid skipping over arrowbuf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a base vector method that does the while(realloc) pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's move this out of here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's take these out of the vector classes and have shim classes that can be constructed for Mutator and Accessor if people want them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any chance we can move most of this to a base class and use a generic parameter? Just have the copyValueSafe be per class?
|
@siddharthteotia This looks good so far. What I am concerned with are code duplication and maintainability if we hand write all the vector class. Can you maybe show (or describe) how I think @wesm suggested not having non-nullable/nullable vectors be two separate class to reduce the number of classes. We should discussion more on this. |
|
@elahrvivaz I think boxing/unboxing will be a performance issue if we use generics. So that's not an option here? |
|
nit: Can you reformat the description in markdown? You can do list like this:
|
|
@icexelloss: My preference is also to avoid having separate int and nullable int vectors. Only have NullableIntVector. As far as duplication goes, I think the second commit here seems to be small enough number of things that we could skip the templates and just manage a little more code. @elahrvivaz I'm not entirely clear on how the casting helps with generics. one of the holder methods could be made to follow this pattern. Can you provide an example pattern where the generic setter will help avoid any casting? (Or are you just talking about less casting?) We built the reader/writer interfaces specifically to avoid casting (in cases where people don't use generated code). |
|
should we create a wip-java-refactor where this work can take place without disrupting master? I presume we will require multiple patches |
|
@jacques-n I see. Yeah the Edit: doesn't -> does |
|
@icexelloss, I was saying the NullableIntVector does seem manageable. Were you agreeing or disagreeing? |
|
@jacques-n Ahh, sorry, there was a typo! Yes I agree. |
|
@jacques-n I haven't been able to use the writer framework, as it doesn't support all the operations we're doing (like dictionary encoding and fixed size lists, if I remember correctly). Our main data structure is a GeoTools SimpleFeature, which is basically just an |
|
Wow, this is great so far! I really like that you were able to able to simplify and rid of Is it possible we could rename vector classes too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to get typewidth from fieldtype? Or maybe add a mappng in ArrowType that you could query here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvm, I guess we need the TYPE_WIDTH in the concrete class for get/set methods anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be a good time to remove the MinorTypes and just use FieldTypes - I believe @jacques-n discussed this a while ago, that they don't actually capture all the field information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we could add a getType to return the ArrowType obj, that would be great. Something like what #972 proposed to avoid having to cast would be helpful. But I'd hate to add more work to this, maybe better to discuss after the main stuff is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also found MinorType to be weird. But it's probably hard to remove due to compability? We could consider deprecating those though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to requirement 7 in the requirement doc. Let's keep this out of this PR and discuss separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, best to discuss separately. We will be merging multiple multiple patches to the new branch "java vector refactor" . We can do this once rest of the new code is stabilized, existing functionality is not broken etc.
Later on we can merge this branch into master once refactoring is done or this particular thing can be addressed directly on master once refactor branch has been merged. I am open to ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would that ever be the case? They should always move together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have too many instance objects here. Let's figure out how to reduce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
byte?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid passing field type and storing Field. Field is actually pretty heap memory heavy and can be derived based on other values here.
|
cc @jacques-n , @BryanCutler , @icexelloss , @elahrvivaz Notes on recent changes: Addressed some comments. Some in progress.
|
|
My initial thoughts are to solidify this prototype -- flush out all APIs with complete implementation and tests and address comments. Right now we have concrete implementation for NullableInt and NullableVarChar in this prototype. Can we have a separate project trunk/branch where we could merge patches for this work? Later on we can move the changes to Arrow master branch? As we do this, we also need to think about getting rid of existing concrete implementation of vectors that is code generated. For example, once this prototype patch gets merged, we will need to remove NullableInt and NullableVarchar classes that are code generated. cc @wesm, @jacques-n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder is it easier to have a separate template for Legency Vectors instead of if statements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about that. I am actually thinking if instead of having legacy vectors, can we just put all the new vectors in a different package. Changing the template code to generate legacy vectors and do delegations to new vectors is like a throwaway work because eventually we will get rid of template all-together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, I have made the changes to template to delegate the ops for Nullable Int and Nullable VarChar. Building things now locally and trying to run tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add "Use XXX instead" in deprecation warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnionVector is another mess. Should we keep UnionVector out of ARROW-1463?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe refactor allocationMonitor to it's own class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to err on the side of less objects and less inner classes. So that is why thought we can maintain the allocationMonitor state through a simple counter. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to not have a object, but maybe put the allocationMonitor logic in base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we have other vector types inheriting from BaseValueVector. Once the entire hierarchy under BaseValueVector is refactored, we can move few things to BaseValueVector. There are a few things which are duplicated between BaseNullableFixedWidthVector and BaseNullableVariableWidthVector. Later on they can be moved to top level base class -- BaseValueVector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, I am leaning towards refactor allocationMonitor logic to a standalone class (if the logic is indeed the same among different vectors). I'd rather not maintaining the same logic in many different places...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just trying to phase out the work. Does the above explanation make some sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does make sense. It's fine. We can always refactor in a later patch if we want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me @siddharthteotia to move it to a top level class later. Do you think you could add a comment in the code for places that you anticipate changing in later phases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will add note(s) in the code. Thanks @BryanCutler and @icexelloss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function not performance critical so it's fine to recompute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand. Are you saying we memoize the output of getNullCount()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, kind of, something like keeping a nullCount variable in the class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be surprising to user this function is not O(1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessarily to resolve in this PR. Can you put a TODO for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems to be same as getNullCount in fix length vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be final protected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct. I missed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this common/well-defined enough to be moved to BitVectorHelper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm..Do we need the logger method here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvm I see different vector has different logger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method seems generic enough, can this be in base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is getMutator removed from the new API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, one of the goals is to simplify the access and reduce the number of objects/classes. So all the getters and setters are directly accessible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, the existing Spark code won' compile with this then. @BryanCutler or I can make the change in Spark, but in general I wonder if this is too disruptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doubling the number of getter/setter API is also not great, maybe we can add deprecated method and remove them in the next two releases after 0.8 or sth. This is more a question for @wesm and @jacques-n to if we are willing to break user code in 0.8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the potential plan is to have the following:
Legacy Nullable Vector
Nullable Vector -- the ones being implemented as part of refactoring
The latter one is not mutator/accessor based.
The existing consumers of these Nullable vectors should be able to continue using them through the LegacyNullable vector which is merely a delegation to new code for each API.
For example, if in Dremio code, we replace each reference to NullableIntVector with LegacyNullableIntVector, the code should still work without any surprises.
We can keep the legacy code around for now and then remove them all together (along with NullableValueVectors template) in the next release or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok to remove accessor/mutator here.
After thinking about this a bit more, I feel maintaining backward compatibility of the new vectors might not to possible. The reasons are because I think we want to rename some vectors properly:
- DateDayVector, DateMillisVector ->DateVector
- IntervalDayVector -> IntervalVector
- Timestamp*Vector -> TimestampVector
- MapVector -> StructVector
Since backward compatibility is broken anyway, we might as well just introduce new classes of vectors that is are slightly different (e.g, no accessor/mutator).
Also, I assume we would want ArrowReaders to return VectorSchemaRoot with new classes of vector in 0.8, is that correct?
Also, ArrowWriters should work with both legency and new vectors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I updated the requirement doc and added the renaming requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on renaming. Are we also planning on removing the "Nullable" prefix to vectors? like NullableIntVector -> IntVector. I know that's currently what the inner vectors are called but after the refactor it won't really make sense to call it NullableIntVector because the inner vectors will be gone and all vectors are nullable right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's probably right. @jacques-n , what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to remove Mutator/Accessor, we should remove these function so it's compile time error rather than runtime error for the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for that we need to remove the generic Accessor, Mutator interface all together from ValueVector and its subinterfaces.
Since Legacy vectors are having mutator/accessor based access -- we may still need these temporary 2 functions getMutator() and getAccessor() just to comply with the interface until Legacy vectors have been removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class has no public static methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, only used in the java vector package to manipulate the validity buffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a public util class that has no public methods to be weird. Maybe making this class package private if that's the intention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
The patch finally builds fine :) Running existing tests now. |
dcf5ef3 to
27770fd
Compare
|
@siddharthteotia Can you create a branch and change this PR to be merged to that branch? |
|
I created a |
|
Thanks Wes!
…On Thu, Oct 12, 2017 at 11:50 AM Wes McKinney ***@***.***> wrote:
I created a java-vector-refactor branch for you to collaborate in
https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=shortlog;h=refs/heads/java-vector-refactor
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1164 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAwbrN8s9m3b0w48sXOHPtQI8QvFE0QIks5srjU4gaJpZM4PuDJ7>
.
|
88b89fe to
94888e5
Compare
|
Closing this one. |
Close apache#1164 Close apache#1198 Change-Id: If18e42d2edfdfef83e83621334a5b65a390e9db9
Close apache#1164 Close apache#1198 Change-Id: If18e42d2edfdfef83e83621334a5b65a390e9db9
cc @jacques-n , @BryanCutler , @icexelloss , @elahrvivaz
Prototype code (WIP) for BaseNullableFixedWidthVector abstract class and the implementation of subclass NullableIntVector.
ValueVector -> FixedWidthVector -> BaseNullableFixedVector (implements FieldVector) -> NullableIntVector
(1) No mutators and accessors
(2) Minimal delegation -- directly works with buffers for validity and actual data.
(3) No code generation
(4) Majority of the functionality abstracted out in the base class.
(5) Short call stack on hot paths.
(6) Buffer sharing is not yet done -- planning to first flush out the basic interfaces and get the infrastructure ready with new hierarchy.
(7) Need to figure out how to implement getFieldBuffers, loadFieldBuffers and other related methods that rely on vectors being BufferBacked.