Skip to content

Conversation

@siddharthteotia
Copy link
Contributor

@siddharthteotia siddharthteotia commented Oct 4, 2017

cc @jacques-n , @BryanCutler , @icexelloss , @elahrvivaz

Prototype code (WIP) for BaseNullableFixedWidthVector abstract class and the implementation of subclass NullableIntVector.

ValueVector -> FixedWidthVector -> BaseNullableFixedVector (implements FieldVector) -> NullableIntVector

(1) No mutators and accessors
(2) Minimal delegation -- directly works with buffers for validity and actual data.
(3) No code generation
(4) Majority of the functionality abstracted out in the base class.
(5) Short call stack on hot paths.
(6) Buffer sharing is not yet done -- planning to first flush out the basic interfaces and get the infrastructure ready with new hierarchy.
(7) Need to figure out how to implement getFieldBuffers, loadFieldBuffers and other related methods that rely on vectors being BufferBacked.

@siddharthteotia
Copy link
Contributor Author

Pushed new changes based on comments on design spec.

@elahrvivaz
Copy link
Contributor

elahrvivaz commented Oct 5, 2017

I like the general direction. would it make sense to have some of the basic getter/setting methods pulled out to the base class with type parameters? or would that be bad for performance... i.e.

BaseNullableFixedWidthVector<T> {
  public void set(int index, T value);
  public T get(int index);
}

One of the pain points I've experienced is no consistent get/set methods, so when working with different vectors they all have to be cast to the exact type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can go in base class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get to only one specific transfer pair construction method (possibly protected/internal only)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's avoid skipping over arrowbuf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a base vector method that does the while(realloc) pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move this out of here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's take these out of the vector classes and have shim classes that can be constructed for Mutator and Accessor if people want them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance we can move most of this to a base class and use a generic parameter? Just have the copyValueSafe be per class?

@icexelloss
Copy link
Contributor

@siddharthteotia This looks good so far. What I am concerned with are code duplication and maintainability if we hand write all the vector class. Can you maybe show (or describe) how NullableLongVector and IntVector would work? Specially for those setter/getter method.

I think @wesm suggested not having non-nullable/nullable vectors be two separate class to reduce the number of classes. We should discussion more on this.

@icexelloss
Copy link
Contributor

@elahrvivaz I think boxing/unboxing will be a performance issue if we use generics. So that's not an option here?

@icexelloss
Copy link
Contributor

@siddharthteotia

nit: Can you reformat the description in markdown? You can do list like this:

  • Something
  • Something else

@jacques-n
Copy link
Contributor

@icexelloss: My preference is also to avoid having separate int and nullable int vectors. Only have NullableIntVector. As far as duplication goes, I think the second commit here seems to be small enough number of things that we could skip the templates and just manage a little more code.

@elahrvivaz I'm not entirely clear on how the casting helps with generics. one of the holder methods could be made to follow this pattern. Can you provide an example pattern where the generic setter will help avoid any casting? (Or are you just talking about less casting?) We built the reader/writer interfaces specifically to avoid casting (in cases where people don't use generated code).

@wesm
Copy link
Member

wesm commented Oct 5, 2017

should we create a wip-java-refactor where this work can take place without disrupting master? I presume we will require multiple patches

@icexelloss
Copy link
Contributor

icexelloss commented Oct 5, 2017

@jacques-n I see. Yeah the NullableIntVector does seem manageable.

Edit: doesn't -> does

@jacques-n
Copy link
Contributor

@icexelloss, I was saying the NullableIntVector does seem manageable. Were you agreeing or disagreeing?

@icexelloss
Copy link
Contributor

@jacques-n Ahh, sorry, there was a typo! Yes I agree.

@elahrvivaz
Copy link
Contributor

@jacques-n I haven't been able to use the writer framework, as it doesn't support all the operations we're doing (like dictionary encoding and fixed size lists, if I remember correctly). Our main data structure is a GeoTools SimpleFeature, which is basically just an Array<Object> with well-defined types at each index. For a lot of our work, I can set things up front to be the right types, and then say have an array of Accessors and call .getObject() on them. However, Mutator doesn't have a generic setter method, so instead I have to cast down to the exact implementing class to call all the set methods.
@icexelloss possibly we could have the generic boxed methods, and specific unboxed methods in each impl? For me specifically I have to box everything anyway, but maybe that's not worth complicating the api for

@BryanCutler
Copy link
Member

BryanCutler commented Oct 5, 2017

Wow, this is great so far! I really like that you were able to able to simplify and rid of Accessor and Mutator and a +1 for me for having a single NullableIntVector class without codegen. I think these are fairly stable now, so maintenance for all the vector types shouldn't be a big headache.

Is it possible we could rename vector classes too NullableIntVector to just IntVector since there is just one now, or is that too confusing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get typewidth from fieldtype? Or maybe add a mappng in ArrowType that you could query here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, I guess we need the TYPE_WIDTH in the concrete class for get/set methods anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be a good time to remove the MinorTypes and just use FieldTypes - I believe @jacques-n discussed this a while ago, that they don't actually capture all the field information.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we could add a getType to return the ArrowType obj, that would be great. Something like what #972 proposed to avoid having to cast would be helpful. But I'd hate to add more work to this, maybe better to discuss after the main stuff is done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also found MinorType to be weird. But it's probably hard to remove due to compability? We could consider deprecating those though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to requirement 7 in the requirement doc. Let's keep this out of this PR and discuss separately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, best to discuss separately. We will be merging multiple multiple patches to the new branch "java vector refactor" . We can do this once rest of the new code is stabilized, existing functionality is not broken etc.

Later on we can merge this branch into master once refactoring is done or this particular thing can be addressed directly on master once refactor branch has been merged. I am open to ideas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would that ever be the case? They should always move together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have too many instance objects here. Let's figure out how to reduce.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid passing field type and storing Field. Field is actually pretty heap memory heavy and can be derived based on other values here.

@siddharthteotia
Copy link
Contributor Author

siddharthteotia commented Oct 9, 2017

cc @jacques-n , @BryanCutler , @icexelloss , @elahrvivaz

Notes on recent changes:

Addressed some comments. Some in progress.

  • The current patch has the basic implementation of interfaces for Nullable Fixed and Variable Width vectors with corresponding concrete implementation of respective subclasses - NullableIntVector, NullableVarCharVector.

  • Question: do we need allocationMonitor in our vector code? It complicates the logic of allocateNewSafe() and realloc().

  • Question: do we need setting reader and writer index on arrow bufs inside vectors? I think it doesn't really matter and vector get/set methods should still work fine.

  • I have made BaseValueVector as the top level abstract super class for the new hierarchy. Moved some methods there. I am not yet sure if we should have a complete different top level class like BaseNullableVector or use BaseValueVector. I think this is important since regardless of fixed or variable width property of vectors, there is still some common behavior which can be abstracted out in top level class.

@siddharthteotia siddharthteotia changed the title ARROW-1473: [JAVA] initial prototype hierarchy ARROW-1473: [JAVA - WIP] initial prototype hierarchy Oct 9, 2017
@siddharthteotia
Copy link
Contributor Author

My initial thoughts are to solidify this prototype -- flush out all APIs with complete implementation and tests and address comments. Right now we have concrete implementation for NullableInt and NullableVarChar in this prototype.

Can we have a separate project trunk/branch where we could merge patches for this work? Later on we can move the changes to Arrow master branch?

As we do this, we also need to think about getting rid of existing concrete implementation of vectors that is code generated. For example, once this prototype patch gets merged, we will need to remove NullableInt and NullableVarchar classes that are code generated.

cc @wesm, @jacques-n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder is it easier to have a separate template for Legency Vectors instead of if statements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about that. I am actually thinking if instead of having legacy vectors, can we just put all the new vectors in a different package. Changing the template code to generate legacy vectors and do delegations to new vectors is like a throwaway work because eventually we will get rid of template all-together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, I have made the changes to template to delegate the ops for Nullable Int and Nullable VarChar. Building things now locally and trying to run tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add "Use XXX instead" in deprecation warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnionVector is another mess. Should we keep UnionVector out of ARROW-1463?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe refactor allocationMonitor to it's own class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to err on the side of less objects and less inner classes. So that is why thought we can maintain the allocationMonitor state through a simple counter. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to not have a object, but maybe put the allocationMonitor logic in base class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we have other vector types inheriting from BaseValueVector. Once the entire hierarchy under BaseValueVector is refactored, we can move few things to BaseValueVector. There are a few things which are duplicated between BaseNullableFixedWidthVector and BaseNullableVariableWidthVector. Later on they can be moved to top level base class -- BaseValueVector.

Copy link
Contributor

@icexelloss icexelloss Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I am leaning towards refactor allocationMonitor logic to a standalone class (if the logic is indeed the same among different vectors). I'd rather not maintaining the same logic in many different places...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just trying to phase out the work. Does the above explanation make some sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does make sense. It's fine. We can always refactor in a later patch if we want to.

Copy link
Member

@BryanCutler BryanCutler Oct 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me @siddharthteotia to move it to a top level class later. Do you think you could add a comment in the code for places that you anticipate changing in later phases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will add note(s) in the code. Thanks @BryanCutler and @icexelloss

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function not performance critical so it's fine to recompute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand. Are you saying we memoize the output of getNullCount()?

Copy link
Contributor

@icexelloss icexelloss Oct 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, kind of, something like keeping a nullCount variable in the class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be surprising to user this function is not O(1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's think about it.

Copy link
Contributor

@icexelloss icexelloss Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessarily to resolve in this PR. Can you put a TODO for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems to be same as getNullCount in fix length vector

Copy link
Contributor

@icexelloss icexelloss Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be final protected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. I missed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this common/well-defined enough to be moved to BitVectorHelper?

Copy link
Contributor

@icexelloss icexelloss Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm..Do we need the logger method here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm I see different vector has different logger.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems generic enough, can this be in base class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is getMutator removed from the new API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, one of the goals is to simplify the access and reduce the number of objects/classes. So all the getters and setters are directly accessible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the existing Spark code won' compile with this then. @BryanCutler or I can make the change in Spark, but in general I wonder if this is too disruptive.

Copy link
Contributor

@icexelloss icexelloss Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doubling the number of getter/setter API is also not great, maybe we can add deprecated method and remove them in the next two releases after 0.8 or sth. This is more a question for @wesm and @jacques-n to if we are willing to break user code in 0.8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the potential plan is to have the following:
Legacy Nullable Vector
Nullable Vector -- the ones being implemented as part of refactoring

The latter one is not mutator/accessor based.

The existing consumers of these Nullable vectors should be able to continue using them through the LegacyNullable vector which is merely a delegation to new code for each API.

For example, if in Dremio code, we replace each reference to NullableIntVector with LegacyNullableIntVector, the code should still work without any surprises.

We can keep the legacy code around for now and then remove them all together (along with NullableValueVectors template) in the next release or so.

Copy link
Contributor

@icexelloss icexelloss Oct 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to remove accessor/mutator here.

After thinking about this a bit more, I feel maintaining backward compatibility of the new vectors might not to possible. The reasons are because I think we want to rename some vectors properly:

  • DateDayVector, DateMillisVector ->DateVector
  • IntervalDayVector -> IntervalVector
  • Timestamp*Vector -> TimestampVector
  • MapVector -> StructVector

Since backward compatibility is broken anyway, we might as well just introduce new classes of vectors that is are slightly different (e.g, no accessor/mutator).

Also, I assume we would want ArrowReaders to return VectorSchemaRoot with new classes of vector in 0.8, is that correct?

Also, ArrowWriters should work with both legency and new vectors?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I updated the requirement doc and added the renaming requirement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on renaming. Are we also planning on removing the "Nullable" prefix to vectors? like NullableIntVector -> IntVector. I know that's currently what the inner vectors are called but after the refactor it won't really make sense to call it NullableIntVector because the inner vectors will be gone and all vectors are nullable right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's probably right. @jacques-n , what do you think?

Copy link
Contributor

@icexelloss icexelloss Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to remove Mutator/Accessor, we should remove these function so it's compile time error rather than runtime error for the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for that we need to remove the generic Accessor, Mutator interface all together from ValueVector and its subinterfaces.

Since Legacy vectors are having mutator/accessor based access -- we may still need these temporary 2 functions getMutator() and getAccessor() just to comply with the interface until Legacy vectors have been removed.

Copy link
Contributor

@icexelloss icexelloss Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class has no public static methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, only used in the java vector package to manipulate the validity buffer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a public util class that has no public methods to be weird. Maybe making this class package private if that's the intention?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@siddharthteotia
Copy link
Contributor Author

The patch finally builds fine :) Running existing tests now.

@icexelloss
Copy link
Contributor

@siddharthteotia Can you create a branch and change this PR to be merged to that branch?

@wesm
Copy link
Member

wesm commented Oct 12, 2017

I created a java-vector-refactor branch for you to collaborate in https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=shortlog;h=refs/heads/java-vector-refactor

@icexelloss
Copy link
Contributor

icexelloss commented Oct 12, 2017 via email

asfgit pushed a commit that referenced this pull request Oct 14, 2017
Close #1164
Close #1198

Change-Id: If18e42d2edfdfef83e83621334a5b65a390e9db9
@siddharthteotia siddharthteotia changed the title ARROW-1473: [JAVA - WIP] initial prototype hierarchy ARROW-1463: [JAVA - WIP] refactoring Oct 14, 2017
@siddharthteotia
Copy link
Contributor Author

Closing this one.

siddharthteotia added a commit to siddharthteotia/arrow that referenced this pull request Nov 14, 2017
Close apache#1164
Close apache#1198

Change-Id: If18e42d2edfdfef83e83621334a5b65a390e9db9
wesm pushed a commit that referenced this pull request Nov 15, 2017
Close #1164
Close #1198

Change-Id: If18e42d2edfdfef83e83621334a5b65a390e9db9
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Close apache#1164
Close apache#1198

Change-Id: If18e42d2edfdfef83e83621334a5b65a390e9db9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants