Skip to content

Rename partition spec fields#8507

Merged
gianm merged 5 commits intoapache:masterfrom
ccaominh:partition-spec-field-names
Sep 20, 2019
Merged

Rename partition spec fields#8507
gianm merged 5 commits intoapache:masterfrom
ccaominh:partition-spec-field-names

Conversation

@ccaominh
Copy link
Copy Markdown
Contributor

@ccaominh ccaominh commented Sep 11, 2019

Description

Rename partition spec fields to be consistent across the various types (hashed, single_dim, dynamic). Specifically, use targetNumRowsPerSegment and maxRowsPerSegment instead of targetPartitionSize and maxSegmentSize. Consistent and clearer names are easier for users to understand and use.

Also fix various IntelliJ inspection warnings and doc spelling mistakes.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.

Rename partition spec fields to be consistent across the various types
(hashed, single_dim, dynamic). Specifically, use targetNumRowsPerSegment
and maxRowsPerSegment in favor of targetPartitionSize and
maxSegmentSize. Consistent and clearer names are easier for users to
understand and use.

Also fix various IntelliJ inspection warnings and doc spelling mistakes.
!PartitionsSpec.isEffectivelyNull(targetPartitionSize) || !PartitionsSpec.isEffectivelyNull(maxRowsPerSegment),
"Either targetPartitionSize or maxRowsPerSegment must be specified"
(target.value == null) != (max.value == null),
"Exactly one of " + target.name + " or " + max.name + " must be present"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also use StringUtils here, like

StringUtils.format("Exactly one of %s or %s must be present", target.name, max.name)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of using StringUtils.format()? This string does not need anything locale-specific.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the style i have seen mostly used in codebase and looks better in my opinion.

Comment thread docs/ingestion/hadoop.md
|maxPartitionSize|Maximum number of rows to include in a partition. Defaults to 50% larger than the targetPartitionSize.|no|
|targetRowsPerSegment|Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|yes|
|targetPartitionSize|Deprecated. Use `targetRowsPerSegment` instead. Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|no|
|maxRowsPerSegment|Maximum number of rows to include in a partition. Defaults to 50% larger than the `targetPartitionSize`.|no|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the docs here could be more clear that the *PartitionSize and *RowsPerSegment parameters are equivalent, just with different names

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll explicitly call out the rename/equivalence.

public HashedPartitionsSpec(
@JsonProperty("targetPartitionSize") @Deprecated @Nullable Integer targetPartitionSize,
@JsonProperty("maxRowsPerSegment") @Nullable Integer maxRowsPerSegment,
@JsonProperty(PartitionsSpec.MAX_ROWS_PER_SEGMENT) @Nullable Integer maxRowsPerSegment,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the single dim spec, it seems like it would be better to call this targetRowsPerSegment to match targetPartitionSize, but that would mean deprecating yet another property

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, or it may be better to have a different rename for maxPartitionSize in the single dim spec, since targetPartitionSIze and maxRowsPerSegment were equivalent before

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean to rename "targetPartitionSize" in HashedPartitionsSpec to "targetRowsPerSegment"? My understanding is that hashed partitions do not have a "target" size as it is really a "max" size (which is why the former is deprecated).

@jihoonson described to me that when #8141 did some refactoring, the behavior of "targetPartitionSize" in SingleDimensionPartitionsSpec was wrongly made to match that of "maxRowsPerSegment". This PR separates the two so that single dimension partitioning can be made to honor a "target" size.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it, thanks for clarifying

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean to rename "targetPartitionSize" in HashedPartitionsSpec to "targetRowsPerSegment"? My understanding is that hashed partitions do not have a "target" size as it is really a "max" size (which is why the former is deprecated).

Ah sorry, probably there was some misunderstanding. targetPartitionSize i nHashedPartitionsSpec (which I deprecated in favor of maxRowsPerSegment) is actually the target number of rows per segment. You can see how to compute numShards based on targetPartitionSize here. So, IMO, targetRowsPerSegment makes most sense for it too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I'll update HashedPartitionsSpec to deprecate maxRowsPerSegment and to add targetRowsPerSegment.

Copy link
Copy Markdown
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI

@gianm gianm merged commit 99b6eed into apache:master Sep 20, 2019
@ccaominh ccaominh deleted the partition-spec-field-names branch September 20, 2019 21:06
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019
@jon-wei jon-wei mentioned this pull request Dec 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants