Skip to content

Fix num_rows in sys.segments#6888

Merged
jihoonson merged 17 commits intoapache:masterfrom
surekhasaharan:fix-num-rows
Feb 12, 2019
Merged

Fix num_rows in sys.segments#6888
jihoonson merged 17 commits intoapache:masterfrom
surekhasaharan:fix-num-rows

Conversation

@surekhasaharan
Copy link
Copy Markdown

@surekhasaharan surekhasaharan commented Jan 20, 2019

num_rows column of sys.segments was incorrectly reported for some case. It happened because the segmentMetadataInfo in DruidSchema was overwritten in addSegment when the segment had more than one replica. This PR fixes:

  • segmentMetadataInfo update in DruidSchema
  • Add numRows to SegmentMetadataHolder builder's constructor, so it's not overwritten
  • Rename setSegmentSignature to setSegmentMetadataHolder
  • Replace Map<String, Set<String>> with Set<String> for num_replica

* Fix segmentMetadataInfo update in DruidSchema
* Add numRows to SegmentMetadataHolder builder's constructor, so it's not overwritten
* Rename SegSegmentSignature to setSegmentMetadataHolder and fix it so nested map is appended instead of recreated
* Replace Map<String, Set<String>> segmentServerMap with Set<String> for num_replica
@fjy fjy added this to the 0.14.0 milestone Jan 20, 2019
@fjy fjy added the Bug label Jan 20, 2019
@jihoonson
Copy link
Copy Markdown
Contributor

@surekhasaharan would you please fix the conflicts?

// DataSource -> Segment -> SegmentMetadataHolder(contains RowSignature) for that segment.
// Use TreeMap for segments so they are merged in deterministic order, from older to newer.
// This data structure need to be accessed in a thread-safe way since SystemSchema accesses it
// This data structure need to be accessed in a thread-safe way via lock Object
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still better to say why the concurrency issue happens too.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated comments

{
Map<DataSegment, SegmentMetadataHolder> segmentsMetadata = schema.getSegmentMetadata();
Set<DataSegment> segments = segmentsMetadata.keySet();
Assert.assertEquals(segments.size(), 3);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The arguments have an order. The expected value should be the first argument.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point

segmentsMetadata = schema.getSegmentMetadata();
existingSegment = segments.stream().findFirst().orElse(null);
final SegmentMetadataHolder currentHolder = segmentsMetadata.get(existingSegment);
Assert.assertEquals(updatedHolder.getNumRows(), currentHolder.getNumRows());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check other fields as well.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test was for numRows, but sure can add other fields as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kind of unit tests is to check the an entire state of a snapshot rather than a particular value of the snapshot. There's no reason to check individual values in different tests under the same situation.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

}

@Test
public void testSegmentMetadataHolderNumRows()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add a comment about what this method is testing? It's not intuitive from the code.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment

Set<DataSegment> segments = segmentsMetadata.keySet();
Assert.assertEquals(segments.size(), 3);
DataSegment existingSegment = segments.stream().findFirst().orElse(null);
Assert.assertFalse(existingSegment == null);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.assertNotNull(existingSegment)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ImmutableDruidServer server = null;
for (ImmutableDruidServer druidServer : druidServers) {
for (DataSegment segment : druidServer.getSegments()) {
if (segment == existingSegment) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks that this is to find a druidServer holding existingSegment. Why not

druidServers.stream()
                .flatMap(druidServer -> druidServer.getSegments().stream())
                .filter(segment -> segment.equals(existingSegment))
                .findAny()
                .orElse(null)

? You also may want to make existingSegments final.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't think you're comparing these segments with == for performance reason. Please use equals() instead.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to use streams api instead of nested loops

2L, //partition_num
1L, //num_replicas
3L, //numRows
2L, //numRows
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please explain what this change is for?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is because I created another ROWS3 and index3 above, such that now (segment3, index3) is added to walker, to make the test code more intuitive.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.


private void addSegment(final DruidServerMetadata server, final DataSegment segment)
@VisibleForTesting
protected void addSegment(final DruidServerMetadata server, final DataSegment segment)
Copy link
Copy Markdown
Contributor

@justinborromeo justinborromeo Feb 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be better as package private instead of protected to minimize scope? Since there's no subclasses in other packages, is there a need to be protected?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, no need for this and others to be protected, can be changed to default package access.


private void setSegmentSignature(final DataSegment segment, final SegmentMetadataHolder segmentMetadataHolder)
@VisibleForTesting
protected void setSegmentMetadataHolder(final DataSegment segment, final SegmentMetadataHolder segmentMetadataHolder)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this also be package-private?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah

public long getNumReplicas()
{
return segmentServerMap.get(segmentId).size();
return segmentServers.size();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we return long in this method when Set#size() returns an int? Might not be worth changing because it might blow up other stuff.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It used to be int, and then got changed to long at some point. The reason is because we support nulls for long here, but not for ints ? Not very sure...but will likely leave it to be long here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@surekhasaharan thanks for the quick fix. I left some trivial comments.

// Use TreeMap for segments so they are merged in deterministic order, from older to newer.
// This data structure need to be accessed in a thread-safe way since SystemSchema accesses it
// This data structure need to be accessed in a thread-safe way via lock Object since segments can be added,
//removed, refreshed or accessed asynchronously
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please add a space after //.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, yeah will add at all the places

segmentsMetadata = schema.getSegmentMetadata();
existingSegment = segments.stream().findFirst().orElse(null);
final SegmentMetadataHolder currentHolder = segmentsMetadata.get(existingSegment);
Assert.assertEquals(updatedHolder.getNumRows(), currentHolder.getNumRows());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kind of unit tests is to check the an entire state of a snapshot rather than a particular value of the snapshot. There's no reason to check individual values in different tests under the same situation.

2L, //partition_num
1L, //num_replicas
3L, //numRows
2L, //numRows
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

.orElse(null);
Assert.assertNotNull(existingSegment);
final SegmentMetadataHolder existingHolder = segmentsMetadata.get(existingSegment);
//update SegmentMetadataHolder of existingSegment with numRows=5
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please make the style of comments consistent. I think it's better to add a space after // because most of comments do.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for other comments.

private static final int MAX_SEGMENTS_PER_QUERY = 15000;
private static final long IS_PUBLISHED = 0;
private static final long IS_AVAILABLE = 1;
private static final long NUM_ROWS = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three should really be called DEFAULT_IS_PUBLISHED, DEFAULT_IS_AVAILABLE, and DEFAULT_NUM_ROWS.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, renamed

// Use TreeMap for segments so they are merged in deterministic order, from older to newer.
// This data structure need to be accessed in a thread-safe way since SystemSchema accesses it
// This data structure need to be accessed in a thread-safe way via lock Object
private final Map<String, TreeMap<DataSegment, SegmentMetadataHolder>> segmentMetadataInfo = new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this comment, you could add @GuardedBy("lock"). It sends the same message, and could also help with automated bug detection.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good suggestion, thanks.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI. @surekhasaharan thanks!

@jihoonson jihoonson merged commit 02ef14f into apache:master Feb 12, 2019
jon-wei pushed a commit to implydata/druid-public that referenced this pull request Feb 12, 2019
* Fix the bug with num_rows in sys.segments

* Fix segmentMetadataInfo update in DruidSchema
* Add numRows to SegmentMetadataHolder builder's constructor, so it's not overwritten
* Rename SegSegmentSignature to setSegmentMetadataHolder and fix it so nested map is appended instead of recreated
* Replace Map<String, Set<String>> segmentServerMap with Set<String> for num_replica

* Remove unnecessary code and update test

* Add unit test for num_rows

* PR comments

* change access modifier to default package level

* minor changes to comments

* PR comments
surekhasaharan pushed a commit to surekhasaharan/druid that referenced this pull request Feb 12, 2019
* Fix the bug with num_rows in sys.segments

* Fix segmentMetadataInfo update in DruidSchema
* Add numRows to SegmentMetadataHolder builder's constructor, so it's not overwritten
* Rename SegSegmentSignature to setSegmentMetadataHolder and fix it so nested map is appended instead of recreated
* Replace Map<String, Set<String>> segmentServerMap with Set<String> for num_replica

* Remove unnecessary code and update test

* Add unit test for num_rows

* PR comments

* change access modifier to default package level

* minor changes to comments

* PR comments
surekhasaharan pushed a commit to surekhasaharan/druid that referenced this pull request Feb 12, 2019
* Fix the bug with num_rows in sys.segments

* Fix segmentMetadataInfo update in DruidSchema
* Add numRows to SegmentMetadataHolder builder's constructor, so it's not overwritten
* Rename SegSegmentSignature to setSegmentMetadataHolder and fix it so nested map is appended instead of recreated
* Replace Map<String, Set<String>> segmentServerMap with Set<String> for num_replica

* Remove unnecessary code and update test

* Add unit test for num_rows

* PR comments

* change access modifier to default package level

* minor changes to comments

* PR comments
jihoonson pushed a commit that referenced this pull request Feb 12, 2019
* Fix the bug with num_rows in sys.segments

* Fix segmentMetadataInfo update in DruidSchema
* Add numRows to SegmentMetadataHolder builder's constructor, so it's not overwritten
* Rename SegSegmentSignature to setSegmentMetadataHolder and fix it so nested map is appended instead of recreated
* Replace Map<String, Set<String>> segmentServerMap with Set<String> for num_replica

* Remove unnecessary code and update test

* Add unit test for num_rows

* PR comments

* change access modifier to default package level

* minor changes to comments

* PR comments
surekhasaharan pushed a commit to implydata/druid-public that referenced this pull request Feb 13, 2019
* Fix the bug with num_rows in sys.segments

* Fix segmentMetadataInfo update in DruidSchema
* Add numRows to SegmentMetadataHolder builder's constructor, so it's not overwritten
* Rename SegSegmentSignature to setSegmentMetadataHolder and fix it so nested map is appended instead of recreated
* Replace Map<String, Set<String>> segmentServerMap with Set<String> for num_replica

* Remove unnecessary code and update test

* Add unit test for num_rows

* PR comments

* change access modifier to default package level

* minor changes to comments

* PR comments
@surekhasaharan surekhasaharan deleted the fix-num-rows branch February 20, 2019 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants