Skip to content

Fix node discovery to ignore unknown DruidServices#12157

Merged
jihoonson merged 8 commits intoapache:masterfrom
jihoonson:ignore-unparseable-druid-service
Jan 19, 2022
Merged

Fix node discovery to ignore unknown DruidServices#12157
jihoonson merged 8 commits intoapache:masterfrom
jihoonson:ignore-unparseable-druid-service

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

Description

This PR fixes a bug that DiscoveryDruidNode cannot be created while deserializing it from a JSON when there is an unknown DruidService. This bug can be seen when you do rolling update. Suppose you have a custom MyDruidService that is added in a new version, v2. While you are upgrading your cluster from v1 to v2, other nodes who watch the node that is upgraded first and announces MyDruidService will not understand what MyDruidService is until they are upgraded to v2. In this case, this bug currently can fail node discovery. This PR fixes the bug by ignoring unknown DruidServices during deserialization of DiscoveryDruidNode.


Key changed/added classes in this PR
  • DiscoveryDruidNode

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@jihoonson jihoonson added the Bug label Jan 15, 2022
Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 after CI

Copy link
Copy Markdown
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!

public class DiscoveryDruidNode
{
private static final Logger LOG = new Logger(DiscoveryDruidNode.class);
private static final TypeReference<Map<String, Object>> RAW_DRUID_SERVICE_TYPE =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reference to this static variable, do I miss something?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! It is a left over that I forgot to delete after I cleaned up my PR. I will delete it.

)
{
Map<String, DruidService> services = new HashMap<>();
if (rawServices != null && !rawServices.isEmpty()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can use org.apache.druid.utils.CollectionUtils.isNullOrEmpty to check the rawServices

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that method cannot be used as Map is not a Collection.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. Maybe we can add some overridden methods to this util class in the future.

Copy link
Copy Markdown
Member

@asdf2014 asdf2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Overall LGTM. It seems that after the previous PR was merged, the forbiddenapis plugin checks fine now, but there are some other errors reported during the CI process that don't seem to be related to the forbiddenapis plugin check, as follows:

  • Max number of retries[240] exceeded for Task[waiting for SQL metadata refresh]
  • Background operation retry gave up ConnectionLossException: KeeperErrorCode = ConnectionLoss
  • AssertionError: lists don't have the same size expected [2] but found [0]
  • RuntimeException: org.apache.druid.java.util.common.ISE: one or more queries failed

@jihoonson
Copy link
Copy Markdown
Contributor Author

jihoonson commented Jan 17, 2022

@clintropolis, @cryptoe, @FrankChen021, @asdf2014 thank you for the review. The change in this PR revealed a bug in DataNodeService that it has duplicate "type" properties in its serialized JSON, one for the subtype key of DruidService and another for serverType in DataNodeService. It seems that things happened to be working before this change as the subtype key is always written first in the serialized JSON and Jackson always picks up the first "type" property as the subtype key for deserialization. However, this doesn't work anymore because DiscoveryDruidNode now deserializes the JSON to a Map first and then converts it to a DruidService to ignore unknown service types. Since the map does not allow duplicate keys, the second "type" overwrites the first "type" during deserialization which caused integration test failures. This bug is a bit tricky to fix since we cannot just rename the "type" property of "DataNodeService". It will break rolling upgrade if we do that. To not break the rolling upgrade, we should ensure that the serialized JSON of DataNodeService has both duplicate "type" properties, but the "type" for the subtype key always appears first in the JSON. To do this, I added a custom serializer (DruidServiceSerializer) and a custom deserialization logic in DiscoveryDruidNode that uses StringObjectPairList as an intermediary form during deserialization. DruidServiceSerializer might be not required as Jackson seems to always write the subtype key first during serialization, but I added it anyway just in case where it behaves differently in some edge case. Please see the Javadoc of DataNodeService for more details and let me know what you think.

@FrankChen021
Copy link
Copy Markdown
Member

@jihoonson I see. The JsonTypeInfo annotation defined on the DruidService class tells the jackson to add a type property automatically during serialization. And this type property conflicts with the type field defined in DataNodeService.

I don't have a better suggestion considering the backward compatibility.

I'm wondering if there's any way that we can check or avoid such conflict.
How about we can add constraint on the property name of JsonTypeInfo annotation, for example it should be _type? Such name won't conflict with user defined field name because this naming style is not allowed in users' code.

@jihoonson
Copy link
Copy Markdown
Contributor Author

Yeah, I think it makes sense to use a special name as the subtype key such as "_type" or "@type" in the future. We cannot easily modify existing ones unfortunately though since it will break rolling upgrade due to the same reason as why I could not just get rid of the "type" from DataNodeService.

@jihoonson
Copy link
Copy Markdown
Contributor Author

@clintropolis thank you for the review. @cryptoe @FrankChen021 @asdf2014 do you have more comments?

@jihoonson
Copy link
Copy Markdown
Contributor Author

@FrankChen021 thank you!

@jihoonson jihoonson merged commit cc2ffc6 into apache:master Jan 19, 2022
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants