[FLINK-11391][shuffle] Introduce PartitionShuffleDescriptor and ShuffleDeploymentDescriptor #7631
Conversation
azagrebin left a comment:
Thanks @zhijiangW, I have left some comments for discussion.
Review thread on:

PartitionShuffleDescriptor psd = PartitionShuffleDescriptor.from(targetSlot, executionId, partition, maxParallelism);
producedPartitions.add(ResultPartitionDeploymentDescriptor.fromShuffleDescriptor(psd));
getCurrentExecutionAttempt().cachePartitionShuffleDescriptor(partition.getIntermediateResult().getId(), psd);
azagrebin:
Would it work if the complete TaskDeploymentDescriptor were just cached as a volatile field in Execution? Maybe we would not need any of the three descriptor caches, what do you think?
zhijiangW:
From a functional aspect, caching the TaskDeploymentDescriptor might also make sense. But I have other concerns:

1. The structure of the TDD is complicated and caching it completely would take more memory, including unnecessary fields such as serializedJobInformation, serializedTaskInformation, etc.

2. We might need to adjust the current collection structure of producedPartitions, inputGates in the TDD to a map structure in order to find the required PSD/SDD directly for other usages.

3. If we replace the current three descriptor caches, we might not need the PartialInputChannelDeploymentDescriptor class any more, if I understand correctly. But I wonder whether there are scenarios where, during deployment of the consumer execution, only some input channel descriptors are unknown. When sending partition infos we only want to send these unknown infos, so how can we distinguish them from the cached producer TDDs? In other words, the currently cached partialInputChannelDeploymentDescriptors might be only a sub-collection of all cached TDDs on the producer side.
azagrebin:
1. (and 2) Ok, I agree, let's cache it similarly to what we have now, or maybe we could already prepare a Map<IntermediateDataSetID, PartitionInfo> on the producer side?

3. As I understand, the whole concurrent caching was done for the case when the consumer's deploy and scheduleOrUpdateConsumers happened concurrently. Now this should no longer be the case and all state transitions should happen on the main thread of the Job Master, so any cache maps could be just a HashMap.
As far as I see, there are 2 cases:

- deploy the consumer: here we rely on the partition.isConsumable flag to decide whether the input channel SDD/Location is known or not at the moment (scheduleOrUpdateConsumers has happened or not).
- scheduleOrUpdateConsumers:
  - first of all, the partition.isConsumable flag has been set to true before;
  - in the CREATED and SCHEDULED states we do not have to do anything, because the TDD has not been created yet, and when the TDD is created in deploy() it will already use partition.isConsumable = true to populate known locations;
  - then, if some consumers are DEPLOYING or RUNNING, their TDDs have already been sent with unknown SDD/Locations, so they have to be updated by sendUpdatePartitionInfoRpcCall using cached PartitionInfos; the update message will be applied in the Task after the deploy message.

Currently in master, we call sendPartitionInfos to resolve previous race conditions in several places. Now we basically do not need it, along with partialInputChannelDeploymentDescriptors.
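To illustrate, the handling above could be sketched roughly like this (a hypothetical shape; consumer and cachedPartitionInfos are assumed names):

// in scheduleOrUpdateConsumers, after partition.isConsumable is set to true
switch (consumer.getState()) {
    case CREATED:
    case SCHEDULED:
        // nothing to do: deploy() creates the TDD later and will already see
        // partition.isConsumable == true, populating known locations
        break;
    case DEPLOYING:
    case RUNNING:
        // the TDD was already shipped with unknown SDD/locations,
        // so push the cached PartitionInfo to the running task
        consumer.sendUpdatePartitionInfoRpcCall(cachedPartitionInfos);
        break;
    default:
        break;
}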
What do you think?
zhijiangW:
Yes, I agree with it. We can refactor this process to be simpler than before: remove the PartialInputChannelDeploymentDescriptor class and cache PartitionInfo directly for consumers in DEPLOYING or RUNNING state, to send the update during scheduleOrUpdateConsumers.
Review thread on:

private final LocationType locationType;

/** The connection to use to request the remote partition. */
private final Optional<ConnectionID> connectionId;
azagrebin:
I thought we would just have a ShuffleDeploymentDescriptor here instead of ConnectionID; the SDD also contains the ConnectionID. If the location is unknown (LocationType.Unknown), the SDD field could just be a special singleton implementation of ShuffleDeploymentDescriptor -> UnknownShuffleDeploymentDescriptor, or is it coming later?
The same applies in ResultPartitionDeploymentDescriptor.
zhijiangW:
I also considered using ShuffleDeploymentDescriptor here to replace ConnectionID before. But there are two concerns in the implementation:

1. In eager scheduling mode, when all required slots are received, we cannot assume the deployment sequence strictly follows the topology sequence. That means the consumer execution might be deployed earlier than the producer execution. So in InputChannelDeploymentDescriptor#fromEdges we might not be able to get the cached SDD directly from the producer execution, but we can generate a ConnectionID based on other infos. Otherwise we would have to guarantee the deployment sequence goes from producer to consumer, or generate the producer's SDD while deploying the consumer in InputChannelDeploymentDescriptor#fromEdges.

2. I thought of introducing UnknownShuffleDeploymentDescriptor before, but from a semantic aspect it is a bit redundant with LocationType.Unknown. In addition, there seem to be no specific usages like instanceof UnknownShuffleDeploymentDescriptor in other processes. The SDD should be generated by the ShuffleMaster by design, but the special UnknownShuffleDeploymentDescriptor would be generated only in the case of LocationType.Unknown, which does not go via the ShuffleMaster.
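For concern 1, the fallback could look roughly like this (a sketch assuming the producer's slot is already assigned even though its deployment has not run yet; producerSlot and partition are assumed names):

// derive the producer's ConnectionID from its assigned slot instead of a cached SDD
TaskManagerLocation producerLocation = producerSlot.getTaskManagerLocation();
ConnectionID connectionId = new ConnectionID(
    producerLocation, partition.getIntermediateResult().getConnectionIndex());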
azagrebin:
Ok, I think I see the problem now, thanks for the explanation. I will put my thoughts in a different order :)

During the design, I thought the ShuffleDeploymentDescriptor was supposed to contain shuffle specific info generated by the ShuffleMaster as a central point and used eventually by the ShuffleService in the producer and consumer Task to set up readers/writers.

The example could be some partition identification or connection inside an external shuffle system. The existing connection id/location is also an example of it for the existing netty stack, but it might not be relevant for other shuffle systems.

For example, let's say the partition is stored remotely (not in the producer), the batch job is restored and some of the partitions are finished: we do not even need to deploy the producer, just connect the consumer to the existing 'done' external partition. Then the existing connection id does not make sense; the consumer needs some kind of internal shuffle id of the partition.

That is why I thought: PSD(ProducerResourceId, ProducerConnection, ...) -> ShuffleMaster -> SDD(Internal) -> ICDD(SDD) -> Task -> ICDD, ConsumerResourceId -> ShuffleService -> InputGate -> read records.

I think even the ShuffleService itself can decide what to do with ProducerResourceId/ConsumerResourceId and calculate the LocationType internally in the case of the existing netty. For other shuffle services, LocationType might not be relevant (like an external partition); then maybe ICDD = SDD = PartitionInfo and we could leave only one of them, not sure.

I thought of UnknownShuffleDeploymentDescriptor as a replacement of LocationType.Unknown/ConnectionId=null, based on the above arguments. It is just a singleton stub to signal that the SDD will be updated later by sendUpdatePartitionInfoRpcCall in case of lazy scheduling. True, it is not generated by the ShuffleMaster; what could be an alternative to this approach?

In case of eager deployment (lazyScheduling=false), currently, we can already deploy the consumer when the slot is assigned to the producer but its deployment has not started yet, and we planned to generate the SDD during producer deployment. If we agree on 2., it seems that the consumer needs the SDD to consume, so it has to be known.

Thinking more about the ShuffleMaster interface: depending on its nature, it might be an asynchronous API, e.g. registering with and talking to an external system. This means that ideally its partition register method should return a CompletableFuture<SDD>.

Then the producer execution life cycle would be: created -> scheduled -> slot assigned -> register partition (get and cache the SDD) -> deploying (generate the TDD with the previously acquired SDD), everything happening on the main thread of the Job Master. In eager scheduling, the consumer then has to be deployed not after the producer's slot is assigned but after the partition is registered. In lazy scheduling, we have sendUpdatePartitionInfoRpcCall to send the SDD later.
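Such an interface could be sketched roughly as follows (a shape for discussion, not the final design):

public interface ShuffleMaster<T extends ShuffleDeploymentDescriptor> {
    // asynchronous by design: the implementation may need to talk to an external system
    CompletableFuture<T> registerPartitionProducer(PartitionShuffleDescriptor psd);
}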
I would suggest we do the partition registering and SDD caching in allocateAndAssignSlotForExecution, right after slot assignment (needs a rebase on the latest master):
return FutureUtils.handleAsyncIfNotDone(..tryAssignResource..)
    .thenComposeAsync(
        slot -> { ..ShuffleMaster.register(PSD), cache SDDs.. },
        mainThreadExecutor);
Just maybe with refactoring the steps into different functions :)
zhijiangW:
If I understand correctly, the above comments can be summarized into three points:

1. LocationType can be decided by the ShuffleService by comparing the ResourceIDs of producer and consumer, and the consumer's ResourceID could be covered in the IGDD. Regarding ICDD = SDD = PartitionInfo, my only concern is the one differing field, the IntermediateDataSetID in PartitionInfo, which is needed for finding the proper SingleInputGate to update partition info on the Task side. I would think this through step by step.

2. registerPartition and cachePartition are triggered in allocateAndAssignSlotForExecution and return a future, so we can confirm that when deploying the consumer in eager mode, we can always get the corresponding registered/cached SDDs of the producers.

3. UnknownShuffleDeploymentDescriptor should also be introduced for the lazy deployment mode to indicate that an updated SDD would come later.
I agree with the above points, especially point 2, which solves my previous concern naturally. :)
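For point 1, the locality decision on the shuffle service side could then be as simple as this hypothetical helper (the enum values are assumed):

static LocationType locationTypeOf(ResourceID producerResourceId, ResourceID consumerResourceId) {
    // same TaskExecutor -> local channel; otherwise remote via the ConnectionID in the SDD
    return producerResourceId.equals(consumerResourceId) ? LocationType.LOCAL : LocationType.REMOTE;
}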
Review thread on:

/**
 * Deployment descriptor for shuffle specific information.
 */
public class ShuffleDeploymentDescriptor implements Serializable {
azagrebin:
I think eventually it needs to be an interface, probably an empty one. This class could stay as the implementation for the default shuffle master. Also, the special UnknownShuffleDeploymentDescriptor could extend the interface.
zhijiangW:
If we do that, it might need an explicit getConnectionId method in the interface? Because the ICDD might see either an UnknownShuffleDeploymentDescriptor or a KnownShuffleDeploymentDescriptor, and it should provide a way of getting the ConnectionID if LocationType == Remote.
azagrebin:
True, for the existing netty shuffle it has to have more methods. Internally, I would suggest that the future NettyShuffleService casts the SDD to KnownNettySDD if it is not an UnknownSDD:
interface SDD { }

enum UnknownSDD implements SDD { INSTANCE } // special singleton stub

class KnownNettySDD implements SDD { /* + ProducerResourceId, ProducerConnection, etc. */ }

// later:
class AnyOtherSDD implements SDD { /* other specific shuffle identification */ }
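A short usage sketch of that cast on the consumer side (method name and flow are hypothetical):

void setupInputChannel(SDD sdd) {
    if (sdd == UnknownSDD.INSTANCE) {
        // location not known yet: wait for sendUpdatePartitionInfoRpcCall
        return;
    }
    // existing netty stack: the cast is safe after the instance check above
    KnownNettySDD nettySdd = (KnownNettySDD) sdd;
    // ... use nettySdd's producer connection to create a local or remote channel
}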
zhijiangW: @azagrebin, thanks for your reviews! :)

pnowojski: @azagrebin @zhijiangW is this PR still valid? Or was it subsumed by something else?

zhijiangW: @pnowojski No problem, it should be closed and I forgot it.
What is the purpose of the change
This is a sub-task for introducing the ShuffleMaster component on the JM side, based on the pluggable ShuffleManager architecture.

In the first step, we try to refactor the related information structures during deployment. So we introduce the PartitionShuffleDescriptor (PSD) to cover all the necessary info, which might come from the ExecutionGraph directly or from registration by the TM/ShuffleService.

The ShuffleDeploymentDescriptor (SDD) is also introduced to cover only shuffle specific info; the SDD is created by the ShuffleMaster during registerPartitionProducer.

PSD and SDD would be cached and used for generating the ResultPartitionDeploymentDescriptor (RPDD), InputGateDeploymentDescriptor (IGDD), InputChannelDeploymentDescriptor (ICDD), etc. during producer/consumer task deployments. The relationship between them is roughly PSD + SDD = RPDD/IGDD/ICDD.

In addition, we remove the ResultPartitionLocation structure to separate the ConnectionID and LocationType info. The ConnectionID can be regarded as shuffle specific info which would be covered in PSD, SDD and ICDD. The LocationType is covered only in the ICDD when both producer and consumer are deployed.

Notes:

- The DefaultShuffleMaster here is only for interacting with the related logics; the formal implementation would be done in a separate PR.
- We cannot rely on the deployment sequence of the scheduler; the consumer might be deployed before the producer. So we cannot rely on the producer's PSD/SDD to generate the IGDD/ICDD of the consumer, and this part still relies on the ExecutionEdge.
Brief change log

- Introduce PartitionShuffleDescriptor (PSD)
- Introduce ShuffleDeploymentDescriptor (SDD)
- Remove ResultPartitionLocation
- Adjust the scheduleOrUpdateConsumers process accordingly

Verifying this change
The related tests would be added after the review, to confirm the current refactoring makes sense.
Does this pull request potentially affect one of the following parts:

- @Public(Evolving): (yes / no)

Documentation