feat(storage): implement multi stream manager for async downloads #15857
Conversation
Summary of Changes: Hello @v-pratap, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request significantly enhances the asynchronous download capabilities for Google Cloud Storage by introducing a MultiStreamManager to pool and reuse gRPC connections.
Code Review
This pull request introduces a MultiStreamManager to implement scalable multi-stream reading for async downloads, which is a significant feature for improving performance. The implementation includes pooling and reusing gRPC connections, load-balancing requests, and maintaining a proactive background connection. The changes are well-structured, with the new logic encapsulated in MultiStreamManager and integrated into ObjectDescriptorImpl. The accompanying tests are thorough and cover the new complex asynchronous logic well. My feedback focuses on a few opportunities to simplify the API of the new MultiStreamManager class for better clarity and maintainability.
```cpp
void MoveActiveRanges(StreamIterator from, StreamIterator to) {
  to->active_ranges = std::move(from->active_ranges);
}
```
```cpp
 private:
  std::list<Stream> streams_;
  StreamFactory stream_factory_;
```
The stream_factory_ member seems to be unused in the primary use case within ObjectDescriptorImpl. The responsibility of creating streams lies with ObjectDescriptorImpl, which then adds them to the manager. This suggests that stream creation is not a core responsibility of MultiStreamManager in this design.
Consider removing stream_factory_ and simplifying the constructors. For example:

```cpp
// In multi_stream_manager.h

// Remove the factory-based constructor.
// explicit MultiStreamManager(StreamFactory stream_factory) ...

// Simplify this constructor to not take a factory.
explicit MultiStreamManager(std::shared_ptr<StreamT> initial_stream) {
  streams_.push_back(Stream{std::move(initial_stream), {}});
}

// ... private members ...
// std::list<Stream> streams_;
// StreamFactory stream_factory_;  // <-- remove this
```

The test code would need a small adjustment:

```cpp
// In multi_stream_manager_test.cc
static Manager MakeManager() {
  return Manager(std::make_shared<FakeStream>());
}
```

This change would make the class's responsibility clearer: it only manages streams, it doesn't create them.
```cpp
streams_.push_back(
    Stream{std::move(stream), {}, resume_policy_prototype_->clone()});
stream_manager_ = std::make_unique<StreamManager>(
    []() -> std::shared_ptr<ReadStream> { return nullptr; },  // NOLINT
```
This no-op lambda returning nullptr highlights that the stream_factory parameter in MultiStreamManager's constructor is not used in this context. This is related to my suggestion on google/cloud/storage/internal/async/multi_stream_manager.h to simplify the MultiStreamManager API, which would remove the need for this parameter.
This PR implements scalable multi-stream reading for Zonal Buckets by introducing a MultiStreamManager to pool and reuse gRPC connections. It mitigates performance regressions from frequent stream creation by reusing idle streams and load-balancing requests, while maintaining a proactive background connection to minimize latency during ramp-up.
Class UML diagrams of the final architecture: Link
Benchmarking script used: Link
Benchmarking the changes by reading 500 GB of data (thread count = 1):
Single stream: 1.14 GB/s
Multi stream with reuse (PoC code by vaibhav): 1.13 GB/s
Multi stream with reuse (current PR code): 1.14 GB/s
Benchmarking the changes by reading 500 GB of data (thread count = 30):
Single stream: 1.09 GB/s
Multi stream with reuse (PoC code by vaibhav): 1.48 GB/s
Multi stream with reuse (current PR code): 1.49 GB/s
Benchmarking by varying the MaximumRangeSizeOption value:
64 MB size: 1.52 GB/s
128 MB size: 1.59 GB/s
256 MB size: 1.53 GB/s
Code-accurate control flow diagram for this snippet: