-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Reduce Pulsar IO Connectors size #9638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
eolivelli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me as far as CI passes
Not withstanding that reducing size is good, one thing we discussed yesterday, around moving connectors to separate repo, was to have the integration tests to just build a minimal image (not like pulsar-all) and use that for tests (as you already did in #9625). That image will be able to get cached within GH actions, because it won't contain all the connectors. |
|
/pulsarbot run-failure-checks |
3 similar comments
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
freeznet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
/pulsarbot run-failure-checks |
2 similar comments
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
7d27d52 to
4cf685e
Compare
|
/pulsarbot run-failure-checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is safe to exclude all dependencies that are part of Pulsar Functions Worker's system classloader.
This is true for ThreadRuntime currently but not for ProcessRuntime / kubernetes runtijme. ThreadRuntime is also not doing the right thing since Ideally user provided code is isolated on completely separate classloader that only share common interfaces with the framework classloader.
While I am all for decreasing the sizes of connectors. This will have other consequences.
|
/pulsarbot run-failure-checks |
3 similar comments
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
@jerrypeng Thanks for pointing this out. I spent a few hours investigating this to get a better understanding how the classloaders are configured in Pulsar Functions. I found the documentation of the classloader hierarchy for Process Runtime in JavaInstanceMain class: Lines 34 to 47 in 00463ad
There's also logic that impacts how the classpaths get configured: pulsar/pulsar-functions/runtime/src/main/java/org/apache/pulsar/functions/runtime/RuntimeUtils.java Lines 286 to 297 in 6497177
By default Because of the above logic where
There are integration tests which test both the process runtime and the thread runtime for some connectors. It seems like a major gap that not all connectors are tested with the thread runtime and the process runtime. You are right that there will be unknown consequences as long as there isn't test coverage for both scenarios. As explained in the previous observations of the thread vs. process/k8s runtime (no actual difference in classloading with default settings), the changes made in this PR are safe in general. |
Use provided scope when applicable
org.apache.logging.slf4j.Log4jLogger cannot be cast to ch.qos.logback.classic.Logger
4cf685e to
04f674c
Compare
|
btw. I noticed in some local tests that pulsar-io/flume connector is most likely broken. There aren't any integration tests to verify it's behavior. The Flume libraries depend on older Avro version which isn't compatible with the one used in Pulsar. That's why I reverted some changes I had made in pulsar-io/flume/pom.xml to mark Avro dependencies with provided . |
|
/pulsarbot run-failure-checks |
|
@lhotari Does NOT load the user's function JARs. It is suppose to load the Pulsar Function framework JARs thus it does load all of the Pulsar platform dependencies. The root classloader that contains only the interfaces in which the user defined function interacts with framework is defined here: and passed into the The root classloader is subsequently pass into the The When However, this is not the case for In general, for platform that supports third party plugins or executing user submitted code, it is best if classpaths are isolated and transitive dependencies are not shared across platform and user code. This will cause a lot of dependency versioning issues and limit what versions dependencies user submitted code can use. @lhotari I appreciate your effort to solve the issues with testing and to understand the Pulsar Function code. Perhaps we can find another solution here? Looking forward to working with you in the Pulsar community! |
|
@jerrypeng Thanks for your reply. I'll take a closer look tomorrow and go through the details. As long as we find a solution to get reduce the Pulsar IO Connectors size, I'm fine with that. |
|
Also because this PR: That separated pulsar-client-admin into two modules pulsar-client-admin and pulsar-client-admin-api. pulsar-cient-admin-api module pulls in a bunch of dependencies which doesn't make sense for a "api" module. The java-instance that gets loaded by the root classloader should only contain those few interfaces but now contain org.apache.pulsar:pulsar-functions-runtime-all:jar:2.7.0 This is also a problem. @lhotari this is why the tests pass for the connectors even though dependencies are not packaged explicitly. @sijie that PR has become very problematic for many reasons now. |
|
@lhotari if the primary purpose if to reduce the test image size, perhaps a simple solution would be just to build separate pulsar images with different connectors or sets of connectors and the integration tests can be modified to use the correct one when testing a connector. |
For my CI case it's the primary purpose. @merlimat suggested to write a spec as a GH issue for what you are proposing. However that doesn't resolve the problem that Pulsar users have. The pulsar-all image size is huge. Atm 3GB, but in 2.7 it's 2.25GB which is also too large. IIRC, There's about 400MB coming from duplicate connector jars (which are Pulsar core deps) also in 2.7 pulsar-all docker image. Isn't there a simple solution to get rid of the unnecessary jar files in the .nar files? |
|
@lhotari thanks for filling the issue.
Why do we have to package all the connectors into one image even for users? I doubt a user will need to use all connectors Pulsar has to offer. Perhaps we can have a strategy in which a user can choose which connectors, he or she as actually needs and that gets baked into the image. And for users that just want to use pub sub, I think there is already an image that doesn't contain any connectors.
The thing is NARs are suppose to be self contained which means all the transitive dependencies of the connector needs to be packaged in it as well. There are advantages to this as I mentioned before, however the downside is will duplicate dependencies across all connectors. However, this allows connectors evolve in a more of a vacuum than be dependent versions of libraries used by other connectors or pulsar. |
I agree on this point. However, because of the existing user base of pulsar-all image, it would be a breaking change from the user's perspective if support for pulsar-all is suddenly stopped without a migration path. There is a lot of usage for pulsar-all image. For example, the pulsar-helm-chart uses the pulsar-all image in the default configuration:
NARs might supposed to be self contained, but the technical implementation doesn't support this. It seems that the Pulsar dependencies are in the parent classloader of the NAR classloader in all configurations of Pulsar Functions: thread, process and k8s runtime. Because of this, any library that is part of Pulsar dependencies will get loaded from the parent classloader and not from the jar files embedded in the NAR file. That is the reason why adding any classes that are part of Pulsar dependencies is unnecessary duplication in the Pulsar connector .nar files. I'll spend some more time to verify with experimenting and testing whether it is the case or not. |
Again, this not true or not suppose to be true. I wrote and designed the classloading mechanism in functions so at least i know the intention if for user code to be loaded in a separate classloader than doesn't contain all of pulsar dependencies. However, currently because of issue such as #9246 , the classloader for user code maybe polluted with unnecessary deps. |
We don't have to stop building the pulsar-all image that contains all the connectors. At the same time, we also don't need to be keep using that image for our integration tests. We can also introduce different flavors of images that contain different sets for connectors. |
|
One more: #9817 |
sijie
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologized for being late in reviewing this!
@lhotari I liked the motivation. I am also good with making pulsar-io-core as provided dependency. However, I see you also turn some of the dependencies shared between the Pulsar runtime and connector implementation to the provided scope. I am not comfortable with that change. Because even they are the same dependency, they will be loaded via different class loaders. I am not sure turning them into provided will not break any connector implementation.
If we really want to go down this approach, we should make changes to one connector first and make sure we have extensive integration tests to make sure we don't break the connector, before we make changes to all connectors.
@jerrypeng You are right that the problem is caused by PR #9246 (open issue #9640) . I ran some tests with a k8s cluster with Pulsar 2.7.0 . There, the classloaders are properly isolated and the classloaders contain only the minimum dependencies. Some output from my experiment: (btw. I created a full blown experiment in https://github.com/lhotari/pulsar-inspector-connector which created the classloader report above. I used as a tool to learn how to debug Pulsar Functions in a local k8s cluster since I didn't have the development environment setup for that previously.) On 2.8.0-SNAPSHOT, java-instance.jar contains: I hope #9640 gets addressed asap. |
@sijie Thank you for reviewing. Yes you are right that it's not fine to change the dependencies. I verified some details and wrote a report about that in the previous comment. It's indeed that #9246 (open issue #9640) is causing the issues and it also pollutes the classes that are available in the current parent classloader of the function user code classloader. It seems that it would be fine to make the dependencies I hope #9640 gets addressed asap since I think it could be considered as a blocker for the 2.8.0 release. |
Fixes #9572
Motivation
See #9572 for problem description. The total size of Pulsar IO files is currently 1952MB. The changes in #9246 made the problem worse, but the root cause of the problem is not #9246 . The Pulsar IO Connectors contained unnecessary files already before that change. This PR doesn't intend to fix the problems of leaked dependencies that is an issue caused by #9246 . There's a separate issue #9640 for handling the regression that #9246 caused.
It is safe to exclude all dependencies that are part of Pulsar Functions Worker's system classloader. The reason for this is that classloaders use parent-first lookups (by default, and also in Pulsar Functions Worker). The simplest way to do this exclusion is to use the
providedscope when applicable.Reducing the Pulsar IO files size is necessary to get the size of the pulsar-test-latest-version docker image size reduced so that it is feasible to transfer the docker image over the network in order to build the image once in the Pulsar CI GitHub Actions workflow and reuse the image in the integration tests that use that image.
Modifications