feat(k8s): Add multi-processing support to PipelineStep macro#232
feat(k8s): Add multi-processing support to PipelineStep macro#232
Conversation
Automatically detect and configure multiprocessing based on pipeline configuration. When parallelism.multi_process.processes is specified: - Multiply CPU and memory resources by process count - Mount /dev/shm shared memory volume for IPC - Validate that only one segment specifies parallelism Fixes STREAM-707
Semver Impact of This PR🟡 Minor (new features) 📋 Changelog PreviewThis is how your changes will appear in the changelog. New Features ✨
Other
🤖 This preview updates automatically when you update the PR. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| cpu_per_process, | ||
| memory_per_process, | ||
| segment_id, | ||
| process_count, |
There was a problem hiding this comment.
Multiprocessing resources applied to wrong segments
High Severity
The process_count from get_multiprocess_config() is applied unconditionally to the current segment_id being deployed, without checking if that segment actually has parallelism configured. If segment 0 has no multiprocessing but segment 1 has processes: 4, deploying segment 0 would incorrectly receive 4x resources and the /dev/shm volume. The code retrieves segments_with_parallelism but never checks whether segment_id is in that list before applying resource scaling.
Additional Locations (1)
There was a problem hiding this comment.
Can you add a test to cover this case? I don't think this is a valid bug but what is expected to happen where multiple segments are passed in and the first one is not the one that requires parallelism?
There was a problem hiding this comment.
I was about to add the test but then I figured out that the comment is wrong.
This is an example of the pipeline with parallelism
pipeline:
segments:
- steps_config:
myinput:
starts_segment: True
bootstrap_servers: ["127.0.0.1:9092"]
parallelism: 1
parser:
# Parser is the beginning of the segment.
# All Map steps in the same segment are chained
# together in the same process.
#
# When adding a step to the segment that is not
# a map we need to create a new segment as these
# cannot be ran in a multi process step.
starts_segment: True
parallelism:
multi_process:
processes: 4
batch_size: 1000
batch_time: 0.2
mysink:
starts_segment: True
bootstrap_servers: ["127.0.0.1:9092"]
WE pass --segment-id = 0 to run it but technically speaking, the parallel segment is the second.
This is because segments do not have a sound semantics. We have segments in the segments list but we can also start segments implicitly inside a single step.
We need to figure this out, though we cannot assert that the segment with the parallelism config would be the one passed to the consumer as this is never the case.
| if not multi_process: | ||
| continue | ||
|
|
||
| processes = multi_process.get("processes") |
There was a problem hiding this comment.
Missing type check causes crash on non-dict multi_process
Medium Severity
The code checks isinstance(parallelism, dict) before calling .get() on it, but there's no corresponding check for multi_process. If multi_process is a truthy non-dict value (e.g., multi_process: true or multi_process: 1 in YAML), the condition if not multi_process passes, and then multi_process.get("processes") crashes with AttributeError: 'bool' object has no attribute 'get'.
Co-authored-by: Cursor <cursoragent@cursor.com>
84dc0d6 to
9fea014
Compare


Summary
Adds automatic multi-processing support to the PipelineStep Kubernetes macro. When a pipeline configuration specifies
parallelism.multi_process.processes, the macro now:/dev/shmshared memory volume required for Python multiprocessing IPCImplementation Details
Updated:
build_container()process_countparameter1000m CPU × 4 processes = 4000m CPU/dev/shmvolume mount whenprocess_count > 1Updated:
PipelineStep.run()dshmemptyDir volume withmedium: "Memory"to deploymentFixes
Fixes STREAM-707