Skip to content

Conversation

@chihsinh
Copy link
Contributor

@chihsinh chihsinh commented Jul 2, 2025

This patch refines the current union operation to an internal LOP operation. Currently, two subsequent operations -- rbind() and unique(), are used to perform the union operation. We rewrite the operation with an internal LOP that uses a HashSet to compute the unique entries and returns them in a matrix. This improves the efficiency of the operation, as it avoids unique(). The order of the input entries is preserved in the output.

chihsinh added 3 commits July 2, 2025 17:04
This patch refines the current union operation to an internal LOP operation. Currently, two subsequent operations -- rbind() and unique(), are used to perform the union operation. We rewrite the operation with an internal LOP that uses a HashSet to compute the unique entries and returns them in a matrix. This improves the efficiency of the operation, as it avoids unique(). The order of the input entries is preserved in the output.
@mboehm7
Copy link
Contributor

mboehm7 commented Jul 2, 2025

Thanks for the patch @chihsinh - could you please fix the missing license headers, revert the replacement of the detailed imports with a wildcard import, and benchmark this hash map list of double implementation against a hash map of sliced out matrix blocks.

@mboehm7
Copy link
Contributor

mboehm7 commented Jul 3, 2025

Btw, the rewrite test seems to fail because rewriteSimplifyUnionDistinct.dml cannot be found - this likely due to the local tests running in Windows, which is case-insensitive, while the github action tests run in Linux which is case-sensitive. Please fix the capitalization but also rename to something else; otherwise git on windows will have problems because it cannot differentiate old and new files, again because of the case insensitivity.

@codecov
Copy link

codecov bot commented Jul 3, 2025

Codecov Report

Attention: Patch coverage is 95.19231% with 5 lines in your changes missing coverage. Please review.

Project coverage is 72.94%. Comparing base (64455b9) to head (ffe7fce).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...ds/runtime/instructions/cp/UnionCPInstruction.java 89.47% 1 Missing and 1 partial ⚠️
.../apache/sysds/runtime/matrix/data/MatrixBlock.java 97.01% 0 Missing and 2 partials ⚠️
.../rewrite/RewriteAlgebraicSimplificationStatic.java 91.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2286      +/-   ##
============================================
- Coverage     72.96%   72.94%   -0.02%     
- Complexity    46097    46109      +12     
============================================
  Files          1479     1480       +1     
  Lines        172654   172757     +103     
  Branches      33796    33818      +22     
============================================
+ Hits         125970   126025      +55     
- Misses        37192    37241      +49     
+ Partials       9492     9491       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mboehm7
Copy link
Contributor

mboehm7 commented Jul 12, 2025

LGTM - thanks for the patch @chihsinh. The code looked already pretty good. During the merge, I only removed the stdout printing from the instruction, and deduplicated the core kernels for single-column union and multi-column union a bit.

@mboehm7 mboehm7 closed this in 5aa0307 Jul 12, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in SystemDS PR Queue Jul 12, 2025
ghafek pushed a commit to ghafek/systemds that referenced this pull request Jul 14, 2025
This patch refines the current union operation to an internal LOP
operation. Currently, two subsequent operations -- rbind() and unique()
are used to perform the union operation. We rewrite the operation with
an internal LOP that uses a HashSet to compute the unique entries and
returns them in a matrix. This improves the efficiency of the
operation, as it avoids unique(). The order of the input entries is
preserved in the output.

Closes apache#2286.
ghafek pushed a commit to ghafek/systemds that referenced this pull request Nov 27, 2025
This patch refines the current union operation to an internal LOP
operation. Currently, two subsequent operations -- rbind() and unique()
are used to perform the union operation. We rewrite the operation with
an internal LOP that uses a HashSet to compute the unique entries and
returns them in a matrix. This improves the efficiency of the
operation, as it avoids unique(). The order of the input entries is
preserved in the output.

Closes apache#2286.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants