
Support fancy iterators in cuda.parallel#2788

Merged
gevtushenko merged 125 commits into NVIDIA:main from rwgk:fancy_iterators
Dec 6, 2024

Conversation

@rwgk
Contributor

@rwgk rwgk commented Nov 13, 2024

Description

closes #2479

closes #2480

closes #2536

Partially done: #2481

rwgk added 30 commits October 15, 2024 14:13
…t then fails with: Fatal Python error: Floating point exception
…resolves the Floating point exception (but the `cccl_device_reduce()` call still does not succeed)
LOOOK single_tile_kernel CALL /home/coder/cccl/c/parallel/src/reduce.cu:116

LOOOK EXCEPTION CUDA error: invalid argument  /home/coder/cccl/c/parallel/src/reduce.cu:703
…rametrize: `use_numpy_array`: `[True, False]`, `input_generator`: `["constant", "counting", "arbitrary", "nested"]`
…iterators.py (because numba.cuda cannot JIT classes).
… `unary_op`, which is then compiled with `numba.cuda.compile()`
… the `"map_mul2"` test and the added `"map_add10_map_mul2"` test work, too.
rwgk added 4 commits December 5, 2024 07:33
…o make it obvious that they are never used as Python methods, but exclusively as source for `numba.cuda.compile()`
@rwgk
Contributor Author

rwgk commented Dec 5, 2024

/ok to test

@github-actions
Contributor

github-actions bot commented Dec 5, 2024

🟩 CI finished in 39m 02s: Pass: 100%/3 | Total: 36m 23s | Avg: 12m 07s | Max: 27m 08s
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 15s | Avg: 4m 37s | Max: 7m 02s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 02s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 02s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 02s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 02s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 02s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 02s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 02s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 13s | Avg:  2m 13s | Max:  2m 13s
      🟩 Test               Pass: 100%/1   | Total:  7m 02s | Avg:  7m 02s | Max:  7m 02s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 08s | Avg: 27m 08s | Max: 27m 08s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 3)

# Runner
2 linux-amd64-gpu-v100-latest-1
1 linux-amd64-cpu16

Contributor

@shwina shwina left a comment


This is looking good! Mainly I have some nits, but one relatively important issue is requiring the output type in TransformIterator. We can choose to punt that to #3064 if needed.



class ConstantIterator:
def __init__(self, val, ntype):
Contributor


I think `ntype` -> `dtype` would be better. The use of Numba should be an implementation detail from the user's perspective. Alternatively, we could just accept a typed scalar like ConstantIterator(np.int32(0)).

Ditto for CountingIterator.
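A toolkit-free sketch of the typed-scalar alternative (the class body here is hypothetical, not the PR's implementation): deriving the dtype from a NumPy scalar removes the explicit `ntype` argument, and the Numba type can be recovered internally via `numba.from_dtype` when needed.

```python
import numpy as np

# Hedged sketch: a hypothetical ConstantIterator that accepts a typed scalar
# like np.int32(0) and derives its dtype from the value, instead of taking a
# separate ntype argument. Internally, numba.from_dtype(self.dtype) would
# produce the Numba type as an implementation detail.
class ConstantIterator:
    def __init__(self, value):
        self.value = value
        self.dtype = np.asarray(value).dtype  # dtype('int32') for np.int32(0)

it = ConstantIterator(np.int32(0))
```

This keeps Numba entirely out of the public signature while still giving the implementation an unambiguous value type.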

Comment on lines +245 to +250
def count_advance(this, diff):
this[0] += diff


def count_dereference(this):
return this[0]
Contributor


Come to think of it, it might be better to make these @staticmethod. After all:

$ python -c "import this"  | grep Namespace
Namespaces are one honking great idea -- let's do more of those!

Contributor Author


Done in commit c3c51a5

I added comments:

# Exclusively for numba.cuda.compile (this is not an actual method).

My thinking:

Without a decorator (as we had originally), people will think it's a bound method but wonder why `self` is called `this`.

Explicitly adding @staticmethod will make people believe it really is a static method, but that's not actually true.

Being explicit in the comment is only slightly more verbose than adding a decorator but much more informative.

Contributor

@shwina shwina Dec 6, 2024


Explicitly adding @staticmethod will make people believe it really is a static method, but that's not actually true.

I don't think I understand. If anything, adding @staticmethod will make it even more obvious to the reader that the function is independent of the class. Typically, functions that have no dependency on the class or its members but are otherwise related to it are defined as @staticmethod.

Contributor


In other words, these are truly staticmethods in every sense
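To illustrate the point being debated (class and method names here are illustrative, not the PR's actual code): accessing a @staticmethod through the class yields a plain function with no binding, so it can be handed to a compiler such as `numba.cuda.compile` unchanged.

```python
# Hedged sketch: iterator ops kept at class scope purely as compile sources.
# A @staticmethod retrieved via the class is a plain function (no implicit
# self/this binding), so it could be passed to numba.cuda.compile as-is.
class CountingIteratorOps:
    @staticmethod
    def advance(this, diff):
        # mutate the iterator state in place
        this[0] += diff

    @staticmethod
    def dereference(this):
        # read the current value
        return this[0]

fn = CountingIteratorOps.dereference  # plain function, not a bound method
state = [7]
CountingIteratorOps.advance(state, 3)  # state is now [10]
```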

return self.it.alignment # TODO fix for stateful op


def TransformIterator(op, it, op_return_ntype):
Contributor


I don't think we should require the op_return_ntype here. Numba should in theory have everything it needs to infer the return type when compiling op.

cuda.compile returns both the LTOIR and the inferred return type; we seem to be discarding the latter in extract_ctypes_ltoirs.

Are we able to use the numba inferred return type and not require it from the user?

If not, it might be because numba doesn't have enough typing information. If that is the case, it will be fixed as part of #3064 by defining numba types corresponding to all of our Iterator types.
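A toolkit-free stand-in for the shape of that API (`fake_compile` and the tuple contents here are illustrative; the real call is `numba.cuda.compile`, which per the comment above returns the code together with the inferred return type):

```python
# Hedged stand-in: numba.cuda.compile returns a (code, return_type) pair.
# If the second element were forwarded instead of discarded, TransformIterator
# would not need a user-supplied op_return_ntype.
def fake_compile(op, sig):
    # pretend compilation: return (ltoir_bytes, inferred_return_type)
    return b"<ltoir>", "int32"

ltoir, return_type = fake_compile(lambda x: 2 * x, ("int32",))
# return_type is now usable in place of an explicit user-supplied type
```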

return 3 * val


SUPPORTED_VALUE_TYPE_NAMES = (
Contributor


Why not just use numpy types, which are trivially convertible to numba types via numba.from_dtype(...)?
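A minimal sketch of that boundary (`normalize_dtype` is a hypothetical helper name): accept anything `np.dtype` understands at the public API, and convert to a Numba type only as an internal step.

```python
import numpy as np

# Hedged sketch: normalize user input to a NumPy dtype at the public API.
# Internally, numba.from_dtype(normalize_dtype(x)) would then yield the
# corresponding Numba type, keeping Numba an implementation detail.
def normalize_dtype(dtype_like):
    # accepts np.int32, "int32", np.dtype("int32"), cp.dtype(...) results, ...
    return np.dtype(dtype_like)
```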

@pytest.mark.parametrize(
"type_obj_from_str", [_iterators.numba_type_from_any, numpy.dtype, cp.dtype]
)
@pytest.mark.parametrize("value_type_name", SUPPORTED_VALUE_TYPE_NAMES)
Contributor


In general, we have found parametrized fixtures to be the better choice when sharing parameters across tests, especially as the codebase evolves:

https://docs.rapids.ai/api/cudf/stable/developer_guide/testing/#parametrization-custom-fixtures-and-pytest-mark-parametrize
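A minimal sketch of the fixture pattern from that guide (fixture and test names here are illustrative):

```python
import pytest

# Hedged sketch: a parametrized fixture shared across tests, following the
# cuDF testing guide linked above. The param list is illustrative.
@pytest.fixture(params=["int32", "uint64", "float32", "float64"])
def value_type_name(request):
    return request.param

def test_uses_every_value_type(value_type_name):
    # when collected by pytest, this runs once per fixture param
    assert isinstance(value_type_name, str)
```

Tests request the fixture by name, so the shared parameter list lives in one place and new dtypes propagate to every test automatically.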

Contributor Author


Done in commit 6aeeff3

Nice. I didn't realize fixtures can be used in this way.

import numba.cuda
import numba.types
import cuda.parallel.experimental as cudax
from cuda.parallel.experimental import _iterators
Contributor

@shwina shwina Dec 6, 2024


If we need something from a non-public submodule in the tests, then it's possible that:

  • it should go in a public API
  • we don't really need it

For instance, we are using _iterators.pointer() to construct inputs for one of our tests. This suggests that pointer() should be a public API (OR we are testing something that we don't expect users to ever do).

Collaborator


I think it might be a leftover. _iterators.pointer() is an implementation detail of the transform iterator (a glue layer that lets it support containers). I would suggest avoiding testing reduce with pointer directly, and testing only reduction of a transformed cp.array.

Contributor Author


Summary of short offline discussion: Maybe in a follow-on PR:

TransformIterator(identity_op, cupy_array, op_return_value_type)

This way we'd still have a test targeted at RawPointer, but through a public API.

)


def TransformIterator(op, it, op_return_value_type):
Collaborator


question: can we infer the return type of `op(it.value_type)` somehow? I'd prefer not to have a value type parameter on the transform iterator if possible.

Suggested change
def TransformIterator(op, it, op_return_value_type):
def TransformIterator(op, it):

Contributor


I think so -- see my comment above.

from . import _iterators


def CacheModifiedInputIterator(device_array, value_type, modifier):
Collaborator


question: can we infer the value type from device_array? I'd prefer not to have a value_type parameter on this iterator if possible. The value type should match the underlying memory's value type exactly.

Contributor


Yes, it should be just numba.from_dtype(device_array.dtype).
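For example (with a NumPy array standing in for a device array, and `infer_value_type` a hypothetical helper; `numba.from_dtype` would then map the inferred dtype to the Numba type):

```python
import numpy as np

# Hedged sketch: infer the value type from the array's own dtype, so
# CacheModifiedInputIterator would not need an explicit value_type argument.
# Any array exposing .dtype (NumPy, CuPy, Numba device arrays) works.
def infer_value_type(device_array):
    return np.dtype(device_array.dtype)

arr = np.zeros(4, dtype=np.float32)  # host stand-in for a device array
```

By construction the inferred type matches the underlying memory's value type exactly, which is the invariant the comment above asks for.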

rwgk added 2 commits December 6, 2024 10:41
… functions back to class scope, with comments to explicitly state that these are not actual methods.
Contributor

@shwina shwina left a comment


In an offline sync with @rwgk and @gevtushenko, we decided to merge this sooner rather than later and follow up to address any remaining review items.

@rwgk
Contributor Author

rwgk commented Dec 6, 2024

/ok to test

@rwgk rwgk marked this pull request as ready for review December 6, 2024 22:08
@github-actions
Contributor

github-actions bot commented Dec 6, 2024

🟩 CI finished in 1h 55m: Pass: 100%/3 | Total: 42m 55s | Avg: 14m 18s | Max: 30m 51s
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 12m 04s | Avg: 6m 02s | Max: 9m 56s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  9m 56s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  9m 56s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  9m 56s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  9m 56s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  9m 56s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  9m 56s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  9m 56s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 08s | Avg:  2m 08s | Max:  2m 08s
      🟩 Test               Pass: 100%/1   | Total:  9m 56s | Avg:  9m 56s | Max:  9m 56s
    
  • 🟩 python: Pass: 100%/1 | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 30m 51s | Avg: 30m 51s | Max: 30m 51s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 3)

# Runner
2 linux-amd64-gpu-v100-latest-1
1 linux-amd64-cpu16
