Is this a duplicate?
Area
Not sure
Is your feature request related to a problem? Please describe.
Usage of cuda.parallel in applications like llm.c (example) is currently blocked by the lack of support for cache-modified iterators.
Describe the solution you'd like
We need a functional alternative to cache-modified iterators in cuda.parallel.itertools. The design might follow the API that @fbusato came up with in #2487. For instance:
```python
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_streaming_input = cudax.itertools.accessor(d_input, "eviction_policy::no_allocation")
cudax.reduce(d_streaming_input)
```

should lead to streaming loads of `d_input` (the `ld.global.cs` instruction in PTX).
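To make the proposal concrete, here is a minimal host-side sketch of what the `accessor` wrapper could look like. This is purely hypothetical: `StreamingAccessor` and the `accessor` factory are illustrative names, not part of any existing cuda.parallel API, and the actual implementation would need to thread the eviction policy through to codegen so the load is emitted as `ld.global.cs`.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class StreamingAccessor:
    # Hypothetical wrapper pairing a device array with an eviction
    # policy. An algorithm like cudax.reduce would inspect
    # `eviction_policy` and lower loads from `array` to
    # cache-modified PTX loads (e.g. ld.global.cs for streaming).
    array: Any
    eviction_policy: str

def accessor(array: Any, eviction_policy: str) -> StreamingAccessor:
    # Hypothetical factory mirroring the API sketched above;
    # validation of the policy string is an assumption.
    known = {
        "eviction_policy::no_allocation",
        "eviction_policy::first",
        "eviction_policy::last",
    }
    if eviction_policy not in known:
        raise ValueError(f"unknown eviction policy: {eviction_policy}")
    return StreamingAccessor(array, eviction_policy)
```

An algorithm entry point such as `cudax.reduce` would then dispatch on the wrapper type rather than a raw array, keeping the default (non-streaming) path unchanged.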
Describe alternatives you've considered
No response
Additional context
No response