[FEA]: Introduce cache-modified input iterator into cuda.parallel #2536

@gevtushenko

Description


Is this a duplicate?

Area

Not sure

Is your feature request related to a problem? Please describe.

Usage of cuda.parallel in applications like llm.c (example) is currently blocked by the lack of support for cache-modified iterators.

Describe the solution you'd like

We need a functional alternative to the cache-modified input iterator in cuda.parallel.itertools. The design might follow the API that @fbusato came up with in #2487. For instance:

d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_streaming_input = cudax.itertools.accessor(d_input, "eviction_policy::no_allocation")
cudax.reduce(d_streaming_input)

should lead to streaming loads of d_input (the ld.global.cs instruction in PTX).
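For illustration, here is a minimal pure-Python sketch of the wrapper pattern the request describes: an accessor that tags an input with an eviction policy, which the algorithm could later lower to the matching PTX load modifier. Everything here (the accessor function, CacheModifiedInput class, policy-to-PTX table, and host-side reduce_sum) is a hypothetical stand-in, not the real cuda.parallel or cudax API:

```python
from dataclasses import dataclass
from functools import reduce as py_reduce

# Hypothetical mapping from eviction policy to PTX load modifier.
# "no_allocation" is assumed to correspond to streaming loads (ld.global.cs).
POLICY_TO_PTX = {
    "eviction_policy::no_allocation": "ld.global.cs",
    "eviction_policy::default": "ld.global",
}

@dataclass
class CacheModifiedInput:
    """Wraps an input buffer with a cache-eviction hint for its loads."""
    data: list
    policy: str

    @property
    def ptx_load(self) -> str:
        # The load modifier a kernel would emit for this input.
        return POLICY_TO_PTX[self.policy]

def accessor(data, policy):
    # Stand-in for the proposed cudax.itertools.accessor: it does not
    # copy or transform the data, it only attaches the cache hint.
    return CacheModifiedInput(data, policy)

def reduce_sum(inp):
    # A real implementation would dispatch a device kernel whose global
    # loads use inp.ptx_load; here we simply reduce on the host.
    return py_reduce(lambda a, b: a + b, inp.data)

d_input = [8, 6, 7, 5, 3, 0, 9]
d_streaming = accessor(d_input, "eviction_policy::no_allocation")
print(d_streaming.ptx_load)     # ld.global.cs
print(reduce_sum(d_streaming))  # 38
```

The key design point is that the cache hint travels with the iterator rather than with the algorithm call, so the same reduce entry point works unchanged for plain and cache-modified inputs.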

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Assignees

Labels

No labels

Type

No type

Projects

Status

Done

Milestone

No milestone
