Adding cross thread reduction on GPU#1

Open
sakehl wants to merge 10 commits into sakehl:main from PimVanLeeuwen:main
Conversation

@sakehl commented Jan 16, 2024

No description provided.


if (is_gpu_var(loop->name)) {
// initialize the repsentive sum value (might already be loaded in case of sum[0])
stream << get_indent() << "_sum[" << print_name(loop->name) << "] = _sum[" << print_name(loop->name) << "] + "<< print_name(loop->name) << ";\n";
@sakehl (Owner, Author) commented:

I think you actually want to store the intermediate result in a local array, something like:
__local int local_sum[1024];
The size needs to be a compile-time constant, so you can start with just 1024 or whatever, but the actual value is probably stored in:
loop->extent

Furthermore, you hard-code _sum here.

I think you can get the name of the stored buffer from the loop body, so that it also works for any name other than sum:

const Store *store = loop->body.as<Store>();
std::string store_name = print_name(store->name);

And lastly, instead of "] + " << print_name(loop->name) << ";\n"; we can probably write:

stream << ... << "] + ";
const Add *add = loop->body.as<Store>()->value.as<Add>();
add->b.accept(this); // This will print the contents of b
stream << ";\n";
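As a rough illustration of the suggestion above, here is a minimal, standalone sketch of an emitter that declares the local array and stores one partial value per work-item. The helper name `emit_local_init` and its parameters are hypothetical and not part of the PR; the right-hand side is passed in as an already-printed string rather than taken from the IR, and the element type is assumed to be `int` as in the comment.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the PR): emits OpenCL C text that declares a
// work-group-local scratch array and stores one value per work-item into it.
//   array_name : name of the scratch array, e.g. "local_sum"
//   thread_var : printed name of the GPU thread loop variable
//   extent     : array size; it must be a compile-time constant in OpenCL C
//   rhs        : already-printed right-hand side, e.g. operand b of the Add
std::string emit_local_init(const std::string &array_name,
                            const std::string &thread_var,
                            int extent,
                            const std::string &rhs) {
    std::ostringstream stream;
    // Declaration of the scratch array in the __local address space.
    stream << "__local int " << array_name << "[" << extent << "];\n";
    // Each work-item writes its own slot, indexed by the thread variable.
    stream << array_name << "[" << thread_var << "] = " << rhs << ";\n";
    return stream.str();
}
```

Calling `emit_local_init("local_sum", "tid", 1024, "x[tid]")` yields the declaration followed by `local_sum[tid] = x[tid];`. In a real code generator the declaration would likely have to be hoisted to kernel function scope, since OpenCL C only allows `__local` variable declarations there.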

stream << get_indent() << "_sum[" << print_name(loop->name) << "] = _sum[" << print_name(loop->name) << "] + "<< print_name(loop->name) << ";\n";

// Wait for all threads to do this
stream << get_indent() << "barrier(CLK_LOCAL_MEM_FENCE);\n";
@sakehl (Owner, Author) commented:

To be exact, I think you need a CLK_GLOBAL_MEM_FENCE as well, i.e. barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE), since you are storing the results in global memory (sum).

@sakehl (Owner, Author) commented:

However, if you switch to local memory, this is fine as is.
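The two cases in this thread could be captured in a small helper that picks the fence flags based on where the intermediate values live. This is only a sketch: the `ScratchSpace` enum and the `barrier_call` function are hypothetical names introduced for illustration, not something that exists in the PR.

```cpp
#include <string>

// Hypothetical enum (not from the PR): the memory space that holds the
// reduction's intermediate values.
enum class ScratchSpace { Local, Global };

// Returns the OpenCL C barrier statement matching the scratch space.
// Local scratch only needs the local fence; global scratch, as in the
// current PR code, gets the global fence as well.
std::string barrier_call(ScratchSpace space) {
    if (space == ScratchSpace::Local) {
        return "barrier(CLK_LOCAL_MEM_FENCE);";
    }
    return "barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);";
}
```

Either way, barrier only synchronizes the work-items of a single work-group, so this pattern assumes the whole reduction runs inside one group.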

// compute the sum based on parallel reduction, wait on each thread after each loop step
stream << get_indent() << "for (unsigned int i = group_size_2 / 2; i > 0; i >>= 1) {;\n";
stream << get_indent() << " if (" << print_name(loop->name) << " < i) {\n";
stream << get_indent() << " _sum[" << print_name(loop->name) << "] += _sum[" << print_name(loop->name) << " + i];\n";
@sakehl (Owner, Author) commented:

Probably use local_sum here instead of _sum.

stream << get_indent() << " _sum[" << print_name(loop->name) << "] += _sum[" << print_name(loop->name) << " + i];\n";
stream << get_indent() << " }\n";
stream << get_indent() << " barrier(CLK_LOCAL_MEM_FENCE);\n";
stream << get_indent() << "}\n";
@sakehl (Owner, Author) commented:

Then afterwards we can have the first thread write the result back (an assignment, not a comparison):

if (print_name(loop->name) == 0) {
  sum[0] = local_sum[0];
}
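To sanity-check the overall scheme (fill local_sum, halve the stride with a barrier after each step, let thread 0 write back), here is a small CPU model of it. This is a sketch under assumptions, not the generated kernel: the function name `simulate_group_reduction` is made up, each index `tid` plays the role of one work-item, and finishing the inner loop before moving to the next stride stands in for barrier(CLK_LOCAL_MEM_FENCE).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of the work-group tree reduction (hypothetical helper).
// The input holds one partial value per work-item. The halving loop only
// covers every element when the group size is a power of two, so that is
// asserted here as a precondition.
int simulate_group_reduction(const std::vector<int> &values) {
    const std::size_t group_size = values.size();
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);

    // Stands in for: __local int local_sum[group_size];
    std::vector<int> local_sum(values);

    for (std::size_t i = group_size / 2; i > 0; i >>= 1) {
        // All "work-items" execute this step. Writes go to indices < i and
        // reads come from indices >= i, so the order within a step is free.
        for (std::size_t tid = 0; tid < group_size; ++tid) {
            if (tid < i) {
                local_sum[tid] += local_sum[tid + i];
            }
        }
        // End of the inner loop models the barrier after each step.
    }

    // Models work-item 0 executing: sum[0] = local_sum[0];
    return local_sum[0];
}
```

For the input {1, 2, 3, 4, 5, 6, 7, 8} the model returns 36. The power-of-two precondition is worth noting for the real kernel too: with a group size such as 6, the loop starting at group_size / 2 would skip some elements.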
