src/CodeGen_OpenCL_Dev.cpp
Outdated
| if (is_gpu_var(loop->name)) { | ||
| // initialize the respective sum value (might already be loaded in case of sum[0]) | ||
| stream << get_indent() << "_sum[" << print_name(loop->name) << "] = _sum[" << print_name(loop->name) << "] + "<< print_name(loop->name) << ";\n"; |
I think you probably want to store the intermediate result in a local array, something like:
__local int local_sum[1024];
The size needs to be a compile-time constant, so you can start with just 1024 or whatever, but the actual value is probably stored in:
loop->extent
Furthermore, you hard-code _sum here.
I think you can get the buffer name like this (so that it can be given any other name as well):
const Store *store = loop->body.as<Store>();
string store_name = print_name(store->name);
And lastly, instead of "] + " << print_name(loop->name) << ";\n"; we can probably write:
stream << ... << "] + ";
const Add *add = loop->body.as<Store>()->value.as<Add>();
add->b.accept(this); // This should print the contents of b
stream << ";\n";
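A minimal sketch of the suggestion above, written as standalone C++ rather than against Halide itself. The helpers `emit_local_decl` and `emit_local_accumulate` and all of their parameters are hypothetical; in the real code generator the buffer name would presumably come from print_name(store->name), the extent from loop->extent, and the right-hand side from visiting add->b.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: emits the declaration of the local scratch array,
// e.g. "__local int local_sum[1024];". The extent is assumed to be a
// compile-time constant by the time this is called.
std::string emit_local_decl(const std::string &indent,
                            const std::string &buffer_name,
                            int extent) {
    std::ostringstream stream;
    stream << indent << "__local int " << buffer_name
           << "[" << extent << "];\n";
    return stream.str();
}

// Hypothetical helper: emits "<buf>[<tid>] = <buf>[<tid>] + <rhs>;"
// without hard-coding the buffer name or the added expression.
std::string emit_local_accumulate(const std::string &indent,
                                  const std::string &buffer_name,
                                  const std::string &thread_var,
                                  const std::string &rhs) {
    std::ostringstream stream;
    stream << indent << buffer_name << "[" << thread_var << "] = "
           << buffer_name << "[" << thread_var << "] + " << rhs << ";\n";
    return stream.str();
}
```

Passing the names in as strings keeps the emitted kernel text independent of any particular buffer name, which is the point of the comment about not hard-coding _sum.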
| // Wait for all threads to do this | ||
| stream << get_indent() << "barrier(CLK_LOCAL_MEM_FENCE);\n"; |
To be exact, I think you need CLK_GLOBAL_MEM_FENCE as well, since you are storing the results in global memory (sum).
However, if you switch to local memory, this is fine as is.
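To illustrate the two cases, here is a small standalone C++ sketch of how the barrier line could be emitted depending on where the partial sums live. The `MemSpace` enum and `emit_barrier` helper are hypothetical and not part of the code under review. The only OpenCL facts relied on are that barrier() takes a flags argument and that CLK_LOCAL_MEM_FENCE and CLK_GLOBAL_MEM_FENCE may be combined with a bitwise OR.

```cpp
#include <string>

// Hypothetical: which memory the reduction buffer lives in.
enum class MemSpace { Local, Global, Both };

// Hypothetical helper: returns the barrier statement to emit into the kernel.
std::string emit_barrier(const std::string &indent, MemSpace space) {
    std::string flags;
    switch (space) {
    case MemSpace::Local:
        flags = "CLK_LOCAL_MEM_FENCE";
        break;
    case MemSpace::Global:
        flags = "CLK_GLOBAL_MEM_FENCE";
        break;
    case MemSpace::Both:
        flags = "CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE";
        break;
    }
    return indent + "barrier(" + flags + ");\n";
}
```

Note that barrier() only synchronizes work-items within one work-group, whichever flags are passed.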
src/CodeGen_OpenCL_Dev.cpp
Outdated
| // compute the sum based on parallel reduction, wait on each thread after each loop step | ||
| stream << get_indent() << "for (unsigned int i = group_size_2 / 2; i > 0; i >>= 1) {;\n"; | ||
| stream << get_indent() << " if (" << print_name(loop->name) << " < i) {\n"; | ||
| stream << get_indent() << " _sum[" << print_name(loop->name) << "] += _sum[" << print_name(loop->name) << " + i];\n"; |
Probably use local_sum here.
| stream << get_indent() << " }\n"; | ||
| stream << get_indent() << " barrier(CLK_LOCAL_MEM_FENCE);\n"; | ||
| stream << get_indent() << "}\n"; |
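Then afterwards we can let only the first work-item write the result back to global memory. In the generated kernel that would look like this (print_name(loop->name) stands for the emitted thread index variable):
if (print_name(loop->name) == 0) {
    sum[0] = local_sum[0];
}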
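As a sanity check of the reduction loop in the diff plus the single write-back suggested here, the following standalone C++ sketch simulates the scheme on the CPU. It is not the generated kernel. The function name `simulate_tree_reduction` is hypothetical, the inner loop stands in for the work-items that would run in parallel, and the buffer size is assumed to be a power of two.

```cpp
#include <cstddef>
#include <vector>

// Simulates the tree reduction over a local buffer.
// Assumption: local_sum.size() is a power of two (or zero).
int simulate_tree_reduction(std::vector<int> local_sum) {
    for (std::size_t i = local_sum.size() / 2; i > 0; i >>= 1) {
        // Each tid below plays the role of one work-item with tid < i.
        // Reads touch indices >= i and writes touch indices < i, so the
        // work-items of one step do not interfere with each other.
        for (std::size_t tid = 0; tid < i; ++tid) {
            local_sum[tid] += local_sum[tid + i];
        }
        // In the kernel, barrier(CLK_LOCAL_MEM_FENCE) would go here.
    }
    // Only "work-item 0" would copy this value back to global memory.
    return local_sum.empty() ? 0 : local_sum[0];
}
```

With this layout, global memory is written once per work-group instead of once per reduction step.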
…not constantly pushing to global memory
…an actually read what is added in a parallel reduction.