Description
Do you have any thoughts on achieving vectorization via tuple_map, tuple_reduce, or any other operations in this library?
Over the years, I have experimented with implementing my own for_each_n using OpenMP/TBB and various versions of Intel's vectorization directives, as a call to for_each_n with parallel_vector_execution_policy would permit. This was unsuccessful for non-trivial loops. I could investigate that approach further; the failure was probably due to limitations of older versions of icpc and my own lack of knowledge about vectorization. I could also just wait and try out implementations of parallel_vector_execution_policy...
However, your library got me thinking that vectorization could perhaps be achieved more explicitly: instead of vectorizing entire loops, vectorize inside each iteration of these loops, one tuple operation at a time (implementable by calls to tuple_map, tuple_reduce, etc.). Typical tuple sizes are currently O(10).
The general pattern (in pseudocode) I deal with is:
for(auto cell : grid) // <-- parallelize here; no luck requesting vectorization here (pragma ivdep, simd, etc.)
{
    // read tuple data in from memory -- tuple_size is O(10)
    // tuple_map    <-- vectorize here
    // tuple_reduce <-- vectorize here
    // tuple_map    <-- vectorize here
    // etc.
    // store tuple data to memory
}
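To make the per-iteration idea concrete, here is a minimal sketch of what I have in mind. It assumes the tuple is homogeneous, so I model it as a std::array<T, N>. The names array_map, array_sum and cell_dot are hypothetical helpers I made up for illustration; they are not part of this library. Because N is a compile-time constant, the compiler may unroll and vectorize the inner loops. The omp simd pragmas are only a hint, and are ignored unless something like -fopenmp-simd is enabled.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helpers for illustration only (not this library's API).
// Assumption: a homogeneous "tuple", modeled as std::array<T, N>.

// Element-wise binary map over two fixed-size arrays.
template <class T, std::size_t N, class F>
std::array<T, N> array_map(F f, const std::array<T, N>& a, const std::array<T, N>& b)
{
    std::array<T, N> out{};
    // Fixed trip count N; the pragma asks for SIMD code generation.
    #pragma omp simd
    for (std::size_t i = 0; i < N; ++i)
        out[i] = f(a[i], b[i]);
    return out;
}

// Sum reduction over a fixed-size array.
template <class T, std::size_t N>
T array_sum(const std::array<T, N>& a)
{
    T acc = T(0);
    #pragma omp simd reduction(+:acc)
    for (std::size_t i = 0; i < N; ++i)
        acc += a[i];
    return acc;
}

// One cell's worth of work: a map followed by a reduce (a dot product).
inline float cell_dot(const std::array<float, 8>& x, const std::array<float, 8>& y)
{
    auto prod = array_map([](float u, float v) { return u * v; }, x, y);
    return array_sum(prod);
}
```

The outer loop over cells would stay parallelized as before (OpenMP/TBB), calling something like cell_dot once per cell. Whether the inner loops actually vectorize would still need to be checked in the compiler's vectorization report.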