A recent change, #51901, led to a regression in the Benchstone.BenchI.Array2 benchmark on Ubuntu (but not Windows): #52316.
The core of the benchmark is the inner loop nest of the Bench function:
    for (; loop != 0; loop--) {
        for (int i = 0; i < 10; i++) {
            for (int j = 0; j < 10; j++) {
                for (int k = 0; k < 10; k++) {
                    d[i][j][k] = s[i][j][k];
                }
            }
        }
    }
The code generated for this loop is almost equivalent, modulo register allocation, before and after #51901. The difference is loop alignment: before #51901, the loop fits in two 32-byte chunks; after, it spans three 32-byte chunks. On Ubuntu, this leads to a performance regression of about 50%. Simply setting COMPlus_JitAlignLoopAdaptive=0 changes the alignment so that the inner loop again fits in two 32-byte chunks, which recovers the performance.
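To make the chunk arithmetic concrete, here is a minimal sketch of how the number of 32-byte chunks a loop touches depends on where it starts. The `LoopAlignment` class and its method names are hypothetical, and the 60-byte size and offsets in the usage note are made-up illustration values, not measurements from this benchmark. This is not the JIT's actual implementation.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the JIT.
static class LoopAlignment
{
    // Number of chunkSize-byte chunks touched by code occupying
    // the byte range [startOffset, startOffset + sizeInBytes).
    public static int ChunksSpanned(long startOffset, int sizeInBytes, int chunkSize = 32)
    {
        if (startOffset < 0 || sizeInBytes <= 0 || chunkSize <= 0)
            throw new ArgumentOutOfRangeException();

        long firstChunk = startOffset / chunkSize;
        long lastChunk = (startOffset + sizeInBytes - 1) / chunkSize;
        return (int)(lastChunk - firstChunk + 1);
    }

    // Padding bytes needed so the code starts on the next chunk boundary
    // (0 if startOffset is already on a boundary).
    public static int PaddingToNextBoundary(long startOffset, int chunkSize = 32)
    {
        if (startOffset < 0 || chunkSize <= 0)
            throw new ArgumentOutOfRangeException();

        return (int)((chunkSize - startOffset % chunkSize) % chunkSize);
    }
}
```

For an assumed 60-byte loop, `LoopAlignment.ChunksSpanned(0, 60)` returns 2, while `LoopAlignment.ChunksSpanned(8, 60)` returns 3: the same code crosses one more chunk boundary only because of its start offset. In that second case `LoopAlignment.PaddingToNextBoundary(8)` returns 24, the padding that would bring the loop back to two chunks.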
This is a high-weight basic block; perhaps the alignment heuristics should "try harder" and be willing to insert more alignment padding when it might be profitable?