Description
Hello,
I'm currently working on improving the scalability of the OpenBLAS pthread backend. I've observed that even when a BLAS call needs only 8 threads on a 64-core machine, it still locks all available resources through level3_lock in level3_thread.c. The lock is released only after the call completes, so CPU utilization is poor (8 of 64 cores, approximately 12.5%).
My goal is to maximize CPU resource utilization, ideally reaching close to 100%. To achieve this, I have a theoretical concept in mind and would greatly appreciate community suggestions and insights.
The Idea:
Instead of holding a plain mutex in level3_thread.c, I propose a locking mechanism with a conditional wait. A BLAS call would reserve only the CPUs it needs, so further BLAS calls can proceed until all CPUs are in use. A call that cannot get enough CPUs waits on a condition. When a BLAS operation completes, it releases its CPUs and signals the waiting threads, which then check resource availability again. Resource allocation and deallocation would be managed through a thread-safe mechanism.
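To make the idea concrete, here is a minimal sketch of the accounting part only, written with a pthread mutex and condition variable. It is not OpenBLAS code: `core_pool_t`, `core_pool_init`, `core_pool_acquire`, and `core_pool_release` are hypothetical names I made up for illustration, and the sketch only counts free CPUs. It does not show how reserved CPUs would map to actual worker threads.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical CPU pool (illustration only, not part of OpenBLAS). */
typedef struct {
    pthread_mutex_t lock;   /* protects `available` */
    pthread_cond_t  freed;  /* signalled whenever CPUs are returned */
    int total;              /* number of CPUs managed by the pool */
    int available;          /* CPUs not currently reserved */
} core_pool_t;

void core_pool_init(core_pool_t *p, int total) {
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->freed, NULL);
    p->total = total;
    p->available = total;
}

/* Block until `n` CPUs are free, then reserve them. */
void core_pool_acquire(core_pool_t *p, int n) {
    assert(n > 0 && n <= p->total);
    pthread_mutex_lock(&p->lock);
    /* Loop guards against spurious wakeups and against another
       waiter taking the CPUs first. */
    while (p->available < n)
        pthread_cond_wait(&p->freed, &p->lock);
    p->available -= n;
    pthread_mutex_unlock(&p->lock);
}

/* Return `n` CPUs and wake every waiter so each can re-check. */
void core_pool_release(core_pool_t *p, int n) {
    pthread_mutex_lock(&p->lock);
    p->available += n;
    assert(p->available <= p->total);
    pthread_cond_broadcast(&p->freed);
    pthread_mutex_unlock(&p->lock);
}
```

In this sketch a BLAS call would call `core_pool_acquire(&pool, 8)` before dispatching its work and `core_pool_release(&pool, 8)` afterwards. I used broadcast rather than signal because waiters may need different CPU counts. One open question with this scheme is fairness: a call that needs many CPUs could be starved by a steady stream of small calls.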
I'm seeking feedback on the feasibility and effectiveness of this approach. Are there any potential oversights or inaccuracies in my understanding? I'm open to any insights or suggestions for further improvement.