Ideation on making Pthread more scalable #4645

@shivammonaka

Hello,

I'm currently working on improving the scalability of the OpenBLAS pthread code path. I've observed that even when a BLAS call needs only 8 threads on a 64-core machine, it still locks all available resources by taking level3_lock in level3_thread.c. The lock is released only after the call completes, so no other BLAS call can make progress in the meantime and CPU utilization is poor (approximately 12.5%, i.e. 8 of 64 cores).

My goal is to maximize CPU resource utilization, ideally reaching close to 100%. To achieve this, I have a theoretical concept in mind and would greatly appreciate community suggestions and insights.

The Idea:
Instead of holding a plain mutex in level3_thread.c for the whole call, I propose a lock combined with a conditional wait (a condition variable). Each BLAS call would reserve only the CPUs it needs, so further BLAS calls can proceed until all CPUs are in use. When a BLAS operation completes, it releases its CPUs and signals the waiting threads, which then re-check whether enough resources are available. The allocation and deallocation of CPUs would be managed through a thread-safe mechanism.
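
To make the idea concrete, here is a minimal sketch of such a mechanism: a counter of free CPUs guarded by a mutex and a condition variable. All names here (core_pool_t, core_pool_init, core_pool_acquire, core_pool_release) are hypothetical and do not exist in OpenBLAS; the sketch only tracks how many CPUs are free, not which ones, and assumes the caller knows how many threads the call will use.

```c
#include <pthread.h>

/* Hypothetical CPU budget shared by all BLAS calls (not an OpenBLAS API). */
typedef struct {
    pthread_mutex_t lock;       /* protects `available` */
    pthread_cond_t  freed;      /* signaled whenever CPUs are released */
    int             available;  /* CPUs not currently reserved */
    int             total;      /* size of the pool, e.g. 64 */
} core_pool_t;

/* Initialize the pool with `total` CPUs. Returns 0 on success, -1 on error. */
static int core_pool_init(core_pool_t *p, int total) {
    p->total = total;
    p->available = total;
    if (pthread_mutex_init(&p->lock, NULL) != 0) return -1;
    if (pthread_cond_init(&p->freed, NULL) != 0) {
        pthread_mutex_destroy(&p->lock);
        return -1;
    }
    return 0;
}

/* Block until `want` CPUs are free, then reserve them.
 * Returns -1 for a request that could never be satisfied. */
static int core_pool_acquire(core_pool_t *p, int want) {
    if (want < 1 || want > p->total) return -1;
    pthread_mutex_lock(&p->lock);
    /* Loop rather than `if`: wakeups can be spurious, and another
     * waiter may have taken the CPUs first. */
    while (p->available < want)
        pthread_cond_wait(&p->freed, &p->lock);
    p->available -= want;
    pthread_mutex_unlock(&p->lock);
    return 0;
}

/* Return `count` CPUs to the pool and wake every waiter so each
 * re-checks whether its own request now fits. */
static void core_pool_release(core_pool_t *p, int count) {
    pthread_mutex_lock(&p->lock);
    p->available += count;
    pthread_cond_broadcast(&p->freed);
    pthread_mutex_unlock(&p->lock);
}
```

A BLAS call needing 8 threads would call core_pool_acquire(&pool, 8) before dispatching its work and core_pool_release(&pool, 8) afterwards, so up to eight such calls could run concurrently on 64 cores. Two caveats with this sketch: a large request can be starved by a steady stream of small ones, and if level3_lock also protects shared state other than the CPU count, that state would still need its own protection.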

I'm seeking feedback on the feasibility and effectiveness of this approach. Are there any potential oversights or inaccuracies in my understanding? I'm open to any insights or suggestions for further improvement.
