@tkonolige

Add functions to estimate peak flops and bandwidth for CUDA. Add a new registration mechanism to the roofline analysis to support adding any target. This mechanism uses generic functions with overrides. New targets only need to add `estimate_peak_bandwidth` and `estimate_peak_flops` functions.

Also fix cuda codegen and tensorcore_infer_fragment.cc to support filling matrix_a and matrix_b fragments.
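
The "generic functions with overrides" idea in the description can be illustrated with a small, self-contained sketch. This is not the code from this PR and does not use TVM's API: the `GenericFunc` class, the string target kinds, and the constant return values are all hypothetical, chosen only to show how a new target would plug in by registering the two estimator functions.

```python
from typing import Callable, Dict


class GenericFunc:
    """Hypothetical helper: a default implementation plus per-target overrides."""

    def __init__(self, default: Callable[..., float]):
        self._default = default
        self._overrides: Dict[str, Callable[..., float]] = {}
        self.__name__ = default.__name__

    def register(self, target_kind: str):
        """Return a decorator that installs an override for `target_kind`."""
        def decorator(func: Callable[..., float]) -> Callable[..., float]:
            self._overrides[target_kind] = func
            return func
        return decorator

    def __call__(self, target_kind: str, *args, **kwargs) -> float:
        # Dispatch to the override if one exists, else fall back to the default.
        impl = self._overrides.get(target_kind, self._default)
        return impl(target_kind, *args, **kwargs)


@GenericFunc
def estimate_peak_flops(target_kind: str) -> float:
    raise NotImplementedError(f"no peak flops estimator for {target_kind!r}")


@GenericFunc
def estimate_peak_bandwidth(target_kind: str) -> float:
    raise NotImplementedError(f"no peak bandwidth estimator for {target_kind!r}")


# A new target only needs to register these two functions.
# The constants are placeholders; a real estimator would measure the device.
@estimate_peak_flops.register("cuda")
def _cuda_peak_flops(target_kind: str) -> float:
    return 41.2e12  # FLOP/s, placeholder value


@estimate_peak_bandwidth.register("cuda")
def _cuda_peak_bandwidth(target_kind: str) -> float:
    return 420e9  # bytes/s, placeholder value
```

With this shape, the roofline code can call `estimate_peak_flops("cuda")` without knowing anything about the target, and an unregistered target fails loudly with `NotImplementedError` rather than silently reporting a wrong peak.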

@AndrewZhaoLuo AndrewZhaoLuo self-requested a review July 27, 2022 20:20
@AndrewZhaoLuo

Will take a look tomorrow


@AndrewZhaoLuo AndrewZhaoLuo left a comment


Need to grok the tensorcore stuff a bit, but seems good so far. On my 3070 I get 420 GB/s bandwidth vs. the 448 GB/s advertised. For the TFLOPS I actually get more than the 40.6 TFLOPS advertised (I get 41.2 TFLOPS, which seems close enough).
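
To see what these two measured peaks mean for a roofline plot, here is a minimal sketch of the standard roofline formula, attainable throughput = min(peak compute, arithmetic intensity × peak bandwidth). It is an illustration only, not the analysis code from this PR; the function names are hypothetical, and it assumes the bandwidth figure in the comment above is in gigabytes per second.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where a kernel stops being memory bound."""
    return peak_flops / peak_bandwidth


def attainable_flops(
    arithmetic_intensity: float, peak_flops: float, peak_bandwidth: float
) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


# Measured values quoted in the review comment (units are an assumption).
PEAK_FLOPS = 41.2e12      # FLOP/s
PEAK_BANDWIDTH = 420e9    # bytes/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
print(f"ridge point: {ridge:.1f} FLOP/byte")

# A kernel doing 10 FLOP per byte moved sits on the memory roof.
print(f"bound at 10 FLOP/byte: {attainable_flops(10, PEAK_FLOPS, PEAK_BANDWIDTH):.3e} FLOP/s")
```

Under these assumptions the ridge point is roughly 98 FLOP/byte, so kernels with lower arithmetic intensity are limited by bandwidth rather than by compute.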

Tristan Konolige added 6 commits July 29, 2022 08:56
@AndrewZhaoLuo AndrewZhaoLuo merged commit 961a7c7 into apache:main Jul 30, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [ROOFLINE] Add CUDA support to roofline analysis

* formatting

* move statement back inside loops

* print out report for debugging

* default to avx2

* review comments