Conversation
README.md
Outdated
Not only causal, but also chunked and local.
It's an upper bound - it doesn't mean it's actually achievable, right?
Yes, should we add something like this: "While achieving 100% is not practical due to many factors, the MFU score effectively shows how much room is left for optimization."
Done!
Note we've gotten 70% MFU before on v5p, and I've heard of 80%+ MFU (even in bf16, probably also on v5p), so it's theoretically possible to get pretty close.
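For context on the numbers being discussed, MFU is typically computed as the achieved model FLOPs per second divided by the hardware's peak FLOPs per second. A minimal sketch (the function name and arguments are illustrative, not MaxText's actual API):

```python
def mfu(model_flops_per_step: float, step_time_s: float,
        peak_hw_flops_per_s: float) -> float:
    """Model FLOPs utilization: achieved model FLOPs/s over peak hardware FLOPs/s."""
    achieved_flops_per_s = model_flops_per_step / step_time_s
    return achieved_flops_per_s / peak_hw_flops_per_s

# E.g. a step doing 7e14 model FLOPs in 1 s on hardware with a 1e15 FLOP/s
# peak gives an MFU of 0.7, i.e. the 70% figure mentioned above.
utilization = mfu(model_flops_per_step=7e14, step_time_s=1.0,
                  peak_hw_flops_per_s=1e15)
```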
Do we want to say anything more about local or chunked attention?
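One way to frame local (sliding-window) attention in the same FLOP-counting terms: each query attends to at most `window` keys instead of all `seq_len` keys, so the quadratic term shrinks proportionally. A rough sketch under that assumption (names are hypothetical, not from the repo):

```python
def local_attention_flop_fraction(seq_len: int, window: int) -> float:
    """Rough fraction of full-attention FLOPs used by sliding-window attention.

    Assumes each query attends to at most `window` keys, so the S*S term
    in the attention FLOP count becomes roughly S*min(window, S).
    """
    attended_keys = min(window, seq_len)
    return attended_keys / seq_len
```

So a 1024-token window over a 4096-token sequence would use roughly a quarter of the full-attention FLOPs; chunked attention would admit a similar per-chunk accounting.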
README.md
Outdated
dividing the attention the flops -> dividing the attention flops (no the)
Description
Add a README section discussing model FLOPs utilization (MFU): its definition and how we report it.
We may want to add further sections in the future (e.g. hardware utilization or memory usage).
This is meant to help clarify the recent change to our attention FLOP calculation (accounting for causality) in #1988.
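The causality accounting referenced above can be sketched as follows: with a causal mask, each query attends only to itself and earlier positions, so roughly half of the score matrix is masked out and the attention FLOPs are about half those of full attention. This is an illustrative sketch, not the repo's actual implementation:

```python
def attention_flops(batch: int, seq_len: int, num_heads: int,
                    head_dim: int, causal: bool = True) -> float:
    """Approximate attention FLOPs; halved under a causal mask.

    QK^T and the attention-weighted V matmul each cost about
    2 * B * H * S * S * D FLOPs (multiply + add), giving the factor of 4.
    """
    full = 4 * batch * num_heads * seq_len * seq_len * head_dim
    # Causal masking zeroes roughly half the S x S score matrix.
    return full / 2 if causal else full
```

The halving is itself an approximation (the diagonal makes it slightly more than half), which is part of why the README frames MFU relative to an upper bound rather than an exact figure.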
Note for reviewers: click on "Display rich diff" to see the resulting markdown: https://screenshot.googleplex.com/9WxhjW8EV6PWJ9B
Tests
N/A (README-only change)
Checklist
Before submitting this PR, please make sure (put X in square brackets):