Add FAQ.md to provide sample estimation guidelines #408
b31ff34 to
1f568eb
Compare
For each knowledge leaf node, the formula to estimate the number of produced synthetic samples in the training dataset is:

```text
(total cumulative size of knowledge documents / max document chunk size) * number of qna pairs in the knowledge file leaf node * 30 synthetic samples per qna pair
```
Does this factor in the amount of samples that will be filtered out?
@RobotSail this is an excellent point. It does not, since that part is non-deterministic to a certain extent. But folks have been looking for some guidance on ballpark numbers and answers to general questions about how the taxonomy is processed through SDG.
I should call that out as a disclaimer and will do that
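For a rough illustration of the formula under discussion, here is a minimal Python sketch. The function and parameter names are hypothetical and not part of SDG. It assumes the document size and the chunk size are measured in the same unit, and it applies the formula literally without rounding the chunk count. As noted in this thread, it does not account for samples that get filtered out, so the result is only a ballpark figure.

```python
def estimate_synthetic_samples(
    total_doc_size: float,
    max_chunk_size: float,
    num_qna_pairs: int,
    samples_per_qna: int = 30,  # the "30 synthetic samples per qna pair" from the formula
) -> float:
    """Ballpark estimate of synthetic samples for one knowledge leaf node.

    Hypothetical helper: it mirrors the FAQ formula term by term and
    ignores any samples removed later by filtering.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    # (total cumulative size of knowledge documents / max document chunk size)
    num_chunks = total_doc_size / max_chunk_size
    # ... * number of qna pairs * synthetic samples per qna pair
    return num_chunks * num_qna_pairs * samples_per_qna


# Example with made-up numbers: 10,000 units of documents, 1,000-unit chunks, 5 qna pairs
estimate = estimate_synthetic_samples(10_000, 1_000, 5)
print(f"Estimated samples before filtering: {estimate:.0f}")
```

With these made-up inputs the formula gives 10 chunks * 5 qna pairs * 30 samples per pair; the actual count after filtering would be lower by an amount that is not predictable in advance.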
b03fc41 to
9cbd2ac
Compare
The new FAQ.md file includes detailed explanations and examples on how to estimate the number of synthetic samples produced at various stages of the SDG training process. This addition aims to enhance user understanding of the sample generation methodology. Signed-off-by: Tyler Lisowski <lisowski@us.ibm.com>
These are a good initial set of FAQs that we have seen pop up with what I feel (based on my studies of the codebase) are the appropriate answers. More than happy to adjust anything that is inaccurate though based on expert opinion!
bbrowning
left a comment
Thanks for expanding our docs! I have a couple of comments, but also a more general question. Do you think it would be better to document most of these things in the instructlab/instructlab repository instead of directly in SDG? The reason I ask is that the questions touch on taxonomy, SDG, training, and the intersection of all these. And the target users here would likely be using the ilab CLI to run these workflows as opposed to SDG directly?
@bbrowning sorry for the delay! Thank you so much for the review! I agree with you that maybe there is a better home for this page. I will go ahead and get the comments addressed so we can make sure we are all comfortable with the content, and then we can think about where we want its home to be!
Signed-off-by: Ben Browning <bbrownin@redhat.com>
bbrowning
left a comment
I added in the requested changes (cleaning up links, fixing typo) as an additional commit and will go ahead and approve to get this merged.
Thank you so much @bbrowning for the assistance! Sorry for the delay, and I hope everything is wonderful with you!
All is well, and you're welcome! Sorry it took us so long to get this in, but thank you for the contribution!
I believe this is an MVP toward resolving #307.