
(Draft) Add DLA function to utils#466

Closed
VasilGeorgiev39 wants to merge 3 commits into TransformerLensOrg:dev from VasilGeorgiev39:add-dla-to-utils

Conversation

@VasilGeorgiev39
Contributor

Description

DLA is usually the first step we do in a new exploration. I think it would be nice to have a common function that does it in a single step.

Let me know if you think this does not generalize well enough or if you have other concerns.

Not sure if Utils is the right place for it tho, maybe we can create a new module that will hold the mech interp toolkit?

If it looks good I'll write tests and stuff.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@alan-cooney
Collaborator

Thanks for starting on this - it seems useful, and I agree that it should be its own file (probably just for DLA, as it'll become quite large once it's fully documented etc.).

In general I agree as well that it's probably worth expanding this a bit to work more generally. Specifically, you can break DLA down recursively, e.g. by attention layer -> attention head -> source layer -> source component... It would be nice to have this as well.

Hope that makes sense and if you are unsure about how to abstract more I'm happy to have a chat about it!
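
The first two levels of that breakdown (attention layer -> attention head) can be illustrated on toy data. This is only a minimal sketch under stated assumptions, not TransformerLens code: every array and name below is hypothetical, the logit-difference direction is taken as given, and layer norm is ignored so that the decomposition is exactly linear. The "source layer" and "source component" levels are not covered.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, d_model = 2, 3, 8

# Hypothetical component outputs at the final position, each already
# expressed in residual-stream space (shape ends in d_model).
head_out = rng.normal(size=(n_layers, n_heads, d_model))
mlp_out = rng.normal(size=(n_layers, d_model))
embed = rng.normal(size=d_model)

# Assumed-given direction for "correct logit minus incorrect logit".
direction = rng.normal(size=d_model)

# An attention layer's output is the sum of its heads' outputs.
attn_out = head_out.sum(axis=1)  # (n_layers, d_model)

# The final residual stream is the sum of every component's output.
resid_final = embed + attn_out.sum(axis=0) + mlp_out.sum(axis=0)

# Attribution at each level is a dot product with the same direction.
total = resid_final @ direction
per_attn_layer = attn_out @ direction  # (n_layers,)
per_head = head_out @ direction        # (n_layers, n_heads)
per_mlp = mlp_out @ direction          # (n_layers,)
embed_contrib = embed @ direction
```

Because every step is linear, `per_head.sum(axis=1)` matches `per_attn_layer`, and the embedding, attention, and MLP contributions add up to `total`. With a real model the final layer norm breaks exact linearity unless its scale is frozen and applied to each component first.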

@VasilGeorgiev39
Contributor Author

Hi @alan-cooney, thanks for the comment. I have a couple questions:

I can get the attention head contributions (or even the MLP neurons) with get_full_resid_decomposition(); however, I can get the correct and incorrect directions only for the residual stream with tokens_to_residual_directions(). How can I get the directions for the individual heads (or even neurons)?

Also, what do you mean by break down by 'source layer' and 'source component'?
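
One way to think about the directions question, offered as a hedged sketch rather than an answer from the maintainers: every head and neuron writes into the same residual stream, so the residual-stream direction can be reused for each component. The toy example below assumes a hypothetical unembedding matrix `W_U`, made-up token indices, and no layer norm or unembedding bias; none of these names come from the PR.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_vocab, n_components = 8, 20, 5

# Hypothetical unembedding matrix and per-component residual outputs
# (a "component" could be a head or a neuron's output vector).
W_U = rng.normal(size=(d_model, d_vocab))
components = rng.normal(size=(n_components, d_model))

# Made-up token ids for the correct and incorrect answers.
correct_token, incorrect_token = 3, 7

# One shared residual-space direction for the logit difference.
direction = W_U[:, correct_token] - W_U[:, incorrect_token]

# Each component's attribution uses that same direction.
contributions = components @ direction  # (n_components,)

# Reference value: unembed the summed residual stream directly.
logits = components.sum(axis=0) @ W_U
logit_diff = logits[correct_token] - logits[incorrect_token]
```

Here `contributions.sum()` equals `logit_diff`, which is why no per-head direction is needed in this linear setting.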

@bryce13950
Collaborator

@VasilGeorgiev39 Are you still available to wrap this up?

@VasilGeorgiev39
Contributor Author

@bryce13950 Yes, I will be available after the 9th of May. What do you think would be the best approach for this?

@bryce13950
Collaborator

I am not quite sure. Alan has been pulled away for his full-time job in the last few months. I have reached out to him separately to see if he can clarify the comments on this, but I haven't heard back via Slack. I don't really get what he means by source layer and source component either. Maybe we can start by turning it into its own module, and then see where it can be generalized. I do like your idea of setting it up as a tool, and I am likely going to be doing just that in another context. Do you want to move this into its own module in a directory named tools?

@bryce13950 bryce13950 changed the base branch from main to dev May 23, 2024 00:36
@jlarson4
Collaborator

Unfortunately, this PR is much too far behind the current state of the repository. I am going to open an issue for this feature as something that could be added in the future as a tool under the tools directory.


4 participants