Sync transformers and accelerate versions #562

Merged

michaelbenayoun merged 52 commits into main from sync_transformers_and_accelerate on May 16, 2024

Conversation

@michaelbenayoun (Member) commented Apr 10, 2024

What does this PR do?

This PR synchronizes optimum-neuron with more recent transformers and accelerate versions:

  • accelerate==0.29.2, the latest release at the time this PR was opened,
  • transformers==4.40.2, which will be the latest release when this PR is merged.

Related PR in transformers: huggingface/transformers#30259

On top of that:

  • The workflows for Trainium instances have been updated and now run on K8s (Kubernetes).

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@michaelbenayoun michaelbenayoun mentioned this pull request Apr 12, 2024
@michaelbenayoun michaelbenayoun marked this pull request as ready for review May 15, 2024 12:23
@michaelbenayoun michaelbenayoun requested a review from dacorvo May 15, 2024 12:23
@dacorvo (Collaborator) left a comment

Awesome! Looks good to me, but I found a few nits!

Comment thread .github/workflows/test_trainium_common.yml Outdated
Comment thread optimum/neuron/distributed/decoder_models.py Outdated
tr_loss_div = tr_loss / dp_size

xm.mark_step()
Collaborator

I love when these lines pop magically to solve a sync issue ... 😜
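For background on why an added `xm.mark_step()` can resolve a sync issue: torch_xla records operations lazily and only executes the accumulated graph at a step marker, so inserting a marker forces pending work to run before the host reads a value. Here is a minimal, purely illustrative sketch of that lazy-execution model — `LazyTensor` and the local `mark_step` are hypothetical stand-ins, not the torch_xla API:

```python
class LazyTensor:
    """Toy lazy tensor: records pending ops; nothing runs until flushed."""

    def __init__(self, value):
        self.value = value
        self.pending = []  # queued (op, arg) pairs, not yet executed

    def div(self, d):
        self.pending.append(("div", d))  # queue the op instead of running it
        return self

    def materialize(self):
        # Flush: apply every queued op, mimicking a graph cut at a step marker.
        for op, arg in self.pending:
            if op == "div":
                self.value /= arg
        self.pending.clear()
        return self.value


def mark_step(tensors):
    """Stand-in for xm.mark_step(): force execution of all pending ops."""
    return [t.materialize() for t in tensors]


tr_loss = LazyTensor(8.0)
dp_size = 4
tr_loss.div(dp_size)           # queued, not yet executed
print(len(tr_loss.pending))    # 1 op still pending
print(mark_step([tr_loss]))    # [2.0] after the flush
```

Until `mark_step` runs, the host-side value is stale — which is exactly the kind of mismatch a well-placed step marker fixes in the real training loop.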

Comment thread optimum/neuron/trainers.py Outdated
ignore_keys_for_eval=ignore_keys_for_eval,
**kwargs,
)
# with hub_neuronx_cache("training", entry=self.model_cache_entry):
Collaborator

Are you sure you don't want to fetch from the cache here? If so, you should remove the commented line.

Member Author

Fixed; it was a leftover from a quick test.
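The commented-out line wrapped the training call in a cache context manager, so compiled artifacts could be fetched and stored around the run. As a hedged sketch of that general pattern only — `compilation_cache` below is a hypothetical stand-in, not the actual `hub_neuronx_cache` API:

```python
from contextlib import contextmanager


@contextmanager
def compilation_cache(mode, entry=None):
    """Hypothetical cache scope; not the optimum-neuron implementation."""
    cache = {"mode": mode, "entry": entry, "hits": 0}
    try:
        yield cache  # code inside the block can consult the cache
    finally:
        # A real implementation would flush/upload compiled artifacts here.
        pass


def train(cache=None):
    # Pretend we fetched a precompiled artifact from the cache.
    if cache is not None:
        cache["hits"] += 1
    return "done"


with compilation_cache("training", entry="my-model-entry") as cache:
    result = train(cache)

print(result, cache["hits"])  # done 1
```

The point of the pattern is that everything compiled inside the `with` block is attributed to one cache entry, and cleanup/upload happens in one place when the block exits.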

Comment thread tests/distributed/test_model_parallelization.py Outdated
Comment thread tests/test_trainers.py Outdated
@dacorvo (Collaborator) commented May 15, 2024

You should unpin the safetensors package here because there is now a conflict:
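On the unpinning point: an exact pin like `safetensors==X` conflicts with any other dependency that requires a newer release, while a lower bound leaves the resolver room to pick a mutually compatible version. A minimal illustration of the two specifier styles — the version numbers are invented for the example, not taken from this PR:

```python
# Illustrative dependency lists in setup.py style; versions are examples only.
pinned = [
    "safetensors==0.4.2",  # exact pin: breaks if another dep needs a newer release
]
unpinned = [
    "safetensors>=0.4.2",  # lower bound: the resolver can satisfy both packages
]
print(pinned[0], "->", unpinned[0])
```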

@dacorvo (Collaborator) left a comment

LGTM, thanks! If you can fix the seq2seq tracing issue before merging this pull request, that would be even better.

@michaelbenayoun (Member, Author) commented May 16, 2024

I fixed all but one test:

  • tests/generation/test_tnx_llama.py::test_decoder_generation_multiple_eos_token_ids

@michaelbenayoun michaelbenayoun merged commit d15c130 into main on May 16, 2024
@michaelbenayoun michaelbenayoun deleted the sync_transformers_and_accelerate branch May 16, 2024 13:22