
Conversation

@jlamypoirier jlamypoirier commented Nov 25, 2021

  • Fix to allow setting num_layers=0
  • Improved logging (log on all workers, use logging module)
  • Add an option --log-scales to log scales and gradient scales for parameters and activations.
  • Give names to parameters (and some modules). These names match those in LLM, so they can be used as a basis for loading LLM parameters/checkpoints; they are not the same as Megatron's state-dict "keys", which are difficult to follow.
  • Fix a bug where the dataset index is not found on some workers (see Added torch distributed barrier #2)
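The "log on all workers" bullet can be sketched with the standard `logging` module; the rank-prefixed format and the `RANK` environment variable lookup are illustrative assumptions, not the PR's exact configuration:

```python
import logging
import os

def setup_worker_logging() -> logging.Logger:
    # Assumed: the worker rank comes from the usual torch.distributed env var.
    rank = int(os.environ.get("RANK", 0))
    # Prefix every record with the rank so interleaved worker logs stay readable.
    logging.basicConfig(
        level=logging.INFO,
        format=f"[rank {rank}] %(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger("megatron")
```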

# It can take some time for the file to be visible on other nodes.
# Poll once per second, for up to 120 s.
import time
for _ in range(120):
    if indexmap_filename.is_file():
        break
    time.sleep(1)
Collaborator:

Maybe add a log message, so the job doesn't look stuck (say, every 10 s).

Collaborator (Author):

Is it really necessary? Worst case, it will crash after 2 minutes.

Collaborator (Author):

Added a log message and copied the fix to the other datasets.
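The polling-with-periodic-logging pattern agreed on above can be sketched as follows; the helper name `wait_for_file` and the exact log wording are assumptions for illustration, not the PR's actual code:

```python
import logging
import time
from pathlib import Path

logger = logging.getLogger(__name__)

def wait_for_file(path: Path, timeout_s: int = 120, log_every_s: int = 10) -> bool:
    """Poll until `path` exists, logging periodically so the job doesn't look stuck."""
    for elapsed in range(timeout_s):
        if path.is_file():
            return True
        if elapsed and elapsed % log_every_s == 0:
            logger.info("Still waiting for %s (%d s elapsed)", path, elapsed)
        time.sleep(1)
    # One last check after the final sleep.
    return path.is_file()
```

If the file never appears, the caller can raise after the timeout instead of hanging indefinitely.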

@jlamypoirier jlamypoirier marked this pull request as ready for review November 26, 2021 15:43
@jlamypoirier jlamypoirier requested a review from Am1n3e November 26, 2021 15:43
@jlamypoirier jlamypoirier merged commit 560f1a4 into llm-custom Dec 14, 2021
@jlamypoirier jlamypoirier deleted the measure_scales branch December 14, 2021 15:25
3 participants