
Conversation

@sayakpaul (Member)

@HuggingFaceDocBuilderDev commented Jul 21, 2023

The documentation is not available anymore as the PR was closed or merged.


```python
# fingerprint used by the cache for the other processes to load the result
# details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
new_fingerprint = Hasher.hash(args)
```
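For context, a minimal, self-contained sketch of the fingerprint plumbing in `datasets`: `Hasher.hash` reduces the parsed args to a deterministic string, and `Dataset.map` accepts it as a cache key via `new_fingerprint`. The `args` dict and toy dataset below are made-up stand-ins, not the training script's actual values.

```python
from datasets import Dataset
from datasets.fingerprint import Hasher

# Stand-in for the script's parsed CLI args; any picklable object works.
args = {"pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0", "resolution": 1024}

# Deterministic: every process hashing the same args derives the same string.
new_fingerprint = Hasher.hash(args)

ds = Dataset.from_dict({"text": ["a photo of a cat", "a photo of a dog"]})
# The fingerprint identifies the result of this map call in the cache.
ds = ds.map(lambda batch: batch, batched=True, new_fingerprint=new_fingerprint)
```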
@williamberman (Contributor) commented Jul 21, 2023


Are the args actually going to be good enough to create a hash? Multiple runs of the script might have the same args. Is that ok? I'm not sure; I haven't thought it through well enough.

Ideally we can get the PID of the parent process if the parent process is accelerate and hash that. If the parent process is not accelerate, we don't have to pass any additional fingerprint
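A rough sketch of that idea, purely illustrative: the `psutil`-based check for an accelerate parent is an assumption, not something the PR implements.

```python
import os

import psutil  # third-party, used only to inspect the parent process
from datasets.fingerprint import Hasher

parent_cmdline = psutil.Process(os.getppid()).cmdline()
launched_by_accelerate = any("accelerate" in part for part in parent_cmdline)

if launched_by_accelerate:
    # Every forked worker shares the same parent, so they all derive the same fingerprint.
    new_fingerprint = Hasher.hash(os.getppid())
else:
    # Single-process run: let `datasets` compute the fingerprint on its own.
    new_fingerprint = None
```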

@sayakpaul (Member Author)


Resorting to @lhoestq again.

@sayakpaul (Member Author)


> Multiple runs of the script might have the same args. Is that ok? I'm not sure; I haven't thought it through well enough.

If that is the case, we would want to avoid the execution of the map fn and instead load from the cache, no? Or will there be undesired consequences of that?

In any case, coming to your suggestion:

> Ideally we can get the PID of the parent process if the parent process is accelerate and hash that. If the parent process is not accelerate, we don't have to pass any additional fingerprint

Are you thinking of something like:

```python
import os

with accelerator.main_process_first():
    from datasets.fingerprint import Hasher

    # fingerprint used by the cache for the other processes to load the result
    # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
    if accelerator.is_main_process:
        pid = os.getpid()
    # note: as written, `pid` is only bound on the main process
    new_fingerprint = Hasher.hash(pid)
    train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint)
```

@williamberman (Contributor)


Up to you if it's ok to reload from the cache when calling map; I'm not as familiar with the script :) If it is ok, it would be nice to have a comment in the code on under what circumstances and why.

I'm not familiar with the accelerate forking model, nor whether one of the scripts itself ends up being the parent process or a separate accelerate script forks into the children. If you need the parent PID (again, depending on who actually gets forked), you would call os.getppid.
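If the children are indeed forked by an accelerate launcher, a sketch along those lines (assuming the script's `compute_embeddings_fn` and `train_dataset`) could be:

```python
import os

from datasets.fingerprint import Hasher

# All ranks forked by the same launcher see the same parent PID, so every
# process derives an identical fingerprint with no cross-process communication.
new_fingerprint = Hasher.hash(os.getppid())
train_dataset = train_dataset.map(
    compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint
)
```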

@sayakpaul (Member Author)


@lhoestq WDYT? IMO, it should be okay to:

> If that is the case, we would want to avoid the execution of the map fn and instead load from the cache, no? Or will there be undesired consequences of that?

@lhoestq (Member) commented Jul 21, 2023


> If that is the case, we would want to avoid the execution of the map fn and instead load from the cache, no? Or will there be undesired consequences of that?

Since the args are the same, it will reload from the cache in subsequent runs instead of reprocessing the data :)
The resulting dataset doesn't depend on whether accelerate is used, no?
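To make the reload behavior concrete, a small standalone sketch (the cache file path is arbitrary; an in-memory dataset needs an explicit `cache_file_name` for the second call to be served from disk):

```python
from datasets import Dataset
from datasets.fingerprint import Hasher

fingerprint = Hasher.hash({"resolution": 1024})
ds = Dataset.from_dict({"x": list(range(10))})

def double(batch):
    return {"x": [v * 2 for v in batch["x"]]}

# First call computes the result and writes the cache file.
out = ds.map(double, batched=True, new_fingerprint=fingerprint,
             cache_file_name="/tmp/double.arrow")
# Same function, fingerprint, and cache file: reloaded instead of recomputed.
out = ds.map(double, batched=True, new_fingerprint=fingerprint,
             cache_file_name="/tmp/double.arrow")
```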

@patrickvonplaten (Contributor) left a comment


Ok for me!

@sayakpaul sayakpaul merged commit 4dcab92 into main Jul 21, 2023
@sayakpaul sayakpaul deleted the fixes/controlnet-sdxl-training branch July 21, 2023 15:25
orpatashnik pushed a commit to orpatashnik/diffusers that referenced this pull request Aug 1, 2023
* hash computation. thanks to @lhoestq

* disable dtype casting.

* remove comments.
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
* hash computation. thanks to @lhoestq

* disable dtype casting.

* remove comments.