Skip to content

Even more test data cached#40636

Merged
ydshieh merged 5 commits intomainfrom
cache_all
Sep 3, 2025
Merged

Even more test data cached#40636
ydshieh merged 5 commits intomainfrom
cache_all

Conversation

@ydshieh
Copy link
Copy Markdown
Collaborator

@ydshieh ydshieh commented Sep 3, 2025

What does this PR do?

Even more test data cached

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +48 to +50
url_to_local_path(
"https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, the name url_to_local_path is misleading, sorry.

It actually check if the file exists locally ( the root of transformers), return local path (i.e. its filename) if it is there, otherwise return the original url.

I will change the name to make it more clear.

But how we need to handle both cases in a better way, but not if (url ) ... else ... (local path) everywhere.

@zucchini-nlp Do we have some excising methods that could handle this well?

(chat template could handle both cases, but here we need some more primitive functions)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i remember looking at how url_to_local_path works. Makes sense to me

@zucchini-nlp Do we have some excising methods that could handle this well?

Not sure I got this. Do you mean a method to do url_to_local_path in chat templates?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, What I mean is:

this block

                requests.get(
                    url_to_local_path(
                        "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    )

but url_to_local_path may return the original url or the local path (if it finds).

Therefore we want to have a simple function to handle both case like here: as requests.get doesn't work with local path.

So I am wondering if you remember if we have already utility functions for doing things like this (and likely used by chat template ..?)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, no, I don't remember any existing utility we have. Maybe we could put all images/videos/audios in hub and use hf-hub download. AFAIK it should cache everything automatically?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I find

load_image

src/transformers/image_utils.py

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will use it and merge the PR

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is the same utility I use in chat templates, and it has no caching mechanism. It will keep sending requests to download every time

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean simply use it will fix the issue. But it combined with url_to_local_path will (if we have a previous step of downloading the files before the test run step).

i.e.

load_image(url_to_local_path( ... url ...))

will work well (assuming already downloaded)

while

        cls.image1 = Image.open(
            BytesIO(
                requests.get(
                    url_to_local_path(
                        "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    )
             )
        )

will fail as it can't handle the local path.

(We are talking 2 things different here. My question is more about how to handle (i.e. load things into memory) both cases in a simple way, assuming the files are already downloaded)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh sorry, I misunderstood the comment probably. Right, load_image handles all input types

@ydshieh ydshieh requested a review from zucchini-nlp September 3, 2025 07:42
Comment on lines +48 to +50
url_to_local_path(
"https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i remember looking at how url_to_local_path works. Makes sense to me

@zucchini-nlp Do we have some excising methods that could handle this well?

Not sure I got this. Do you mean a method to do url_to_local_path in chat templates?

@ydshieh ydshieh force-pushed the cache_all branch 2 times, most recently from 4885859 to e8fdb70 Compare September 3, 2025 17:18
@ydshieh
Copy link
Copy Markdown
Collaborator Author

ydshieh commented Sep 3, 2025

TODO: explore of using hf_hub_download instead of url_to_local_path. If not fit, use a better name than url_to_local_path which is confusing.

@ydshieh ydshieh enabled auto-merge (squash) September 3, 2025 17:46
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Sep 3, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, aya_vision, bridgetower, cohere2_vision, deepseek_vl, deepseek_vl_hybrid, eomt, fuyu, gemma3, gemma3n, glm4v, idefics2, idefics3, internvl, janus, kosmos2

@ydshieh ydshieh merged commit 34595cf into main Sep 3, 2025
25 checks passed
@ydshieh ydshieh deleted the cache_all branch September 3, 2025 21:20
@ydshieh
Copy link
Copy Markdown
Collaborator Author

ydshieh commented Sep 4, 2025

hf_hub_download indeed could work as it returns

    Returns:
        `str`: Local path of file or if networking is off, last version of file cached on disk.

However:

  1. can't work with a simple url string
  2. can't work with data hosted outside HF hub

For 2., we could always upload to a HF hub repo. For 1., the changes need to be done is a bit annoying.

For now, I will simply rename url_to_local_path to be less confusing, and maybe allow to specify the root output directory.

This is intended to be used only on a CI environment (CI runners) and not for local development. For now, let's not spend to much time on engineering it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants