feat(ci): added a network debug report#44636

Merged
tarekziade merged 12 commits into main from tarekziade/network-metrics
Mar 18, 2026

@tarekziade
Collaborator

What does this PR do?

  • Adds an httpx tracer to gather metrics about network calls
  • Collects and stores metrics, and generates an artifact in CI
  • Can be used locally with DEBUG_NETWORK
  • Activated in CircleCI
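
As a rough illustration of the first bullet, here is a minimal sketch of a tracer that turns trace events into per-phase timings. It is not the code from this PR. It assumes events arrive as `(event_name, info)` pairs whose names end in `.started`, `.complete`, or `.failed`, which is the shape httpx documents for its `trace` request extension. The `PhaseTimer` name and the injectable `clock` parameter are hypothetical.

```python
import time
from collections import defaultdict


class PhaseTimer:
    """Accumulates per-phase durations from (event_name, info) trace events."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer can be driven in tests
        self._started = {}  # phase name -> timestamp of its ".started" event
        self.totals_ms = defaultdict(float)

    def __call__(self, event_name, info):
        # Event names look like "connection.connect_tcp.started" or
        # "http11.receive_response_headers.complete".
        prefix, _, state = event_name.rpartition(".")
        phase = prefix.rpartition(".")[2]
        if state == "started":
            self._started[phase] = self._clock()
        elif state in ("complete", "failed") and phase in self._started:
            elapsed = self._clock() - self._started.pop(phase)
            self.totals_ms[phase] += elapsed * 1000.0
```

With httpx installed, an instance could be passed per request as `extensions={"trace": timer}`. Using one timer per request keeps overlapping requests from overwriting each other's start timestamps.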

Example of a local run:

✗ DEBUG_NETWORK=1 HUGGINGFACE_CO_STAGING=1 pytest -svx tests/utils/test_tokenization_utils.py -k test_push_to_hub_chat_templates

====================================================================================== Network debug ======================================================================================
Network debug report
Requests captured: 3
Failed requests: 1
Cumulative request time: 21488.3 ms
Phase totals: receive_response_headers=20555.5 ms, start_tls=641.7 ms, connect_tcp=284.9 ms, send_request_headers=1.2 ms, response_closed=0.5 ms, send_request_body=0.4 ms, receive_response_body=0.2 ms

Slowest requests:
 1. POST https://hub-ci.huggingface.co/api/repos/create 10002.4 ms ReadTimeout: The read operation timed out (receive_response_headers=10001.2 ms)
 2. DELETE https://hub-ci.huggingface.co/api/repos/delete 7902.7 ms status=200 (connect_tcp=147.6 ms, start_tls=283.1 ms, receive_response_headers=7469.5 ms, receive_response_body=0.1 ms)
 3. POST https://hub-ci.huggingface.co/api/repos/create 3583.2 ms status=200 (connect_tcp=137.3 ms, start_tls=358.6 ms, receive_response_headers=3084.8 ms, receive_response_body=0.1 ms)

Slowest routes:
 1. POST hub-ci.huggingface.co/api/repos/create count=2 total=13585.6 ms avg=6792.8 ms failures=1
 2. DELETE hub-ci.huggingface.co/api/repos/delete count=1 total=7902.7 ms avg=7902.7 ms failures=0
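
The "Slowest routes" section appears to group requests by method, host, and path, with the query string dropped. The sketch below shows one way such an aggregation could be computed from the three requests listed above. The function names and the record layout are assumptions for illustration, not the PR's implementation.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def route_key(method, url):
    """Group requests by method, host and path, ignoring the query string."""
    parts = urlsplit(url)
    return f"{method} {parts.netloc}{parts.path}"


def slowest_routes(records):
    """records: iterable of (method, url, duration_ms, failed) tuples."""
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "failures": 0})
    for method, url, duration_ms, failed in records:
        entry = stats[route_key(method, url)]
        entry["count"] += 1
        entry["total_ms"] += duration_ms
        entry["failures"] += int(failed)
    rows = [
        {"route": route, "avg_ms": s["total_ms"] / s["count"], **s}
        for route, s in stats.items()
    ]
    # Largest cumulative time first, as in the report above.
    return sorted(rows, key=lambda row: row["total_ms"], reverse=True)


# The three requests from the report above.
records = [
    ("POST", "https://hub-ci.huggingface.co/api/repos/create", 10002.4, True),
    ("DELETE", "https://hub-ci.huggingface.co/api/repos/delete", 7902.7, False),
    ("POST", "https://hub-ci.huggingface.co/api/repos/create", 3583.2, False),
]
for row in slowest_routes(records):
    print(
        f'{row["route"]} count={row["count"]} total={row["total_ms"]:.1f} ms '
        f'avg={row["avg_ms"]:.1f} ms failures={row["failures"]}'
    )
```

Grouping on the path alone keeps repeated calls that differ only in their query parameters under one route.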

@tarekziade tarekziade requested a review from ydshieh March 12, 2026 15:25
@tarekziade tarekziade self-assigned this Mar 12, 2026
@tarekziade
Collaborator Author

Another report example

Network debug report
Requests captured: 12
Failed requests: 0
Cumulative request time: 2399.0 ms
Phase totals: receive_response_headers=2187.1 ms, connect_tcp=126.4 ms, start_tls=63.3 ms, send_request_headers=3.5 ms, receive_response_body=2.7 ms, send_request_body=0.5 ms, response_closed=0.3 ms

Slowest requests:
 1. HEAD https://huggingface.co/hf-internal-testing/tiny-random-bert/resolve/main/tokenizer_config.json 355.6 ms status=307 (connect_tcp=126.4 ms, start_tls=63.3 ms, receive_response_headers=163.0 ms, receive_response_body=0.1 ms)
 2. GET https://huggingface.co/api/models/openai-community/gpt2/tree/main?... 344.3 ms status=200 (receive_response_headers=342.0 ms, receive_response_body=0.7 ms)
 3. GET https://huggingface.co/api/models/hf-internal-testing/tiny-random-bert/tree/main?... 220.7 ms status=200 (receive_response_headers=217.8 ms, receive_response_body=0.9 ms)
 4. GET https://huggingface.co/api/models/openai-community/gpt2/tree/main?... 189.3 ms status=200 (receive_response_headers=187.4 ms, receive_response_body=0.3 ms)
 5. HEAD https://huggingface.co/openai-community/gpt2/resolve/main/tokenizer_config.json 187.9 ms status=307 (receive_response_headers=186.8 ms, receive_response_body=0.0 ms)
 6. HEAD https://huggingface.co/api/resolve-cache/models/openai-community/gpt2/607a30d783dfa663caf39e06633721c8d4cfcd7e/tokenizer_config.json 183.1 ms status=200 (receive_response_headers=181.2 ms, receive_response_body=0.1 ms)
 7. GET https://huggingface.co/api/models/openai-community/gpt2/tree/main/additional_chat_templates?... 181.7 ms status=404 (receive_response_headers=179.9 ms, receive_response_body=0.1 ms)
 8. GET https://huggingface.co/api/models/hf-internal-testing/tiny-random-bert/tree/main/additional_chat_templates?... 172.3 ms status=404 (receive_response_headers=170.8 ms, receive_response_body=0.1 ms)
 9. HEAD https://huggingface.co/api/resolve-cache/models/hf-internal-testing/tiny-random-bert/f171d7baecaf37b5da5a3616d8833b9969753535/tokenizer_config.json 170.0 ms status=200 (receive_response_headers=168.6 ms, receive_response_body=0.0 ms)
10. HEAD https://huggingface.co/openai-community/gpt2/resolve/main/tokenizer_config.json 167.8 ms status=307 (receive_response_headers=165.7 ms, receive_response_body=0.1 ms)
11. GET https://huggingface.co/api/models/openai-community/gpt2/tree/main/additional_chat_templates?... 157.7 ms status=404 (receive_response_headers=156.5 ms, receive_response_body=0.1 ms)
12. HEAD https://huggingface.co/api/resolve-cache/models/openai-community/gpt2/607a30d783dfa663caf39e06633721c8d4cfcd7e/tokenizer_config.json 68.6 ms status=200 (receive_response_headers=67.4 ms, receive_response_body=0.0 ms)

Slowest routes:
 1. GET huggingface.co/api/models/openai-community/gpt2/tree/main count=2 total=533.5 ms avg=266.8 ms failures=0
 2. HEAD huggingface.co/openai-community/gpt2/resolve/main/tokenizer_config.json count=2 total=355.7 ms avg=177.8 ms failures=0
 3. HEAD huggingface.co/hf-internal-testing/tiny-random-bert/resolve/main/tokenizer_config.json count=1 total=355.6 ms avg=355.6 ms failures=0
 4. GET huggingface.co/api/models/openai-community/gpt2/tree/main/additional_chat_templates count=2 total=339.5 ms avg=169.7 ms failures=0
 5. HEAD huggingface.co/api/resolve-cache/models/openai-community/gpt2/607a30d783dfa663caf39e06633721c8d4cfcd7e/tokenizer_config.json count=2 total=251.7 ms avg=125.9 ms failures=0
 6. GET huggingface.co/api/models/hf-internal-testing/tiny-random-bert/tree/main count=1 total=220.7 ms avg=220.7 ms failures=0
 7. GET huggingface.co/api/models/hf-internal-testing/tiny-random-bert/tree/main/additional_chat_templates count=1 total=172.3 ms avg=172.3 ms failures=0
 8. HEAD huggingface.co/api/resolve-cache/models/hf-internal-testing/tiny-random-bert/f171d7baecaf37b5da5a3616d8833b9969753535/tokenizer_config.json count=1 total=170.0 ms avg=170.0 ms failures=0

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tarekziade tarekziade force-pushed the tarekziade/network-metrics branch 3 times, most recently from ef75dfc to 67c6933 Compare March 16, 2026 15:27
Collaborator

@ydshieh ydshieh left a comment

Very nice!

I have a nit question about the env variable name, and about whether we really need that many public functions, but neither is blocking.

One question I asked myself, and then asked Claude:

Could we minimize the changes in the conftest file? The answer is yes, but it requires some design changes. I won't request a design change in this PR; I'm just sharing what I was told, in case you are interested.

Another possibility (AI generated content)

Honestly, they are justified **given the current design** — conftest needs to orchestrate the xdist coordination itself across multiple hooks, so it necessarily needs multiple entry points into `network_logging`.

But the deeper question is whether this design is the right one. An alternative would be a **plugin class** inside `network_logging.py`:

```python
class NetworkDebugPlugin:
    def pytest_configure(self, config): ...
    def pytest_configure_node(self, node): ...
    def pytest_sessionfinish(self, session): ...
    def pytest_terminal_summary(self, terminalreporter): ...
```

Then conftest would shrink to just:

```python
# conftest.py
from transformers.utils.network_logging import register_network_debug_plugin

def pytest_configure(config):
    ...
    register_network_debug_plugin(config)  # ← single call, plugin handles everything
```

And `register_network_debug_plugin` would do:

```python
def register_network_debug_plugin(config):
    if _parse_debug_network_env()[0]:
        config.pluginmanager.register(NetworkDebugPlugin())
```

The advantages would be:

  • Public API shrinks to one function
  • All xdist coordination logic lives in one place inside network_logging.py instead of being spread across conftest hooks
  • conftest stays clean and doesn't need to know about xdist details at all

So the 5 functions are a consequence of the design choice to let conftest drive the orchestration, rather than encapsulating it inside network_logging.py itself. Either approach works, but the plugin class would be cleaner.
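
To make the plugin idea slightly more concrete, here is a hedged sketch of how the registration gate and the terminal summary hook could fit together. The env parsing rule, the `report_lines` attribute, and the `network-debug` plugin name are assumptions made for illustration. xdist coordination is deliberately left out. `pluginmanager.register`, `terminalreporter.section`, and `terminalreporter.write_line` are existing pytest APIs.

```python
import os


def _debug_network_enabled(environ=os.environ):
    # Hypothetical rule: any value other than "", "0" or "false" enables it.
    value = environ.get("DEBUG_NETWORK", "").strip().lower()
    return value not in ("", "0", "false")


class NetworkDebugPlugin:
    """Hypothetical plugin object; pytest discovers hooks by method name."""

    def __init__(self):
        self.report_lines = []  # filled by whatever collects the metrics

    def pytest_terminal_summary(self, terminalreporter):
        if not self.report_lines:
            return
        # Prints a "==== Network debug ====" separator, then the report body.
        terminalreporter.section("Network debug")
        for line in self.report_lines:
            terminalreporter.write_line(line)


def register_network_debug_plugin(config, environ=os.environ):
    """Register the plugin only when DEBUG_NETWORK is set; return it or None."""
    if not _debug_network_enabled(environ):
        return None
    plugin = NetworkDebugPlugin()
    config.pluginmanager.register(plugin, name="network-debug")
    return plugin
```

Returning the plugin instance makes it easy for tests to inspect or populate it without adding further public functions.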

Comment thread tests/utils/test_network_logging.py Outdated

from transformers.utils.network_logging import (
    clear_network_debug_report,
    disable_network_debug_report,
Collaborator

So this is not used in conftest, only for testing.

That kind of justifies it being public 👍

@tarekziade
Collaborator Author

Thanks for the review. I agree that, since this lives under src/transformers, we should keep the public API surface as small as possible, as any new entry points are likely to become part of our long-term compatibility burden.

I initially assumed this would remain internal to our CI usage, so I didn’t optimize for that, but your point makes sense.

The plugin-based approach sounds like a cleaner direction for that reason, so I’ll rework the patch to encapsulate more of the orchestration in network_logging.py and reduce what needs to be exposed from conftest.

@tarekziade tarekziade force-pushed the tarekziade/network-metrics branch from 5aaba88 to 0d13a9f Compare March 17, 2026 12:57
@tarekziade tarekziade requested a review from ydshieh March 17, 2026 13:04
@tarekziade tarekziade enabled auto-merge March 18, 2026 15:44
@tarekziade tarekziade added this pull request to the merge queue Mar 18, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 18, 2026
@tarekziade tarekziade added this pull request to the merge queue Mar 18, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Mar 18, 2026
@tarekziade tarekziade added this pull request to the merge queue Mar 18, 2026
Merged via the queue into main with commit 2513237 Mar 18, 2026
29 checks passed
@tarekziade tarekziade deleted the tarekziade/network-metrics branch March 18, 2026 19:19