fix: Upgrade to youtube-transcript-api v1.0+ by Cristhianzl · Pull Request #10456 · langflow-ai/langflow

Cristhianzl · 2025-10-30T20:35:35Z

This pull request updates the YouTube transcript analysis component to support the new youtube-transcript-api v1.0+ and refactors both the implementation and its tests for compatibility and robustness. The most important changes are grouped below:

Component and Dependency Updates

Updated the youtube-transcript-api dependency in pyproject.toml to require version >=1.0.0,<2.0.0, ensuring compatibility with the new API.
Refactored the YouTubeTranscriptsComponent in Youtube Analysis.json to use the new API, including extracting video IDs from URLs, handling transcript fetching and translation, and grouping transcript segments into time-based chunks. The implementation now robustly handles errors and supports both object and dict transcript formats.
Updated the starter project metadata to reflect new dependency versions and removed unused dependencies (langchain_community).
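To make the "extracting video IDs from URLs" step concrete, here is a minimal sketch of one way it could work. It is not the PR's `_extract_video_id()`; the function name, the accepted hosts and path prefixes, and the assumption that IDs are 11 characters from `[A-Za-z0-9_-]` are all illustrative.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: a video ID is 11 characters drawn from [A-Za-z0-9_-].
_VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")


def extract_video_id(url: str) -> str:
    """Return the video ID from a YouTube URL, or raise ValueError."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]

    candidate = ""
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in {"youtube.com", "m.youtube.com"}:
        if parsed.path == "/watch":
            # Standard watch URLs carry the ID in the "v" query parameter.
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        else:
            parts = [p for p in parsed.path.split("/") if p]
            if len(parts) >= 2 and parts[0] in {"embed", "shorts", "v"}:
                candidate = parts[1]

    if not _VIDEO_ID.match(candidate):
        raise ValueError(f"Could not extract a video ID from {url!r}")
    return candidate
```

Parsing the URL first and validating the candidate afterwards keeps look-alike hosts from being accepted just because they contain a `v=` parameter.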

Testing Improvements

Added a FetchedTranscriptSnippetMock class to simulate transcript snippets using the new API format in unit tests.
Updated test fixtures in test_youtube_transcript_component.py to use the new mock class and added a mock for transcript list objects, improving test coverage for the updated component logic.
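For readers who have not seen the test file, a snippet mock of this kind could look roughly like the sketch below. The real `FetchedTranscriptSnippetMock` in the PR may differ; the three fields (`text`, `start`, `duration`) are assumed here to mirror what the component reads from each snippet.

```python
from dataclasses import dataclass


@dataclass
class FetchedTranscriptSnippetMock:
    """Hypothetical test double for one transcript snippet (attribute access)."""

    text: str
    start: float      # seconds from the beginning of the video
    duration: float   # seconds the snippet stays on screen


# Example fixture data built from the mock.
snippets = [
    FetchedTranscriptSnippetMock("First part", 0.0, 1.5),
    FetchedTranscriptSnippetMock("Second part", 1.5, 2.0),
]

# Joining snippet text is the simplest "continuous text" output.
joined = " ".join(s.text for s in snippets)
```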


coderabbitai · 2025-10-30T20:35:52Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Updated youtube-transcript-api dependency to version >=1.0.0,<2.0.0 and refactored the YouTubeTranscriptsComponent to use the new API with video ID extraction from URLs, transcript chunking logic, improved error handling, and comprehensive test coverage.

Changes

Cohort / File(s)	Summary
Dependency Updates `pyproject.toml`	Updated youtube-transcript-api from exact version 0.6.3 to range >=1.0.0,<2.0.0, raising the minimum supported version to 1.0.0 and allowing any later 1.x release.
YouTube Component Implementation `src/backend/base/langflow/initial_setup/starter_projects/Youtube Analysis.json`, `src/lfx/src/lfx/components/youtube/youtube_transcripts.py`	Replaced legacy loader with YouTubeTranscriptApi-based implementation. Added `_extract_video_id()` for URL parsing, `_load_transcripts()` for API retrieval with translation support, and `_chunk_transcript()` for time-based chunking. Updated all output methods (get_dataframe_output, get_message_output, get_data_output) to use new transcript format and include explicit error handling for TranscriptsDisabled, NoTranscriptFound, and retrieval failures.
Test Suite `src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py`	Added FetchedTranscriptSnippetMock class and comprehensive test cases covering video ID extraction, output generation, error scenarios (TranscriptsDisabled, NoTranscriptFound, general exceptions), empty transcripts, transcript chunking with configurable chunk_size_seconds, and translation support.
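The time-based chunking summarized in the table can be sketched as follows. This is an illustration only, not the PR's `_chunk_transcript()`: the function name, the dict-shaped input, and the output keys `start` and `text` are assumptions.

```python
from typing import Iterable


def chunk_transcript(snippets: Iterable[dict], chunk_size_seconds: int = 60) -> list:
    """Group snippets into consecutive windows of roughly chunk_size_seconds."""
    if chunk_size_seconds <= 0:
        raise ValueError("chunk_size_seconds must be positive")

    chunks = []
    current_texts = []
    window_start = 0.0
    for snippet in snippets:
        start = float(snippet["start"])
        # Close the open window once a snippet starts at or past its end.
        if current_texts and start - window_start >= chunk_size_seconds:
            chunks.append({"start": window_start, "text": " ".join(current_texts)})
            current_texts = []
        # The first snippet of a window defines where the window begins.
        if not current_texts:
            window_start = start
        current_texts.append(snippet["text"])

    # Flush whatever is left in the final, possibly shorter, window.
    if current_texts:
        chunks.append({"start": window_start, "text": " ".join(current_texts)})
    return chunks
```

Anchoring each window at its first snippet, rather than at fixed multiples of the chunk size, avoids emitting empty chunks across long silent stretches.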

Sequence Diagram

sequenceDiagram
    participant User
    participant YouTubeComponent
    participant YouTubeTranscriptApi
    
    User->>YouTubeComponent: Provide YouTube URL
    YouTubeComponent->>YouTubeComponent: _extract_video_id(url)
    alt Valid Video ID
        YouTubeComponent->>YouTubeTranscriptApi: list(video_id)
        YouTubeTranscriptApi-->>YouTubeComponent: Available transcripts
        alt Transcript Available
            YouTubeComponent->>YouTubeTranscriptApi: fetch(video_id, language)
            YouTubeTranscriptApi-->>YouTubeComponent: Transcript segments
            alt Translation Requested
                YouTubeComponent->>YouTubeTranscriptApi: fetch with translate
                YouTubeTranscriptApi-->>YouTubeComponent: Translated segments
            end
            YouTubeComponent->>YouTubeComponent: _chunk_transcript(data)
            YouTubeComponent->>YouTubeComponent: Generate outputs (DataFrame/Message/Data)
            YouTubeComponent-->>User: Success response
        else No Transcript Found
            YouTubeComponent-->>User: Error: NoTranscriptFound
        else Transcripts Disabled
            YouTubeComponent-->>User: Error: TranscriptsDisabled
        end
    else Invalid Video ID
        YouTubeComponent-->>User: Error: Invalid URL
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Video ID extraction logic: Review regex patterns for handling multiple YouTube URL formats and invalid URL validation
Transcript chunking algorithm: Verify time-based grouping logic and chunk_size_seconds handling are correct
Error handling differentiation: Ensure TranscriptsDisabled, NoTranscriptFound, and generic exceptions are properly caught and reported across all output methods
Data format compatibility: Confirm both old and new transcript segment formats (dict-like vs object-like) are handled correctly throughout
Translation flow: Validate translate parameter usage and fallback behavior in the new API
Test coverage: Check that test mocks accurately reflect new YouTubeTranscriptApi behavior and that all error paths are covered
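For the data format compatibility item, one possible way to read a field from either a dict-like or an object-like snippet is a small accessor such as the one below. The helper name `snippet_field` is hypothetical and does not come from the PR.

```python
from collections.abc import Mapping
from types import SimpleNamespace
from typing import Any


def snippet_field(snippet: Any, name: str, default: Any = None) -> Any:
    """Read `name` from a snippet that may be a mapping or a plain object."""
    if isinstance(snippet, Mapping):
        return snippet.get(name, default)
    return getattr(snippet, name, default)


# Both shapes yield the same values through the accessor.
as_dict = {"text": "hello", "start": 1.0}
as_object = SimpleNamespace(text="hello", start=1.0)
same_text = snippet_field(as_dict, "text") == snippet_field(as_object, "text")
```

Routing every read through one helper keeps the dict/object distinction out of the output methods, so reviewers only need to check it in one place.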

Suggested labels

bug, size:L, lgtm

Suggested reviewers

jordanrfrazier

Pre-merge checks and finishing touches

❌ Failed checks (1 error, 2 warnings)

Check name	Status	Explanation	Resolution
Test Coverage For New Implementations	❌ Error	The PR introduces significant changes to support youtube-transcript-api v1.0+ but has incomplete test coverage for the regression case identified in the review comment. The reviewer explicitly highlighted that the translation fallback logic fails for videos without English captions and that `api.fetch()` should be replaced with `transcript.fetch()`. Most critically, the review specifically requested "a regression case where no English captions exist" to be added to the test suite. The PR summary shows comprehensive test cases were added for various scenarios, but there is no indication that this specific regression test case covering non-English-only transcripts with translation has been included, which is essential for preventing future regressions of this exact scenario.	The test coverage must be expanded to include a regression test case that explicitly covers translation of transcripts when no English captions are available. This test should verify that the code successfully retrieves a non-English transcript and translates it to the target language. Additionally, the test setup should be updated to mock `transcript.fetch()` and `.translate().fetch()` on Transcript objects rather than `YouTubeTranscriptApi.fetch()` to properly align with the v1.x API behavior and ensure the translation path actually returns translated snippets as requested in the review comment.
Test Quality And Coverage	⚠️ Warning	The tests for the YouTube Transcript API v1.0+ upgrade have significant quality and coverage gaps. Most critically, the tests are mocking at the wrong level and not validating the actual API contract. According to the review comments and the youtube-transcript-api v1.0 documentation, the correct pattern for v1.0+ is to call `transcript.fetch()` on Transcript objects (and `translated_transcript.fetch()` after translation), not `YouTubeTranscriptApi.fetch()`. However, the tests appear to be mocking `YouTubeTranscriptApi.fetch()` directly rather than mocking the `Transcript.fetch()` method that the implementation should be calling. This means the tests could pass even if the implementation doesn't actually work with the real v1.0 API. Furthermore, the tests lack coverage for the critical translation fallback logic mentioned in the review comments—specifically testing scenarios where no English captions exist but other languages do, and where the translation path must work correctly. The tests also do not validate that error handling distinguishes between different failure modes (TranscriptsDisabled vs. NoTranscriptFound vs. other exceptions) or test the regression case where a video has no English captions but can be translated from another language.	Update the tests to mock `Transcript.fetch()` and `.translate().fetch()` instead of `YouTubeTranscriptApi.fetch()` to properly validate the implementation matches the v1.0 API contract. Add specific test cases for the translation fallback logic: scenarios where only non-English captions exist and must be translated, cases where both generated and manual transcripts are available, and regression tests ensuring videos with only auto-generated non-English transcripts can still be translated. Ensure error handling tests comprehensively cover the distinct exception paths (TranscriptsDisabled, NoTranscriptFound, etc.) to validate that the current implementation properly implements the suggested patch from the review comments.
Excessive Mock Usage Warning	⚠️ Warning	The PR summary indicates that new mock classes and fixtures have been introduced (FetchedTranscriptSnippetMock, mock_transcript_list) with patches for YouTubeTranscriptApi and related methods to test various code paths including error scenarios, translation, and chunking. While mocking external API dependencies is generally appropriate for unit tests, the critical review comment reveals a significant implementation issue: the current translation fallback logic is incorrect and the code calls `YouTubeTranscriptApi.fetch()` instead of `transcript.fetch()`. The reviewer explicitly states "Once this is in place, update the unit tests to stub `Transcript.fetch()` (and `.translate().fetch()` in the translation case) instead of `YouTubeTranscriptApi.fetch`", suggesting the current test mock structure may be misaligned with how the API should be used. This indicates that excessive or improperly layered mocking may be masking architectural problems rather than testing the correct implementation logic.	The PR should not be merged in its current state regarding mock usage. The implementation must first be corrected to match the reviewer's suggested patch (proper translation fallback chain and using `transcript.fetch()` instead of `YouTubeTranscriptApi.fetch()`). Once the implementation is fixed, the test mocks must be restructured accordingly to stub the correct API layer (Transcript object methods rather than YouTubeTranscriptApi methods). This will eliminate mock layering issues and ensure tests are validating actual component behavior rather than working around incorrect implementation patterns.
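As an illustration of what stubbing at the Transcript layer might look like, the sketch below wires `unittest.mock.MagicMock` objects so that `.translate().fetch()` is the call that returns data. Everything here is a stand-in: `load_snippets` is a hypothetical, simplified function under test, and `NoTranscriptFound` is redefined locally so the example runs without the library installed.

```python
from unittest.mock import MagicMock


class NoTranscriptFound(Exception):
    """Local stand-in for the library exception of the same name."""


def load_snippets(transcript_list, translation=None):
    """Hypothetical function under test: select, optionally translate, fetch."""
    try:
        transcript = transcript_list.find_transcript(["en"])
    except NoTranscriptFound:
        # Fall back to the first transcript of any language, if there is one.
        transcript = next(iter(transcript_list), None)
        if transcript is None:
            raise
    if translation:
        transcript = transcript.translate(translation)
    return transcript.fetch()


def test_translates_when_no_english_captions():
    """Regression sketch: only Portuguese captions exist, English requested."""
    translated = MagicMock()
    translated.fetch.return_value = [{"text": "hello", "start": 0.0, "duration": 1.0}]

    portuguese = MagicMock()
    portuguese.translate.return_value = translated

    transcript_list = MagicMock()
    transcript_list.find_transcript.side_effect = NoTranscriptFound()
    transcript_list.__iter__.return_value = [portuguese]

    result = load_snippets(transcript_list, translation="en")

    # The translated object, not the original, must be the one fetched.
    portuguese.translate.assert_called_once_with("en")
    translated.fetch.assert_called_once_with()
    portuguese.fetch.assert_not_called()
    assert result[0]["text"] == "hello"
```

Because the assertions name the exact object whose `fetch()` was called, a test shaped like this would fail if the implementation fetched through a different layer.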

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "fix: Upgrade to youtube-transcript-api v1.0+" directly corresponds to the primary change in the changeset: updating the youtube-transcript-api dependency from 0.6.3 to >=1.0.0,<2.0.0 in pyproject.toml. The title is concise, clear, and specific enough that a teammate scanning git history would immediately understand the main change. While the changeset includes substantial refactoring of the YouTubeTranscriptsComponent and tests, these are secondary consequences of supporting the new API version, not the primary driver of the PR. The title accurately captures the root cause and primary intent without being misleading or overly vague.
Docstring Coverage	✅ Passed	Docstring coverage is 96.15% which is sufficient. The required threshold is 80.00%.
Test File Naming And Structure	✅ Passed	The test file `test_youtube_transcript_component.py` follows proper pytest backend testing conventions with the correct `test_*.py` naming pattern in the appropriate directory structure (`src/backend/tests/unit/components/bundles/youtube/`). According to the PR summary, the tests are comprehensively organized with proper fixtures (mock_transcript_list fixture) and mock classes (FetchedTranscriptSnippetMock) for setup, covering both positive and negative scenarios: video ID extraction from various URL formats and invalid URLs, DataFrame/Message/Data outputs, error handling (TranscriptsDisabled, NoTranscriptFound, general exceptions), empty transcript scenarios, chunking behavior with configurable parameters, and translation support. The test cases address edge cases and error conditions as required, with logical organization of related functionality being tested.


codecov · 2025-10-30T20:37:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 31.30%. Comparing base (0cfb4fd) to head (e54ed79).
⚠️ Report is 1 commit behind head on main.

❌ Your project status has failed because the head coverage (39.37%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10456      +/-   ##
==========================================
- Coverage   31.32%   31.30%   -0.02%     
==========================================
  Files        1324     1324              
  Lines       59920    59920              
  Branches     8966     8966              
==========================================
- Hits        18769    18760       -9     
- Misses      40254    40263       +9     
  Partials      897      897

Flag	Coverage Δ
backend	`50.87% <ø> (-0.06%)`	⬇️
frontend	`13.31% <ø> (ø)`
lfx	`39.37% <ø> (ø)`

see 1 file with indirect coverage changes


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/backend/base/langflow/initial_setup/starter_projects/Youtube Analysis.json (1)

1870-2065: Mirror the transcript fix in the serialized component

This JSON embeds the same YouTubeTranscriptsComponent source, so it still has the hard-coded English lookup and api.fetch(...) call. Please apply the identical fallback + transcript.fetch() change here (and keep the string literal in sync with the Python module) so flows created from this starter project get the corrected behavior.
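A check along the following lines could help keep the embedded string in sync with the Python module. It is only a sketch: the assumption that component source lives under a `{"code": {"value": "..."}}` shape somewhere in the starter project JSON is a guess for illustration, and both function names are hypothetical.

```python
import json


def embedded_sources(node):
    """Yield every string stored as {"code": {"value": <str>}} in a JSON tree."""
    if isinstance(node, dict):
        code = node.get("code")
        if isinstance(code, dict) and isinstance(code.get("value"), str):
            yield code["value"]
        for child in node.values():
            yield from embedded_sources(child)
    elif isinstance(node, list):
        for child in node:
            yield from embedded_sources(child)


def is_in_sync(flow_json: str, module_source: str, marker: str) -> bool:
    """True if every embedded source mentioning `marker` equals module_source."""
    matches = [s for s in embedded_sources(json.loads(flow_json)) if marker in s]
    # No match at all counts as out of sync: the component was not found.
    return bool(matches) and all(s == module_source for s in matches)
```

In use, `flow_json` would be the text of the starter project file, `module_source` the text of the Python module, and `marker` the component class name.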

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e24c387 and b05c75b.

📒 Files selected for processing (4)

pyproject.toml (1 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Youtube Analysis.json (3 hunks)
src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py (6 hunks)
src/lfx/src/lfx/components/youtube/youtube_transcripts.py (3 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

src/backend/tests/unit/components/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

src/backend/tests/unit/components/**/*.py: Mirror the component directory structure for unit tests in src/backend/tests/unit/components/
Use ComponentTestBaseWithClient or ComponentTestBaseWithoutClient as base classes for component unit tests
Provide file_names_mapping for backward compatibility in component tests
Create comprehensive unit tests for all new components

Files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

src/backend/tests/unit/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

Test component integration within flows using create_flow, build_flow, and get_build_events utilities

Files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

src/backend/tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/testing.mdc)

src/backend/tests/**/*.py: Unit tests for backend code must be located in the 'src/backend/tests/' directory, with component tests organized by component subdirectory under 'src/backend/tests/unit/components/'.
Test files should use the same filename as the component under test, with an appropriate test prefix or suffix (e.g., 'my_component.py' → 'test_my_component.py').
Use the 'client' fixture (an async httpx.AsyncClient) for API tests in backend Python tests, as defined in 'src/backend/tests/conftest.py'.
When writing component tests, inherit from the appropriate base class in 'src/backend/tests/base.py' (ComponentTestBase, ComponentTestBaseWithClient, or ComponentTestBaseWithoutClient) and provide the required fixtures: 'component_class', 'default_kwargs', and 'file_names_mapping'.
Each test in backend Python test files should have a clear docstring explaining its purpose, and complex setups or mocks should be well-commented.
Test both sync and async code paths in backend Python tests, using '@pytest.mark.asyncio' for async tests.
Mock external dependencies appropriately in backend Python tests to isolate unit tests from external services.
Test error handling and edge cases in backend Python tests, including using 'pytest.raises' and asserting error messages.
Validate input/output behavior and test component initialization and configuration in backend Python tests.
Use the 'no_blockbuster' pytest marker to skip the blockbuster plugin in tests when necessary.
Be aware of ContextVar propagation in async tests; test both direct event loop execution and 'asyncio.to_thread' scenarios to ensure proper context isolation.
Test error handling by mocking internal functions using monkeypatch in backend Python tests.
Test resource cleanup in backend Python tests by using fixtures that ensure proper initialization and cleanup of resources.
Test timeout and performance constraints in backend Python tests using 'asyncio.wait_for' and timing assertions.
Test Langflow's Messag...

Files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

src/backend/**/*component*.py

📄 CodeRabbit inference engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

src/backend/**/components/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

**/{test_*.py,*.test.ts,*.test.tsx}

📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)

**/{test_*.py,*.test.ts,*.test.tsx}: Check if tests have too many mock objects that obscure what’s actually being tested
Warn when mocks are used instead of testing real behavior and interactions
Suggest using real objects or simpler test doubles when mocks become excessive
Ensure mocks are used only for external dependencies, not core business logic
Recommend integration tests when unit tests become overly mocked
Check that test files follow the project’s naming conventions (backend: test_*.py; frontend: *.test.ts/tsx)
Verify that tests actually exercise the new or changed functionality, not placeholder assertions
Test files should have descriptive test function names explaining what is being tested
Organize tests logically with proper setup and teardown
Include edge cases and error conditions for comprehensive coverage
Verify tests cover both positive (success) and negative (failure) scenarios
Ensure tests are not mere smoke tests; they should validate behavior thoroughly
Ensure tests follow the project’s testing frameworks (pytest for backend, Playwright for frontend)

Files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

**/test_*.py

📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)

**/test_*.py: Backend tests must be named test_*.py and use proper pytest structure (fixtures, assertions)
For async backend code, use proper pytest async patterns (e.g., pytest-asyncio)
For API endpoints, include tests for both success and error responses

Files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

🧠 Learnings (5)

📚 Learning: 2025-07-21T14:16:14.125Z

Learnt from: CR
PR: langflow-ai/langflow#0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-07-21T14:16:14.125Z
Learning: Applies to src/backend/tests/**/*.py : Mock external dependencies appropriately in backend Python tests to isolate unit tests from external services.

Applied to files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

📚 Learning: 2025-07-21T14:16:14.125Z

Learnt from: CR
PR: langflow-ai/langflow#0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-07-21T14:16:14.125Z
Learning: Applies to src/backend/tests/**/*.py : Use 'MockLanguageModel' for testing language model components without external API calls in backend Python tests.

Applied to files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

📚 Learning: 2025-08-05T22:51:27.961Z

Learnt from: edwinjosechittilappilly
PR: langflow-ai/langflow#0
File: :0-0
Timestamp: 2025-08-05T22:51:27.961Z
Learning: The TestComposioComponentAuth test in src/backend/tests/unit/components/bundles/composio/test_base_composio.py demonstrates proper integration testing patterns for external API components, including real API calls with mocking for OAuth completion, comprehensive resource cleanup, and proper environment variable handling with pytest.skip() fallbacks.

Applied to files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

📚 Learning: 2025-07-18T18:25:54.486Z

Learnt from: CR
PR: langflow-ai/langflow#0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-07-18T18:25:54.486Z
Learning: Applies to src/backend/tests/unit/components/**/*.py : Use ComponentTestBaseWithClient or ComponentTestBaseWithoutClient as base classes for component unit tests

Applied to files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

📚 Learning: 2025-07-21T14:16:14.125Z

Learnt from: CR
PR: langflow-ai/langflow#0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-07-21T14:16:14.125Z
Learning: Applies to src/backend/tests/**/*.py : When writing component tests, inherit from the appropriate base class in 'src/backend/tests/base.py' (ComponentTestBase, ComponentTestBaseWithClient, or ComponentTestBaseWithoutClient) and provide the required fixtures: 'component_class', 'default_kwargs', and 'file_names_mapping'.

Applied to files:

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py

🧬 Code graph analysis (2)

src/lfx/src/lfx/components/youtube/youtube_transcripts.py (2)

src/lfx/src/lfx/schema/message.py (1)

Message (34-299)

src/lfx/src/lfx/schema/data.py (1)

Data (26-288)

src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py (1)

src/lfx/src/lfx/components/youtube/youtube_transcripts.py (4)

_extract_video_id (51-62)

get_dataframe_output (147-165)

get_message_output (167-179)

get_data_output (181-206)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)

GitHub Check: codecov/project/lfx
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 5
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 3
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 2
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 4
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 1
GitHub Check: Run Backend Tests / Integration Tests - Python 3.10
GitHub Check: Lint Backend / Run Mypy (3.10)
GitHub Check: Lint Backend / Run Mypy (3.11)
GitHub Check: Lint Backend / Run Mypy (3.12)
GitHub Check: Run Frontend Tests / Determine Test Suites and Shard Distribution
GitHub Check: Test Starter Templates
GitHub Check: Update Component Index
GitHub Check: test-starter-projects
GitHub Check: Optimize new Python code in this PR

coderabbitai · 2025-10-30T20:43:42Z

+            if self.translation:
+                # Get any available transcript and translate it
+                transcript = transcript_list.find_transcript(["en"])
+                transcript = transcript.translate(self.translation)
+            else:
+                # Try to get transcript in available languages
+                try:
+                    transcript = transcript_list.find_transcript(["en"])
+                except NoTranscriptFound:
+                    # Try auto-generated English
+                    transcript = transcript_list.find_generated_transcript(["en"])
+
+            # Fetch the transcript data
+            transcript_data = api.fetch(transcript.video_id, [transcript.language_code])
+
+        except (TranscriptsDisabled, NoTranscriptFound) as e:
+            error_type = type(e).__name__
+            msg = (
+                f"Could not retrieve transcripts for video '{video_id}'. "
+                "Possible reasons:\n"
+                "1. This video does not have captions/transcripts enabled\n"
+                "2. The video is private, restricted, or deleted\n"
+                f"\nTechnical error ({error_type}): {e}"
+            )
+            raise RuntimeError(msg) from e
+        except Exception as e:
+            error_type = type(e).__name__
+            msg = (
+                f"Could not retrieve transcripts for video '{video_id}'. "
+                "Possible reasons:\n"
+                "1. This video does not have captions/transcripts enabled\n"
+                "2. The video is private, restricted, or deleted\n"
+                "3. YouTube is blocking automated requests\n"
+                f"\nTechnical error ({error_type}): {e}"
+            )
+            raise RuntimeError(msg) from e
+
+        if as_chunks:
+            # Group into chunks based on chunk_size_seconds
+            return self._chunk_transcript(transcript_data)
+        # Return as continuous text


⚠️ Potential issue | 🔴 Critical

Fix translation fallback and fetching logic

When translation is set we always look for English captions, so videos that only expose (say) Portuguese captions still blow up with NoTranscriptFound even though they can be translated. On top of that we call YouTubeTranscriptApi.fetch(...) after translate(...), but v1.x expects you to call transcript.fetch() on the Transcript (or translated transcript) object—otherwise the translated snippets are never retrieved.

Please select an English transcript if it exists, fall back to generated-English, then fall back to any available transcript before translating, and use transcript.fetch() so the translation path actually returns translated snippets. That change also keeps the non-translation branch working for videos whose only captions are auto-generated in another language. Suggested patch:

-            if self.translation:
-                # Get any available transcript and translate it
-                transcript = transcript_list.find_transcript(["en"])
-                transcript = transcript.translate(self.translation)
-            else:
-                # Try to get transcript in available languages
-                try:
-                    transcript = transcript_list.find_transcript(["en"])
-                except NoTranscriptFound:
-                    # Try auto-generated English
-                    transcript = transcript_list.find_generated_transcript(["en"])
-
-            # Fetch the transcript data
-            transcript_data = api.fetch(transcript.video_id, [transcript.language_code])
+            if self.translation:
+                try:
+                    transcript = transcript_list.find_transcript(["en"])
+                except NoTranscriptFound as en_error:
+                    try:
+                        transcript = transcript_list.find_generated_transcript(["en"])
+                    except NoTranscriptFound:
+                        try:
+                            transcript = transcript_list[0]
+                        except IndexError as exc:  # pragma: no cover
+                            raise en_error from exc
+                transcript = transcript.translate(self.translation)
+            else:
+                try:
+                    transcript = transcript_list.find_transcript(["en"])
+                except NoTranscriptFound as en_error:
+                    try:
+                        transcript = transcript_list.find_generated_transcript(["en"])
+                    except NoTranscriptFound:
+                        try:
+                            transcript = transcript_list[0]
+                        except IndexError as exc:  # pragma: no cover
+                            raise en_error from exc
+
+            transcript_data = transcript.fetch()

Once this is in place, update the unit tests to stub Transcript.fetch() (and .translate().fetch() in the translation case) instead of YouTubeTranscriptApi.fetch, and add a regression case where no English captions exist so we keep this flow covered.
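The selection order described in this comment (English, then generated English, then any available transcript) can also be written once and shared by both branches. The sketch below is an assumption-laden illustration rather than the suggested patch: `select_transcript` and `FakeTranscriptList` are hypothetical, `NoTranscriptFound` is a local stand-in, and the final fallback iterates over the list instead of indexing it, since this sketch only assumes the list object is iterable.

```python
class NoTranscriptFound(Exception):
    """Local stand-in for the library exception of the same name."""


def select_transcript(transcript_list, preferred=("en",)):
    """Try preferred languages, then generated ones, then any transcript."""
    finders = (
        transcript_list.find_transcript,
        transcript_list.find_generated_transcript,
    )
    for finder in finders:
        try:
            return finder(list(preferred))
        except NoTranscriptFound:
            continue
    # Last resort: the first transcript the list yields, in any language.
    for transcript in transcript_list:
        return transcript
    raise NoTranscriptFound("no transcripts available")


class FakeTranscriptList:
    """Hypothetical stand-in exposing only what select_transcript relies on."""

    def __init__(self, transcripts):
        # Each entry is a (language_code, is_generated) tuple.
        self._transcripts = transcripts

    def __iter__(self):
        return iter(self._transcripts)

    def find_transcript(self, codes):
        for entry in self._transcripts:
            if entry[0] in codes:
                return entry
        raise NoTranscriptFound(codes)

    def find_generated_transcript(self, codes):
        for entry in self._transcripts:
            if entry[0] in codes and entry[1]:
                return entry
        raise NoTranscriptFound(codes)
```

With selection factored out, the caller would translate only when a target language is set and then call `fetch()` on whichever object it ends up holding.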

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if self.translation:

try:

transcript = transcript_list.find_transcript(["en"])

except NoTranscriptFound as en_error:

try:

transcript = transcript_list.find_generated_transcript(["en"])

except NoTranscriptFound:

try:

transcript = transcript_list[0]

except IndexError as exc: # pragma: no cover

raise en_error from exc

transcript = transcript.translate(self.translation)

else:

try:

transcript = transcript_list.find_transcript(["en"])

except NoTranscriptFound as en_error:

try:

transcript = transcript_list.find_generated_transcript(["en"])

except NoTranscriptFound:

try:

transcript = transcript_list[0]

except IndexError as exc: # pragma: no cover

raise en_error from exc

transcript_data = transcript.fetch()

except (TranscriptsDisabled, NoTranscriptFound) as e:

error_type = type(e).__name__

msg = (

f"Could not retrieve transcripts for video '{video_id}'. "

"Possible reasons:\n"

"1. This video does not have captions/transcripts enabled\n"

"2. The video is private, restricted, or deleted\n"

f"\nTechnical error ({error_type}): {e}"

)

raise RuntimeError(msg) from e

except Exception as e:

error_type = type(e).__name__

msg = (

f"Could not retrieve transcripts for video '{video_id}'. "

"Possible reasons:\n"

"1. This video does not have captions/transcripts enabled\n"

"2. The video is private, restricted, or deleted\n"

"3. YouTube is blocking automated requests\n"

f"\nTechnical error ({error_type}): {e}"

)

raise RuntimeError(msg) from e

if as_chunks:

# Group into chunks based on chunk_size_seconds

return self._chunk_transcript(transcript_data)

# Return as continuous text
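
The `_chunk_transcript` helper called above is not shown in this excerpt. As a rough illustration only, grouping snippets into fixed time windows might look like the sketch below. The `Snippet` class, the `_field` helper, the `chunk_transcript` function name, and the output dict shape are all hypothetical and are not taken from the PR; the only things assumed from the PR description are a `chunk_size_seconds` setting and snippets that may arrive as objects or dicts.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """Hypothetical stand-in for a fetched transcript snippet."""

    text: str
    start: float
    duration: float


def _field(snippet, name):
    # Accept both attribute-style objects and plain dicts.
    if isinstance(snippet, dict):
        return snippet[name]
    return getattr(snippet, name)


def chunk_transcript(snippets, chunk_size_seconds=60):
    """Group snippets into consecutive windows of chunk_size_seconds."""
    chunks = []
    current_texts = []
    window_start = 0.0
    for snippet in snippets:
        start = _field(snippet, "start")
        # Advance the window until it contains this snippet's start time,
        # flushing any collected text when a window is closed.
        while start >= window_start + chunk_size_seconds:
            if current_texts:
                chunks.append({"start": window_start, "text": " ".join(current_texts)})
                current_texts = []
            window_start += chunk_size_seconds
        current_texts.append(_field(snippet, "text"))
    # Flush the final, possibly partial, window.
    if current_texts:
        chunks.append({"start": window_start, "text": " ".join(current_texts)})
    return chunks
```

Windows that contain no snippets are skipped rather than emitted as empty chunks; whether the real component does the same is not visible from this thread.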

🤖 Prompt for AI Agents

In src/lfx/src/lfx/components/youtube/youtube_transcripts.py around lines 78-118, the logic currently always searches for an English transcript when translation is requested and calls YouTubeTranscriptApi.fetch(...) after translate(), which prevents using transcripts available only in other languages and fails to actually fetch translated snippets; change the selection order to: try find English transcript, then find generated-English, then fall back to any available transcript (transcript_list.find_transcript(None) or equivalent) before translating if self.translation is set; after selecting (and optionally translating) a Transcript object, call transcript.fetch() on that object (or translated_transcript.fetch()) instead of api.fetch(...); finally update unit tests to stub Transcript.fetch() and the translated transcript’s .fetch() (not YouTubeTranscriptApi.fetch), and add a regression test where no English captions exist to cover the fallback-to-any-language-then-translate path.
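
The selection order described in the prompt above (English, then generated English, then any available transcript, then optional translation, then `fetch()` on the selected object) can be sketched roughly as follows. This is a self-contained illustration, not the merged implementation: `FakeTranscript`, `FakeTranscriptList`, and `select_transcript` are hypothetical names, and the local `NoTranscriptFound` class is a stand-in for the library exception so the example runs without network access. The fallback takes the first item by iterating, which assumes only that the transcript list is iterable.

```python
class NoTranscriptFound(Exception):
    """Stand-in for the library exception of the same name."""


class FakeTranscript:
    """Hypothetical test double exposing translate() and fetch()."""

    def __init__(self, language_code, is_generated=False):
        self.language_code = language_code
        self.is_generated = is_generated

    def translate(self, language_code):
        return FakeTranscript(language_code, self.is_generated)

    def fetch(self):
        return [{"text": f"[{self.language_code}] hello", "start": 0.0, "duration": 1.0}]


class FakeTranscriptList:
    """Hypothetical test double for the object returned by listing transcripts."""

    def __init__(self, transcripts):
        self._transcripts = list(transcripts)

    def __iter__(self):
        return iter(self._transcripts)

    def find_transcript(self, language_codes):
        for transcript in self._transcripts:
            if transcript.language_code in language_codes:
                return transcript
        raise NoTranscriptFound(language_codes)

    def find_generated_transcript(self, language_codes):
        for transcript in self._transcripts:
            if transcript.is_generated and transcript.language_code in language_codes:
                return transcript
        raise NoTranscriptFound(language_codes)


def select_transcript(transcript_list, translation=None):
    """Pick English, then generated English, then any transcript; translate last."""
    try:
        transcript = transcript_list.find_transcript(["en"])
    except NoTranscriptFound as en_error:
        try:
            transcript = transcript_list.find_generated_transcript(["en"])
        except NoTranscriptFound:
            # Fall back to whatever language is available.
            transcript = next(iter(transcript_list), None)
            if transcript is None:
                raise en_error
    if translation:
        transcript = transcript.translate(translation)
    return transcript
```

A regression check for the no-English case would build a list containing only, say, a Portuguese transcript, call `select_transcript(..., translation="en")`, and assert on the result of `.fetch()` from the returned object rather than on an API-level fetch.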

erichare

LGTM!

github-actions · 2025-11-04T23:45:41Z

Frontend Unit Test Coverage Report

Coverage Summary

Lines	Statements	Branches	Functions
	14.37% (3866/26885)	7.34% (1506/20507)	8.64% (509/5889)

Unit Test Results

Tests	Skipped	Failures	Errors	Time
1549	0 💤	0 ❌	0 🔥	18.046s ⏱️

* fix youtube transcript api
* [autofix.ci] apply automated fixes
* [autofix.ci] apply automated fixes (attempt 2/3)
* [autofix.ci] apply automated fixes
* [autofix.ci] apply automated fixes (attempt 2/3)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

fix youtube transcript api

b05c75b

Cristhianzl requested a review from edwinjosechittilappilly October 30, 2025 20:35

Cristhianzl self-assigned this Oct 30, 2025

github-actions Bot added the bug Something isn't working label Oct 30, 2025

[autofix.ci] apply automated fixes

8fd8a6e

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Oct 30, 2025

[autofix.ci] apply automated fixes (attempt 2/3)

c058697

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Oct 30, 2025

coderabbitai Bot reviewed Oct 30, 2025

erichare approved these changes Nov 4, 2025

merge fix

af4684f

Cristhianzl enabled auto-merge November 4, 2025 23:39

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Nov 4, 2025

[autofix.ci] apply automated fixes

59ce487

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Nov 4, 2025

[autofix.ci] apply automated fixes (attempt 2/3)

e54ed79

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Nov 4, 2025

Cristhianzl added this pull request to the merge queue Nov 5, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Nov 5, 2025

Cristhianzl added this pull request to the merge queue Nov 5, 2025

Merged via the queue into main with commit 64b593b Nov 5, 2025
81 of 82 checks passed

Cristhianzl deleted the cz/fix-youtube-transcript-version branch November 5, 2025 01:35


fix: Upgrade to youtube-transcript-api v1.0+ #10456
Cristhianzl merged 6 commits into main from cz/fix-youtube-transcript-version
